« Previous - Version 2/3 (diff) - Next » - Current version
Dan Ibanez, 03/03/2015 05:35 pm

RPI Strategy

This page was written by Dan Ibanez to express in detail
the approach and concerns of the RPI group with respect
to portable performance.

Lightweight MPI

I did not coin this term, but it seems to refer to RPI's general
approach to message passing.
I will define this from three key parts:

  1. It supports all communication needs with provably good scaling
  2. It uses a very small subset of MPI1
  3. It can be used to pass messages between things that are not MPI ranks

Lightweight MPI is only a product insofar as it is implemented
in an open-source library called PCU, part of RPI's codes.
Saying MPI+LWMPI is misleading, because it is not an additional level
of parallelism, but rather an approach to message passing
that can transcend levels of parallelism.

The Golden Hammer for Message Passing

Our approach begins at a high level with the sparse data
exchange algorithm called "Non-Blocking Consensus" (NBX) in
this paper:

Hoefler, Torsten, Christian Siebert, and Andrew Lumsdaine.
"Scalable communication protocols for dynamic sparse data exchange."
ACM Sigplan Notices 45.5 (2010): 159-168.

What Hoefler et al. refer to as the Dynamic Sparse Data Exchange (DSDE)
problem shows up a lot in our unstructured codes,
in fact it describes all of our communication needs
(in addition to simple calls like all-reduce).

DSDE is quite simple to state: each rank has a set of messages
with known destinations. Exchange these messages and detect
when they all arrive.

The NBX algorithm is very likely the most efficient solution to this problem.
It is also really simple, but please refer to the paper for
a description.

Minimal Set of MPI Needs

The NBX algorithm solves the DSDE problem using non-blocking
point-to-point calls (NBP2P) and a non-blocking barrier (ibarrier).
NBP2P is available in all MPI implementations.
MPI3 offers an ibarrier, but RPI has implemented one
based on only non-blocking point-to-point calls.
MPI3 also offers neighborhood collectives and graph tools, the
pinnacle of which is MPI_NEIGHBOR_ALLTOALLW.
However, MPI_NEIGHBOR_ALLTOALLW is just an expression of a
restriced DSDE problem and can be solved by the NBX algorithm,
which in turn can be implemented with just MPI1 calls.

In short, everything that is needed from MPI2 or MPI3
can be created from MPI1 calls only.
Of course, the flip side is that the MPI3 versions can
be faster / hardware assisted, but again this
can only bring a constant factor improvement,
it does not change proofs of scalability.

Hybrid Message Passing

RPI went further in that we also implemented all-reduce
and barriers using only NBP2P, but that was just so
we only had to implement inter-thread NBP2P
to get hybrid parallelism in the whole program.

This in turn used MPI_THREAD_MULTIPLE, which people
smarter than me can attest is not optimally scalable within
the node, and may account for our
threading not surpassing MPI, but those same people
have still lost interest in threading after
trying more advanced methods.

Our method does scale as well as theoretically possible,
and is good enough to be used on up to 256K cores with
8 threads per MPI rank by PHASTA.
We have also gotten decent enough performance on
Intel Xeon Phi.

Also, note that the source of the threads is not restricted,
i.e. we could do this using OpenMP threads on manycore machines,
although currently we use pthreads.

Scalable Message Passing

In all of RPI's latest code, there is currently no message-passing
algorithm that is not provably scalable.
Load balancing and higher-level issues like that remain
open problems, and will remain so regardless of
programming model.

Of all the parallel code I have read, the two most common
barriers to scaling are non-scalable algorithms for
global numbering and for DSDE.
MPI_EXSCAN solves the former, and NBX solves the latter.

GPUs with OpenMP

From what I hear, MPI+OpenMP is the suggested model for using upcoming GPU-based
machines, even more so because theoretically it ports across to manycore machines also.
However, that portability is limited in practice to what OMP can actually transform.
This includes code containing for loops without carried dependencies
and code containing parallel sections that operate below the MPI level.
This "below the MPI level" is the key; there is no message passing in OMP.

RPI approached the same problem on manycore machines by implementing
message passing between the threads, and if such a thing could be done
again for GPU cores then it would might enable us to use GPUs through OpenMP.
Still, doing this naively sounds like it would not be very performant
(same story as threading, only worse).

Dispatching the easy-to-transform pieces of the algorithm to the GPU,
sounds like an easier approach.
This keeps the overall program at the host CPU level
and treats GPU usage as a subroutine call.
It also eases interoperability because libraries simply agree on the
lowest common denominator which is MPI on the host CPU and then
each code offloads the appropriate collectives.

For RPI, the challenge is in finding anything in our code which
can be transformed and offloaded to a GPU.