Distributed Memory
PSyclone supports the generation of code for distributed memory machines. When this option is switched on, PSyclone takes on responsibility for both performance and correctness, as described below.
Correctness
PSyclone is responsible for adding appropriate distributed memory communication calls to the PSy layer to ensure that the distributed memory code runs correctly. For example, a stencil operation will require halo exchanges between the different processes.
The burden of correctly placing distributed memory communication calls has traditionally been born by the user. However, PSyclone is able to determine the placing of these within the PSy-layer, thereby freeing the user from this responsibility. Thus, the Algorithm and Kernel code remain the same, irrespective of whether the target architecture does or does not require a distributed memory solution.
Performance
PSyclone adds HaloExchange and GlobalSum objects to the generated PSyIR InvokeSchedule at the required locations. The halo-exchange and global-sum objects are exposed here for the purposes of optimisation. For example the halo-exchange and/or global-sum objects may be moved in the schedule (via appropriate transformations) to enable overlap of computation with communication.
Note
When these optimisations are implemented, add a reference to the Transformations Section.
A halo exchange is required with distributed memory when a processor requires data from its halo and the halo information is out of date. One example is where a field is written to and then read using a stencil access. Halo exchanges have performance implications so should only be used where necessary.
A global sum is required with distributed memory when a scalar is written to. Global sums can have performance implications so should only be used where necessary. Global sums currently only occur in certain Built-in kernels. The description of Built-ins indicates when this is the case.
Implementation
Within the contents of an invoke()
call, PSyclone is able to
statically determine which communication calls are required and where
they should be placed. However, PSyclone has no information on what
happens outside invoke()
calls and thus is unable to statically
determine whether communication is required between these calls. The
solution we use is to add run-time flags in the PSy layer to keep
track of whether data has been written to and read from. These flags
are then used to determine whether communication calls are required upon
entry to an invoke()
.
Control
Support for distributed memory can be switched on or off with the
default being on. The default can be changed permanently by modifying
the DISTRIBUTED_MEMORY
variable in the psyclone.cfg
configuration
file to false
(see Configuration).
Distributed memory can be switched on or off from the psyclone
script using the -dm
/--dist_mem
or -nodm
/--no_dist_mem
flags, respectively.
For interactive access, the distributed memory option can be changed
interactively from the PSyFactory
class by setting the optional
distributed_memory
flag; for example:
psy = PSyFactory(api=api, distributed_memory=False)
Similarly the distributed memory option can be changed interactively
from the generate
function by setting the optional
distributed_memory
flag, e.g.:
psy, alg = generate("file.f90", distributed_memory=False).
Status
Distributed memory support is currently supported by the dynamo0.3
and
the gocean1p0
APIs. The remaining APIs ignore the distributed memory flag
and continue to produce code without any distributed memory functionality,
irrespective of its value.