Profiling
PSyclone has the ability to define regions that can be profiled with various performance measurement tools. The profiling can be enabled automatically using command line parameters like:
psyclone --profile kernels ...
Or, for finer-grained control, it may be applied via a profiling transformation within a transformation script.
PSyclone can be used with a variety of existing profiling tools. It currently supports dl_timer, TAU, Dr Hook, the NVIDIA GPU profiling tools and it comes with a simple stand-alone timer library. The PSyData API (see also the Developer Guide) is utilised to implement wrapper libraries that connect the PSyclone application to the profiling libraries. Certain adjustments to the application’s build environment are required:
The compiler needs to be able to find the module files for the wrapper of the selected profiling library.
The application needs to be linked with the wrapper library that interfaces between the PSyclone API and the tool-specific API.
The tool-specific library also needs to be linked in.
It is the responsibility of the user to supply the corresponding compiler command line options when building the application that incorporates the PSyclone-generated code.
Interface to Third Party Profiling Tools
PSyclone comes with wrapper libraries to support
usage of TAU, Dr Hook, dl_timer, NVTX (NVIDIA Tools Extension library),
and a simple non-thread-safe timing library. Support for further
profiling libraries will be added in the future. To compile the
wrapper libraries, change into the directory lib/profiling
of PSyclone and type make
to compile all wrappers. If only
some of the wrappers are required, you can either use
make wrapper-name
(e.g. make drhook
), or change
into the corresponding directory and use make
. The
corresponding README.md
files contain additional parameters
that can be set in order to find third party profiling tools.
Below are short descriptions of each of the various wrapper libraries that come with PSyclone:
lib/profiling/template
This is a simple library that just prints out the name as regions are entered and exited. It could act as a template to develop new wrapper libraries, hence its name.
lib/profiling/simple_timing
This is a simple, stand-alone library that uses Fortran system calls to measure the execution time, and reports average, minimum and maximum execution time for all regions. It is not MPI aware (i.e. it will just report independently for each MPI process), and not thread-safe.
lib/profiling/dl_timer
This wrapper uses the apeg-dl_timer library. In order to use this wrapper, you must download and install the dl_timer library from
https://bitbucket.org/apeg/dl_timer
. This library has various compile-time options and may be built with MPI or OpenMP support. Additional link options might therefore be required (e.g. enabling OpenMP, or linking with MPI).lib/profiling/tau
This wrapper uses TAU profiling and tracing toolkit. It can be downloaded from
https://www.cs.uoregon.edu/research/tau
.lib/profiling/drhook
This wrapper uses the Dr Hook library. You need to contact ECMWF to obtain a copy of Dr Hook.
lib/profiling/nvidia
This is a wrapper library that maps the PSyclone profiling API to the NVIDIA Tools Extension library (NVTX). This library is available from
https://developer.nvidia.com/cuda-toolkit
.lib/profiling/lfric_timer
This profile wrapper uses the timer functionality provided by LFRic, and it comes in two different versions:
libpsy_lfric_timer.a
This library just contains the PSyData wrapper, but not the actual timer code. It must therefore be linked with the LFRic infrastructure library. It is meant to be used by LFRic only.libpsy_lfric_timer_standalone.a
This library contains the LFRic timer object and its dependencies. It can be used standalone (i.e. without LFRic) with any program. A runnable example using a GOcean code is included inexamples/gocean/eg5/profile
.
The LFRic timer writes its output to a file called
timer.txt
in the current directory, and will overwrite this file if it should already exist.
Any user can create similar wrapper libraries for other profiling tools by providing a corresponding Fortran module. The functions that need to be implemented are described in the developer’s guide (psy_data).
Most libraries in lib/profiling
need to be linked in
with the corresponding 3rd party profiling tool, or use a compiler
wrapper provided by the tool which will provide the required additional
compiler parameters. The exceptions are the template and simple_timing
libraries, which are stand alone. The profiling example in
examples/gocean/eg5/profile
can be used with any of the
wrapper libraries (except nvidia
) to see how they work.
Required Modifications to the Program
In order to guarantee that any profiling library is properly initialised, PSyclone’s profiling wrappers utilise two additional function calls that the user must manually insert into the program:
profile_PSyDataInit()
This method needs to be called once to initialise the profiling tool. At this stage this call is not automatically inserted by PSyclone, so it is the responsibility of the user to add the call to an appropriate location in the application:
use profile_psy_data_mod, only : profile_PSyDataInit
...
call profile_PSyDataInit()
The “appropriate” location might depend on the profiling library used.
For example, it might be necessary to invoke this before or after
a call to MPI_Init()
.
profile_PSyDataShutdown()
At the end of the program the function profile_PSyDataShutdown()
must be called.
It will make sure that the measurements are printed, files are flushed,
and that the profiling tool is closed correctly. Again at
this stage it is necessary to manually insert the call at an appropriate
location:
use profile_psy_data_mod, only : profile_PSyDataShutdown
...
call profile_PSyDataShutdown()
And again the appropriate location might depend on the profiling library
used (e.g. before or after a call to MPI_Finalize()
).
Profiling Command-Line Options
PSyclone offers two command-line options to automatically instrument code with profiling regions. It can create profile regions around a full invoke routine (including all kernel calls in this invoke), and/or around each individual kernel (for the PSyKAl APIs ‘dynamo0.3’ and ‘gocean1.0’).
The option --profile invokes
will automatically add calls to
start and end a profile region at the beginning and end of every
invoke subroutine created by PSyclone. All kernels called within
this invoke subroutine will be included in the profiled region.
The option --profile routines
is a synonym for ‘invokes’ but is
provided as it is more intuitive for users who are transforming
existing code. (In this case, PSyclone will put a profiling region
around every routine that it processes.)
The option --profile kernels
will surround each outer loop
created by PSyclone with start and end profiling calls. Note that this
option is not available for the ‘nemo’ API as it does not have the
concept of explicit Kernels.
Note
In some APIs (for example LFRic
when using distributed memory) additional minor code might
get included in a profiled kernel section, for example
setDirty()
calls (expensive calls like HaloExchange
are excluded).
Note
If the kernels
option is used in combination with an
optimisation script that introduces OpenACC then profiling
calls are automatically excluded from within OpenACC
regions (since the PSyData wrappers are not compiled for
GPU execution).
Note
It is still the responsibility of the user to manually
add the calls to profile_PSyDataInit
and
profile_PSyDataShutdown
to the
code base (see Required Modifications to the Program).
PSyclone will modify the schedule of each invoke to insert the profiling regions. Below we show an example of a schedule created when instrumenting invokes - all children of a Profile-Node will be part of the profiling region, including all loops created by PSyclone and all kernel calls (note that for brevity, the nodes holding the loop bounds have been omitted for all but the first loop):
GOInvokeSchedule[invoke='invoke_1']
0: [Profile]
Schedule[]
0: Loop[type='outer',field_space='go_cu',it_space='go_internal_pts']
Literal[value:'2']
Literal[value:'jstop']
Literal[value:'1']
Schedule[]
0: Loop[type='inner',field_space='go_cu',
it_space='go_internal_pts']
...
Schedule[]
0: CodedKern compute_unew_code(unew_fld,uold_fld,z_fld,
cv_fld,h_fld,tdt,dy) [module_inline=False]
1: Loop[type='outer',field_space='cv',it_space='internal_pts']
...
Schedule[]
0: Loop[type='inner',field_space='cv',it_space='internal_pts']
...
Schedule[]
0: CodedKern compute_vnew_code(vnew_fld,vold_fld,z_fld,
cu_fld,h_fld,tdt,dy) [module_inline=False]
2: Loop[type='outer',field_space='ct',it_space='internal_pts']
...
Schedule[]
0: Loop[type='inner',field_space='ct',it_space='internal_pts']
...
Schedule[]
0: CodedKern compute_pnew_code(pnew_fld,pold_fld,cu_fld,
cv_fld,tdt,dx,dy) [module_inline=False]
And now the same schedule when instrumenting kernels. In this case each loop nest and kernel call will be contained in a separate region:
GOInvokeSchedule[invoke='invoke_1']
0: [Profile]
Schedule[]
0: Loop[type='outer',field_space='go_cu',it_space='go_internal_pts']
...
Schedule[]
0: Loop[type='inner',field_space='go_cu',
it_space='go_internal_pts']
...
Schedule[]
0: CodedKern compute_unew_code(unew_fld,uold_fld,z_fld,
cv_fld,h_fld,tdt,dy) [module_inline=False]
1: [Profile]
Schedule[]
0: Loop[type='outer',field_space='go_cv',it_space='go_internal_pts']
...
Schedule[]
0: Loop[type='inner',field_space='go_cv',
it_space='go_internal_pts']
...
Schedule[]
0: CodedKern compute_vnew_code(vnew_fld,vold_fld,z_fld,
cu_fld,h_fld,tdt,dy) [module_inline=False]
2: [Profile]
Schedule[]
0: Loop[type='outer',field_space='go_ct',it_space='go_internal_pts']
...
Schedule[]
0: Loop[type='inner',field_space='go_ct',
it_space='go_internal_pts']
...
Schedule[]
0: CodedKern compute_pnew_code(pnew_fld,pold_fld,
cu_fld,cv_fld,tdt,dx,dy) [module_inline=False]
Both options can be specified at the same time:
GOInvokeSchedule[invoke='invoke_1']
0: [Profile]
Schedule[]
0: [Profile]
Schedule[]
0: Loop[type='outer',field_space='go_cu',
it_space='go_internal_pts']
...
Schedule[]
0: Loop[type='inner',field_space='go_cu',
it_space='go_internal_pts']
...
Schedule[]
0: CodedKern compute_unew_code(unew_fld,uold_fld,
...) [module_inline=False]
1: [Profile]
Schedule[]
0: Loop[type='outer',field_space='go_cv',
it_space='go_internal_pts']
...
Schedule[]
0: Loop[type='inner',field_space='go_cv',
it_space='go_internal_pts']
...
Schedule[]
0: CodedKern compute_vnew_code(vnew_fld,vold_fld,
...) [module_inline=False]
2: [Profile]
Schedule[]
0: Loop[type='outer',field_space='go_ct',
it_space='go_internal_pts']
...
Schedule[]
0: Loop[type='inner',field_space='go_ct',
it_space='go_internal_pts']
...
Schedule[]
0: CodedKern compute_pnew_code(pnew_fld,pold_fld,
...) [module_inline=False]
Profiling in Scripts - ProfileTrans
The greatest flexibility is achieved by using the profiler transformation explicitly in a transformation script. The script takes either a single PSyIR Node or a list of PSyIR Nodes as argument, and will insert a Profile Node into the PSyIR, with the specified nodes as children. At code creation time the listed children will all be enclosed in one profile region. As an example:
from psyclone.psyir.transformations import ProfileTrans
p_trans = ProfileTrans()
schedule = psy.invokes.get('invoke_0').schedule
print(schedule.view())
# Enclose some children within a single profile region
p_trans.apply(schedule.children[1:3])
print(schedule.view())
The profiler transformation also allows the profile name to be set explicitly, rather than being automatically created (see Naming Profiling Regions for details). This allows for potentially more intuitive names or finer grain control over profiling (as particular regions could be provided with the same profile names). For example:
invoke = psy.invokes.invoke_list[0]
schedule = invoke.schedule
profile_trans = ProfileTrans()
# Use the actual psy-layer module and subroutine names.
options = {"region_name": (psy.name, invoke.name)}
profile_trans.apply(schedule.children, options=options)
# Use own names and repeat for different regions to aggregate profile.
options = {"region_name": ("my_location", "my_region")}
profile_trans.apply(schedule[0].children[1:2], options=options)
profile_trans.apply(schedule[0].children[5:7], options=options)
Warning
If “region_name” is misspelt in the options dictionary then the option will be silently ignored. This is true for all options. Issue #613 captures this problem.
Warning
It is the responsibility of the user to make sure that a profile region is only created inside a multi-threaded region if the profiling library used is thread-safe!
Naming Profiling Regions
A profile region derives its name from two components:
- module_name
A string identifying the psy-layer containing this profile node.
- region_name
A string identifying the invoke containing this profile node and its location within the invoke (where necessary).
By default PSyclone will generate appropriate names to uniquely determine a particular region. Since those names can be somewhat cryptic, alternative names can be specified by the user when adding profiling via a transformation script, see Passing Parameters From the User to the Node Constructor.
The automatic name generation depends on the API according to the following rules:
For the NEMO API,
the module_name string is set to the name of the parent function/subroutine/program. This name is unique as Fortran requires these names to be unique within a program.
the region_name is set to an r (standing for region) followed by an integer which uniquely identifies the profile within the parent function/subroutine/program (based on the profile node’s position in the PSyIR representation relative to any other profile nodes).
For the LFRic (Dynamo0.3) and GOcean1.0 APIs,
the module_name string is set to the module name of the generated PSy-layer. This name should be unique by design (otherwise module names would clash when compiling).
the region_name is set to the name of the invoke in which it resides, followed by a : and a kernel name if the profile region contains a single kernel, and is completed by :r (standing for region) followed by an integer which uniquely identifies the profile within the invoke (based on the profile node’s position in the PSyIR representation relative to any other profile nodes). For example:
InvokeSchedule[invoke='invoke_0', dm=True] 0: Profile[] Schedule[] 0: Profile[] Schedule[] 0: HaloExchange[field='f2', type='region', depth=1, check_dirty=True] 1: HaloExchange[field='m1', type='region', depth=1, check_dirty=True] 2: HaloExchange[field='m2', type='region', depth=1, check_dirty=True] 1: Profile[] Schedule[] 0: Loop[type='', field_space='w1', it_space='cells', upper_bound='cell_halo(1)'] Literal[value:'1', DataType.INTEGER] Literal[value:'mesh%get_last_halo_cell(1)', DataType.INTEGER] Literal[value:'1', DataType.INTEGER] Schedule[] 0: CodedKern testkern_code(a,f1,f2,m1,m2) [module_inline=False] 1: Profile[] Schedule[] 0: Loop[type='', field_space='w1', it_space='cells', upper_bound='cell_halo(1)'] Literal[value:'1', DataType.INTEGER] Literal[value:'mesh%get_last_halo_cell(1)', DataType.INTEGER] Literal[value:'1', DataType.INTEGER] Schedule[] 0: CodedKern testkern_code(a,f1,f2,m1,m2) [module_inline=False] 2: Loop[type='', field_space='w1', it_space='cells', upper_bound='cell_halo(1)'] Literal[value:'1', DataType.INTEGER] Literal[value:'mesh%get_last_halo_cell(1)', DataType.INTEGER] Literal[value:'1', DataType.INTEGER] Schedule[] 0: CodedKern testkern_qr_code(f1,f2,m1,a,m2,istp) [module_inline=False]
This is the code created for this example:
MODULE container
CONTAINS
SUBROUTINE invoke_0(a, f1, f2, m1, m2, istp, qr)
...
CALL psy_data_3%PreStart("multi_functions_multi_invokes_psy", "invoke_0:r0", &
0, 0)
CALL psy_data%PreStart("multi_functions_multi_invokes_psy", "invoke_0:r1", 0, 0)
IF (f2_proxy%is_dirty(depth=1)) THEN
CALL f2_proxy%halo_exchange(depth=1)
END IF
IF (m1_proxy%is_dirty(depth=1)) THEN
CALL m1_proxy%halo_exchange(depth=1)
END IF
IF (m2_proxy%is_dirty(depth=1)) THEN
CALL m2_proxy%halo_exchange(depth=1)
END IF
CALL psy_data%PreEnd()
CALL psy_data_1%PreStart("multi_functions_multi_invokes_psy", "invoke_0:r2", &
0, 0)
DO cell=1,mesh%get_last_halo_cell(1)
CALL testkern_code(...)
END DO
...
CALL psy_data_2%PreStart("multi_functions_multi_invokes_psy", &
"invoke_0:testkern_code:r3", 0, 0)
DO cell=1,mesh%get_last_halo_cell(1)
CALL testkern_code(...)
END DO
...
CALL psy_data_2%PostEnd()
CALL psy_data_1%PostEnd()
...
DO cell=1,mesh%get_last_halo_cell(1)
CALL testkern_qr_code(...)
END DO
...
CALL psy_data_3%PostEnd()
...
END SUBROUTINE invoke_0
END MODULE container