PSy Kernel Extractor (PSyKE)#

Introduction#

PSyclone can mark regions of PSyclone-conformant code to be extracted and run as a stand-alone application. This capability, called PSyKE (PSy Kernel Extractor), is useful for benchmarking parts of a model, such as LFRic, without needing the model's infrastructure.

Usage#

Code extraction is currently enabled by applying an ExtractTrans transformation to selected Nodes in a user script (see the PSyclone User Scripts section for more details). This transformation is sub-classed into API-specific implementations, LFRicExtractTrans and GOceanExtractTrans. Both insert an instance of ExtractNode into the Schedule of a specific Invoke, and all Nodes marked for extraction become children of that ExtractNode. For example, a transformation script that extracts the first Kernel call in the LFRic API test example 15.1.2_builtin_and_normal_kernel_invoke.f90 would be written as:

from psyclone.domain.lfric.transformations import LFRicExtractTrans

# Get instance of the LFRicExtractTrans transformation
etrans = LFRicExtractTrans()

# Get Invoke and its Schedule
invoke = psy.invokes.get("invoke_0")
schedule = invoke.schedule

# Apply extract transformation to the selected Node
etrans.apply(schedule.children[2])
print(schedule.view())

and called as:

> psyclone -nodm -s ./extract_single_node.py \
    <path-to-example>/15.1.2_builtin_and_normal_kernel_invoke.f90

PSyclone modifies the Schedule of the selected invoke_0:

Schedule[invoke='invoke_0' dm=False]
    0: Loop[type='dofs',field_space='any_space_1',it_space='dofs',
            upper_bound='ndofs']
        Reference[name:'loop0_start']
        Reference[name:'loop0_stop']
        Literal[value:'1']
        Schedule[]
            0: BuiltIn setval_c(f5,0.0)
    1: Loop[type='dofs',field_space='any_space_1',it_space='dofs',
            upper_bound='ndofs']
        ...
        Schedule[]
            0: BuiltIn setval_c(f2,0.0)
    2: Loop[type='',field_space='w2',it_space='cells', upper_bound='ncells']
        ...
        Schedule[]
            0: CodedKern testkern_code_w2_only(f3,f2) [module_inline=False]
    3: Loop[type='',field_space='wtheta',it_space='cells', upper_bound='ncells']
        ...
        Schedule[]
            0: CodedKern testkern_wtheta_code(f4,f5) [module_inline=False]
    4: Loop[type='',field_space='w1',it_space='cells', upper_bound='ncells']
        ...
        Schedule[]
            0: CodedKern testkern_code(scalar,f1,f2,f3,f4) [module_inline=False]

to insert the extract region. As shown below, all children of an ExtractNode will be part of the region:

Schedule[invoke='invoke_0' dm=False]
    0: Loop[type='dofs',field_space='any_space_1',it_space='dofs',
            upper_bound='ndofs']
        ...
        Schedule[]
            0: BuiltIn setval_c(f5,0.0)
    1: Loop[type='dofs',field_space='any_space_1',it_space='dofs',
            upper_bound='ndofs']
        ...
        Schedule[]
            0: BuiltIn setval_c(f2,0.0)
    2: Extract
        Schedule[]
            0: Loop[type='',field_space='w2',it_space='cells', upper_bound='ncells']
                ...
                Schedule[]
                    0: CodedKern testkern_code_w2_only(f3,f2) [module_inline=False]
    3: Loop[type='',field_space='wtheta',it_space='cells', upper_bound='ncells']
        ...
        Schedule[]
            0: CodedKern testkern_wtheta_code(f4,f5) [module_inline=False]
    4: Loop[type='',field_space='w1',it_space='cells', upper_bound='ncells']
        ...
        Schedule[]
            0: CodedKern testkern_code(scalar,f1,f2,f3,f4) [module_inline=False]

To extract multiple Nodes, ExtractTrans can be applied to a list of Nodes (subject to the General restrictions described in the Restrictions section).

# Apply extract transformation to the selected Nodes
etrans.apply(schedule.children[1:3])

This modifies the Schedule shown above as follows:

...
    Extract
        Schedule[]
            0: Loop[type='dofs',field_space='any_space_1',it_space='dofs',
                    upper_bound='ndofs']
                ...
                Schedule[]
                    0: BuiltIn setval_c(f2,0.0)
            1: Loop[type='',field_space='w2',it_space='cells', upper_bound='ncells']
                ...
                Schedule[]
                    0: CodedKern testkern_code_w2_only(f3,f2) [module_inline=False]
...

The ExtractNode class uses dependency analysis to detect which variables are input parameters and which are output parameters. The lists of variables are then passed to the PSyDataNode, which is the base class of ExtractNode (details of the PSyDataNode can be found in The PSyData Transformations). This node then creates the actual code, as in the following LFRic example:

! ExtractStart
!
CALL extract_psy_data%PreStart("testkern_mod", "testkern_code", 4, 2)
CALL extract_psy_data%PreDeclareVariable("a", a)
CALL extract_psy_data%PreDeclareVariable("f2", f2)
CALL extract_psy_data%PreDeclareVariable("m1", m1)
CALL extract_psy_data%PreDeclareVariable("m2", m2)
CALL extract_psy_data%PreDeclareVariable("map_w1", map_w1)
...
CALL extract_psy_data%PreDeclareVariable("undf_w3", undf_w3)
CALL extract_psy_data%PreDeclareVariable("f1_post", f1)
CALL extract_psy_data%PreDeclareVariable("cell_post", cell)
CALL extract_psy_data%PreEndDeclaration
CALL extract_psy_data%ProvideVariable("a", a)
CALL extract_psy_data%ProvideVariable("f2", f2)
CALL extract_psy_data%ProvideVariable("m1", m1)
CALL extract_psy_data%ProvideVariable("m2", m2)
CALL extract_psy_data%ProvideVariable("map_w1", map_w1)
...
CALL extract_psy_data%ProvideVariable("undf_w3", undf_w3)
CALL extract_psy_data%PreEnd
DO cell=1,f1_proxy%vspace%get_ncell()
  !
  CALL testkern_code(nlayers, a, f1_proxy%data, f2_proxy%data,  &
       m1_proxy%data, m2_proxy%data, ndf_w1, undf_w1,           &
       map_w1(:,cell), ndf_w2, undf_w2, map_w2(:,cell), ndf_w3, &
       undf_w3, map_w3(:,cell))
END DO
CALL extract_psy_data%PostStart
CALL extract_psy_data%ProvideVariable("cell_post", cell)
CALL extract_psy_data%ProvideVariable("f1_post", f1)
CALL extract_psy_data%PostEnd
!
! ExtractEnd

The PSyData API relies on generic Fortran interfaces to provide the field-type-specific implementations of ProvideVariable. This means that a different version of the external PSyData library that PSyKE uses must be supplied for each PSyclone API.
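The split into input and output variables that drives the PreDeclareVariable and ProvideVariable calls can be illustrated with a minimal sketch that does not depend on PSyclone. It assumes a hypothetical, simplified access list of (name, mode) pairs in execution order, which is not how PSyclone represents accesses internally. In this sketch a variable counts as an input if its first access is a read, and as an output if it is written anywhere in the region:

```python
def classify_variables(accesses):
    """Split variables into inputs and outputs.

    `accesses` is a hypothetical stand-in for the result of a dependency
    analysis: a list of (name, mode) pairs in execution order, where mode
    is either "read" or "write".
    """
    first_access = {}
    written = set()
    for name, mode in accesses:
        # Remember only the first access mode seen for each variable.
        first_access.setdefault(name, mode)
        if mode == "write":
            written.add(name)
    inputs = sorted(n for n, m in first_access.items() if m == "read")
    outputs = sorted(written)
    return inputs, outputs


# Accesses loosely modelled on the testkern_code example (illustrative only).
accesses = [("a", "read"), ("f2", "read"), ("m1", "read"),
            ("f1", "write"), ("cell", "write")]
inputs, outputs = classify_variables(accesses)
print(inputs)                                # ['a', 'f2', 'm1']
print([name + "_post" for name in outputs])  # ['cell_post', 'f1_post']
```

The outputs carry the _post suffix in the same way as the f1_post and cell_post entries in the generated code.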

Extraction can also be performed on optimised code. For example, the following transformation script first adds an !$OMP PARALLEL DO directive and then extracts the optimised code in the LFRic API test example 15.1.2_builtin_and_normal_kernel_invoke.f90:

from psyclone.domain.lfric.transformations import LFRicExtractTrans
from psyclone.transformations import LFRicOMPParallelLoopTrans

# Get instances of the transformations
etrans = LFRicExtractTrans()
otrans = LFRicOMPParallelLoopTrans()

# Get Invoke and its Schedule
invoke = psy.invokes.get("invoke_0")
schedule = invoke.schedule

# Add OMP PARALLEL DO directives
otrans.apply(schedule.children[1])
otrans.apply(schedule.children[2])
# Apply extract transformation to the selected Nodes
etrans.apply(schedule.children[1:3])
print(schedule.view())

The generated code is now:

! ExtractStart
CALL extract_psy_data%PreStart("unknown-module", "setval_c", 0, 4)
CALL extract_psy_data%PreDeclareVariable("cell_post", cell)
CALL extract_psy_data%PreDeclareVariable("df_post", df)
CALL extract_psy_data%PreDeclareVariable("f2_post", f2)
CALL extract_psy_data%PreDeclareVariable("f3_post", f3)
...
CALL extract_psy_data%PreEndDeclaration
...
CALL extract_psy_data%PreEnd
!
!$omp parallel do default(shared), private(df), schedule(static)
DO df=1,undf_aspc1_f2
  f2_proxy%data(df) = 0.0
END DO
!$omp end parallel do
!$omp parallel do default(shared), private(cell), schedule(static)
DO cell=1,f3_proxy%vspace%get_ncell()
  !
  CALL testkern_code_w2_only(nlayers, f3_proxy%data, f2_proxy%data, ndf_w2, undf_w2, map_w2(:,cell))
END DO
!$omp end parallel do
CALL extract_psy_data%PostStart
CALL extract_psy_data%ProvideVariable("cell_post", cell)
CALL extract_psy_data%ProvideVariable("df_post", df)
CALL extract_psy_data%ProvideVariable("f2_post", f2)
CALL extract_psy_data%ProvideVariable("f3_post", f3)
CALL extract_psy_data%PostEnd
!
! ExtractEnd

The examples in the examples/lfric/eg12 directory demonstrate how to apply code extraction using PSyclone transformation scripts (see the LFRic Examples section for more information). The code in examples/lfric/eg17/full_example_extract can be compiled and run, and it will create two kernel data files.

Restrictions#

Code extraction can be applied to unoptimised or optimised code. There are restrictions that check for correctness of optimising transformations when extraction is applied, as well as restrictions that eliminate dependence on the specific model infrastructure.

General#

This group of restrictions is enforced irrespective of whether optimisations are used or not.

Extraction can be applied to a single Node or a list of Nodes in a Schedule. For the latter, Nodes in the list must be consecutive children of the same parent Schedule.
Extraction cannot be applied to an ExtractNode or a Node list that already contains one (otherwise we would have an extract region within another extract region).
A Kernel or a Built-In call cannot be extracted without its parent Loop.
The extraction code writes variables that are used from other modules to the kernel data file, and the driver reads these values in. However, if such a variable is declared private or protected, its value cannot be written to the file and compilation will abort. The only solution is to modify the file that declares the variable and make all variables public.
- The new build system FAB will be able to remove private and protected declarations in any source files, meaning no manual modification of files is required anymore (TODO #2536).
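The first two restrictions above can be sketched as a small validation routine. This is an illustration only: the Node class below is a hypothetical stand-in and not PSyclone's own Node, and the checks are simplified versions of what a transformation would validate before inserting an extract region.

```python
class Node:
    """Hypothetical minimal tree node (not PSyclone's Node class)."""

    def __init__(self, name, children=None):
        self.name = name
        self.parent = None
        self.children = list(children or [])
        for child in self.children:
            child.parent = self

    def walk(self):
        """Yield this node and all of its descendants."""
        yield self
        for child in self.children:
            yield from child.walk()


def check_extract_region(nodes):
    """Raise ValueError if `nodes` breaks one of the sketched restrictions."""
    if not nodes:
        raise ValueError("no nodes given")
    parent = nodes[0].parent
    if parent is None or any(node.parent is not parent for node in nodes):
        raise ValueError("nodes must be children of the same parent")
    positions = [parent.children.index(node) for node in nodes]
    if positions != list(range(positions[0], positions[0] + len(nodes))):
        raise ValueError("nodes must be consecutive")
    for node in nodes:
        if any(n.name == "Extract" for n in node.walk()):
            raise ValueError("region already contains an extract region")


loops = [Node("Loop", [Node("Kern")]) for _ in range(4)]
schedule = Node("Schedule", loops)
check_extract_region(loops[1:3])        # consecutive siblings: accepted
try:
    check_extract_region([loops[0], loops[2]])
except ValueError as err:
    print(err)                          # nodes must be consecutive
```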

Distributed memory#

Kernel extraction for distributed memory is supported insofar as each process writes its own output file, with its rank added to the output file name, so each kernel and each rank produce one file. Several consecutive kernels can be extracted, but there must be no halo exchange calls between them; the extraction transformation tests for this and raises an exception if one is found. The compiled driver program accepts the name of the extracted kernel file as a command line parameter. If this is not specified, it will use the default name (module-region without a rank).

Shared memory and API-specific#

The ExtractTrans transformation cannot be applied to:

A Loop without its parent Directive,
An orphaned Directive (e.g. OMPDoDirective, ACCLoopDirective) without its parent Directive (e.g. ACC or OMP Parallel Directive),
A Loop over cells in a colour without its parent Loop over colours in the LFRic API,
An inner Loop without its parent outer Loop in the GOcean API,
Kernels that have a halo exchange call between them.

Extraction Libraries#

PSyclone comes with three extraction libraries:

one is based on NetCDF and will create NetCDF files which contain all input- and output-parameters.
the second one is a stand-alone library which uses only standard unformatted Fortran binary IO to write and read kernel data. The binary files produced using this library may not be portable between machines and compilers.
the third one is a stand-alone library which writes the data as ASCII files. While this format is intended to be very general, some compilers do not write enough digits for floating point numbers to reproduce the exact same binary representation. This can show up as small errors reported when running the drivers, even for trivial operations like x-y.

The best option for portability across different compilers and different hardware is the NetCDF extraction library.

The three extraction libraries are located in lib/extract/binary, lib/extract/ascii and lib/extract/netcdf.

All versions of the extraction libraries can be compiled with MPI support by setting the variable MPI=yes:

make MPI=yes ...

The only difference is that the output files will now have the process rank in the name. The compiled driver program accepts the name of the extracted kernel file as a command line parameter. If this is not specified, it will use the default name (module-region without a rank).

Extraction for GOcean#

The extraction libraries in lib/extract/binary/dl_esm_inf, lib/extract/ascii/dl_esm_inf and lib/extract/netcdf/dl_esm_inf implement the full PSyData API for use with the GOcean dl_esm_inf infrastructure library. When running the instrumented executable, it will create a corresponding data file for each instrumented code region. It includes all variables that are read before the code is executed, and all variables that have been modified. The output variables have the postfix _post attached to the names, e.g. a variable xyz that is read and written will be stored with the name xyz containing the input values, and the name xyz_post containing the output values. Arrays have their size explicitly stored (in case of NetCDF as dimensions): again the variable xyz will have its sizes stored as xyzdim1, xyzdim2 for the input values, and output arrays use the name xyz_postdim1, xyz_postdim2.
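The naming scheme described above can be summarised in a short sketch. The helper below is hypothetical and only models the names (it does not write any file); the variable name xyz and its two-dimensional shape are taken from the example in the text.

```python
def stored_names(name, shape=(), is_input=True, is_output=False):
    """Return the names under which one variable and its sizes are stored.

    `shape` is the array shape (empty for a scalar). Only the naming
    scheme is modelled here; the function itself is hypothetical.
    """
    variants = []
    if is_input:
        variants.append(name)
    if is_output:
        variants.append(name + "_post")
    names = []
    for variant in variants:
        names.append(variant)
        # One size entry per array dimension, numbered from 1.
        names.extend(f"{variant}dim{i}" for i in range(1, len(shape) + 1))
    return names


print(stored_names("xyz", shape=(10, 20), is_input=True, is_output=True))
# ['xyz', 'xyzdim1', 'xyzdim2', 'xyz_post', 'xyz_postdim1', 'xyz_postdim2']
```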

Note

The stand-alone libraries do not store the names of the variables in the output file, but will match the variable names in the created driver.

The output file contains the values of all variables used in the subroutine. The GOceanExtractTrans transformation can automatically create a driver program which will read the corresponding output file, call the instrumented region, and compare the results. To create this driver program, the create_driver entry of the options parameter must be set to True:

from psyclone.domain.gocean.transformations import GOceanExtractTrans

extract = GOceanExtractTrans()
extract.apply(schedule.children,
              {"create_driver": True,
               "region_name": ("main", "init")})

This will create a Fortran file called driver-main-init.f90, which can then be compiled and executed. This stand-alone program will read the output file created during an execution of the actual program, call the kernel with all required input parameters, and compare the output variables with the original output variables. This can be used to create stand-alone test cases to reproduce a bug, or for performance optimisation of a stand-alone kernel.

Warning

Care has to be taken that the driver matches the version of the code that was used to create the output file, otherwise the driver will likely crash. The stand-alone driver relies on a strict ordering of variable values in the output file and e.g. even renaming one variable can affect this. The NetCDF version stores the variable names and will not be able to find a variable if its name has changed.
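Why a rename can break a stand-alone driver may be easier to see in a small sketch. It assumes, purely for illustration, that values are stored without names and in alphabetical order of the variable names; the actual ordering used by the libraries is not specified here, and all function and variable names below are hypothetical.

```python
def write_positional(variables):
    """Store values only, in name order (the names are not stored)."""
    return [variables[name] for name in sorted(variables)]


def read_positional(stream, expected_names):
    """Assign stored values to the names the driver expects, by position."""
    return dict(zip(sorted(expected_names), stream))


# Data file written by code that calls the variable "dx".
stream = write_positional({"dx": 0.5, "tmask": 1, "ua": 3.0})

# A matching driver reads every value back correctly.
print(read_positional(stream, ["dx", "tmask", "ua"]))
# {'dx': 0.5, 'tmask': 1, 'ua': 3.0}

# A driver generated after "dx" was renamed to "zdx" misreads every value.
print(read_positional(stream, ["zdx", "tmask", "ua"]))
# {'tmask': 0.5, 'ua': 1, 'zdx': 3.0}
```

A name-based format such as NetCDF would instead fail with a lookup error for zdx, which matches the behaviour described in the warning.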

Extraction for LFRic#

The libraries in lib/extract/binary/lfric, lib/extract/ascii/lfric and lib/extract/netcdf/lfric implement the full PSyData API for use with the LFRic infrastructure library. When running the code, it will create an output file for each instrumented code region. The same logic for naming variables (using _post for output variables) used in Extraction for GOcean is used here.

Check Integrating PSyData Libraries into the LFRic Build Environment for the recommended way of linking an extraction library to LFRic.

The output file contains the values of all variables used in the subroutine. The LFRicExtractTrans transformation can automatically create a driver program which will read the corresponding output file, call the instrumented region, and compare the results. To create this driver program, the create_driver entry of the options parameter must be set to True:

from psyclone.domain.lfric.transformations import LFRicExtractTrans

extract = LFRicExtractTrans()
extract.apply(schedule.children,
              {"create_driver": True,
               "region_name": ("main", "init")})

This will create a Fortran file called driver-main-init.F90, which can then be compiled and executed. This stand-alone program will read the output file created during an execution of the actual program, call the kernel with all required input parameters, and compare the output variables with the original output variables. This can be used to create stand-alone test cases to reproduce a bug, or for performance optimisation of a stand-alone kernel.

Warning

Care has to be taken that the driver matches the version of the code that was used to create the output file, otherwise the driver will likely crash. The stand-alone drivers (both ASCII and binary) rely on a strict ordering of variable values in the output file and e.g. even renaming one variable can affect this. The NetCDF version stores the variable names and will not be able to find a variable if its name has changed.

The LFRic kernel driver will inline all required external modules into the driver. It uses a ModuleManager to find the required modules, based on the assumption that a file my_special_mod.f90 will define exactly one module called my_special_mod (the _mod is required to be part of the filename). The driver creator will sort the modules in the appropriate order and add the source code directly into the driver. As a result, the driver program is truly stand-alone and does not need any external dependency (the only exception being NetCDF if the NetCDF-based extraction library is used). The ModuleManager uses all kernel search paths specified on the command line (see -d option in The psyclone command), and it will recursively search for all files under each path specified on the command line.
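The file-name convention and the ordering step can be sketched as follows. This is not the ModuleManager implementation: the two helper functions, the case-insensitive .f90 suffix check and the example dependency table are assumptions made for illustration. The sketch searches each path recursively, maps a file such as my_special_mod.f90 to the module my_special_mod, and orders modules so that each one appears after the modules it uses.

```python
import tempfile
from graphlib import TopologicalSorter
from pathlib import Path


def find_module_files(search_paths):
    """Map module name -> file, assuming <name>_mod.f90 defines <name>_mod."""
    modules = {}
    for root in search_paths:
        for path in Path(root).rglob("*"):
            if (path.is_file() and path.suffix.lower() == ".f90"
                    and path.stem.endswith("_mod")):
                modules[path.stem] = path
    return modules


def inline_order(dependencies):
    """Order modules so that each one follows the modules it uses."""
    return list(TopologicalSorter(dependencies).static_order())


with tempfile.TemporaryDirectory() as tmp:
    nested = Path(tmp, "science", "kernels")
    nested.mkdir(parents=True)
    Path(nested, "my_special_mod.f90").write_text("module my_special_mod\n")
    Path(tmp, "constants_mod.F90").write_text("module constants_mod\n")
    Path(tmp, "helper.f90").write_text("! no _mod suffix, so ignored\n")
    found = find_module_files([tmp])
    print(sorted(found))    # ['constants_mod', 'my_special_mod']

# Hypothetical dependency table: my_special_mod uses constants_mod.
order = inline_order({"my_special_mod": {"constants_mod"},
                      "constants_mod": set()})
print(order)                # ['constants_mod', 'my_special_mod']
```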

Therefore, compilation for a created driver, e.g. the one created in examples/lfric/eg17/full_example_extract, is simple:

 $ gfortran -g -O0 driver-main-update.F90 -o driver-main-update
 $ ./driver-main-update
   Variable        count    identical    #rel<1E-9    #rel<1E-6    #rel<1E-3   #rel>=1E-3      max_abs      max_rel      l2_diff       l2_cos
       cell            1            1            0            0            0            0 .0000000E+00 .0000000E+00 .0000000E+00 .1000000E+01
field1_data          539          539            0            0            0            0 .0000000E+00 .0000000E+00 .0000000E+00 .1000000E+01
 dummy_var1            1            1            0            0            0            0 .0000000E+00 .0000000E+00 .0000000E+00 .1000000E+01

(see Driver Summary Statistics for details about the statistics). Note that the Makefile in the example provides additional include paths (infrastructure files and extraction library) to the compiler, but these flags are only required for compiling the example program, not for the driver.

Extraction for generic Fortran#

The libraries in lib/extract/binary/generic, lib/extract/ascii/generic and lib/extract/netcdf/generic implement the full PSyData API for use with generic code transformation. When running the code, it will create an output file for each instrumented code region. The same logic for naming variables used in Extraction for GOcean is used here.

Note

Driver creation for generic Fortran is not yet supported, and is tracked in issue #2058.

Driver Summary Statistics#

When a driver is executed, it will print summary statistics at the end for each variable that was modified, indicating the difference between the original values from when the data file was created, and the new ones computed when executing the kernel. These differences can be caused by changing the compilation options, or compiler version. Example output:

   Variable        count    identical    #rel<1E-9    #rel<1E-6    #rel<1E-3   #rel>=1E-3      max_abs      max_rel      l2_diff       l2_cos
       cell            1            1            0            0            0            0 .0000000E+00 .0000000E+00 .0000000E+00 .1000000E+01
field1_data          539          539            0            0            0            0 .0000000E+00 .0000000E+00 .0000000E+00 .1000000E+01
 dummy_var1            1            1            0            0            0            0 .0000000E+00 .0000000E+00 .0000000E+00 .1000000E+01

The columns from left to right are:

The variable name.
The number of elements for this variable (e.g. 1 for a scalar).
How many values are identical.
How many values have a relative error of less than 10^-9 but are not identical. Note that single precision variables typically do not have enough significant digits to have an error of 10^-9.
How many values have a relative error of less than 10^-6 but more than 10^-9.
How many values have a relative error of less than 10^-3 but more than 10^-6.
How many values have a relative error of 10^-3 or more.
The maximum absolute error of all elements.
The maximum relative error of all elements. If an element has the value 0, the relative error for this element is considered to be 1.0.
The L2 difference: sqrt(sum((original-new)²)).
The cosine of the angle between the two vectors: sum(original*new)/(sqrt(sum(original*original))*sqrt(sum(new*new))).
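The columns can be reproduced for a single variable with a short sketch. It is not the driver's Fortran implementation. Two details are assumptions that the text does not settle: the relative error is measured against the original value (with the rule that a zero original gives 1.0 applied only when the values differ), and two all-zero vectors are given a cosine of 1.0.

```python
import math


def summary_statistics(original, new):
    """Compute the summary columns for one variable (illustrative sketch)."""
    stats = {"count": len(original), "identical": 0, "rel<1E-9": 0,
             "rel<1E-6": 0, "rel<1E-3": 0, "rel>=1E-3": 0}
    max_abs = max_rel = 0.0
    for orig, val in zip(original, new):
        abs_err = abs(orig - val)
        if abs_err == 0.0:
            stats["identical"] += 1
            continue
        # Assumption: a zero original value gives a relative error of 1.0.
        rel_err = 1.0 if orig == 0 else abs_err / abs(orig)
        if rel_err < 1e-9:
            stats["rel<1E-9"] += 1
        elif rel_err < 1e-6:
            stats["rel<1E-6"] += 1
        elif rel_err < 1e-3:
            stats["rel<1E-3"] += 1
        else:
            stats["rel>=1E-3"] += 1
        max_abs = max(max_abs, abs_err)
        max_rel = max(max_rel, rel_err)
    stats["max_abs"] = max_abs
    stats["max_rel"] = max_rel
    stats["l2_diff"] = math.sqrt(sum((o - n) ** 2
                                     for o, n in zip(original, new)))
    norms = (math.sqrt(sum(o * o for o in original))
             * math.sqrt(sum(n * n for n in new)))
    dot = sum(o * n for o, n in zip(original, new))
    # Assumption: treat two all-zero vectors as perfectly aligned.
    stats["l2_cos"] = dot / norms if norms > 0.0 else 1.0
    return stats


stats = summary_statistics([1.0, 2.0, 0.0, 4.0],
                           [1.0, 2.0 + 2e-7, 0.0, 4.04])
for column, value in stats.items():
    print(f"{column:>10}: {value}")
```

In this example two values are identical, one falls into the 10^-6 bin and one into the bin for errors of 10^-3 or more, while the cosine stays very close to 1, which illustrates why that column may not be sensitive enough on its own.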

Note

The usefulness of the columns printed is still being evaluated. Early indications are that the cosine of the angle between the two vectors, which is commonly used in AI, might not be sensitive enough to give a good indication of the differences.