Loop parallelisation
Overview
Teaching: 15 min
Exercises: 15 min
Questions
Basic OpenMP directives to parallelise loop nests
Objectives
Apply basic directives to parallelise loops
Loop parallelisation
Where to start?
This episode starts in the 3_loops/ directory.
In this section we will apply basic OpenMP directives to parallelise loop nests identified as the most computationally expensive in the previous profiling step.
Loop parallelisation directives can be placed right before each of the loop nests in the code.
OpenMP
We will insert directives right before the first for loop of each loop nest, starting with the target directive, which (for each structured block):
- creates/destroys the data environment on the device(s),
- maps data between the host and device(s) data environments,
- offloads the OpenMP target region (structured block) to the target device(s),
- automatically updates data between the host and device(s).

Please note that the target directive alone will not parallelise the underlying loop nests. For this to happen, we need to be more prescriptive and specify exactly what we want to achieve.
To achieve proper parallelisation across the available GPU threads we will use the following two OpenMP constructs:
- teams, which creates a league of thread teams with the master thread of each team executing the region,
- distribute parallel for, which specifies a loop that can be executed in parallel by multiple threads that are members of multiple teams.
In the case of our Laplace example those directives can be applied as follows:
// main computational kernel, average over neighbours in the grid
#pragma omp target
#pragma omp teams distribute parallel for collapse(2)
for(i = 1; i <= GRIDX; i++)
    for(j = 1; j <= GRIDY; j++)
        T_new[i][j] = 0.25 * (T[i+1][j] + T[i-1][j] +
                              T[i][j+1] + T[i][j-1]);
// reset dt
dt = 0.0;
// compute the largest change and copy T_new to T
#pragma omp target map(dt)
#pragma omp teams distribute parallel for collapse(2) reduction(max:dt)
for(i = 1; i <= GRIDX; i++){
    for(j = 1; j <= GRIDY; j++){
        dt = MAX( fabs(T_new[i][j]-T[i][j]), dt);
        T[i][j] = T_new[i][j];
    }
}
Important notes
- In the case of the second loop nest we also specify that there is a reduction on the dt variable by adding the reduction(max:dt) clause,
- We also manually specify that the variable dt needs to be mapped between the host and device data environments. This will be discussed in the next step of the tutorial; for now, keep in mind that in OpenMP, scalar variables that are not explicitly mapped are implicitly mapped as firstprivate.
Comments and further analysis
Note that we were not required to change the structure of the code to achieve GPU parallelisation. Although the Laplace example used in this tutorial gives us room to explore various OpenMP directives and options, it is still a very simple program. In general, GPU parallelisation might require restructuring the code.
KEY COMMENTS
- This is usually not the last step of GPU programming with directives. A deeper analysis of data transfers will be done in the next step. It is also important not to rely on automatic parallelisation alone, but to understand how different parameters (like block and vector sizes) might impact the final performance.
Key Points
We have explored OpenMP loop constructs for GPU parallelisation