Loop parallelisation

Overview

Teaching: 15 min
Exercises: 15 min
Questions
  • Basic OpenMP directives to parallelise loop nests

Objectives
  • Apply basic directives to parallelise loops

Loop parallelisation

Where to start?

This episode starts in the 3_loops/ directory.

In this section we will apply basic OpenMP directives to parallelise loop nests identified as the most computationally expensive in the previous profiling step.

Loop parallelisation directives can be placed right before each of the loop nests in the code.

OpenMP

We will start by inserting directives right before the first for loop of each loop nest. The first of these is the target directive, which will (for each of the structured blocks):

  • map the variables referenced in the block to the device data environment,
  • transfer control of execution to the target device, where the structured block is executed.
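
On its own, the target directive only offloads execution; it does not create any parallelism. The minimal sketch below (not part of the Laplace code, and assuming a compiler built with offload support) illustrates this:

#include <stdio.h>
#define N 8

int main(void)
{
    double a[N];

    // The structured block is executed on the device; without
    // further directives the loop runs sequentially there.
    // The array a is implicitly mapped tofrom.
    #pragma omp target
    for (int i = 0; i < N; i++)
        a[i] = (double)i;

    printf("a[N-1] = %f\n", a[N - 1]);   // prints 7.000000
    return 0;
}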

To achieve proper parallelisation across the available GPU threads we will use the following two OpenMP constructs:

  • teams distribute — creates a league of teams and distributes the iterations of the loop across them,
  • parallel for — further splits each team's share of the iterations across the threads within that team.

In addition, the collapse(2) clause merges the two nested loops into a single iteration space before the work is distributed.

In the case of our Laplace example those directives can be applied as follows:

// main computational kernel, average over neighbours in the grid
#pragma omp target
#pragma omp teams distribute parallel for collapse(2)
for(i = 1; i <= GRIDX; i++)
    for(j = 1; j <= GRIDY; j++)
        T_new[i][j] = 0.25 * (T[i+1][j] + T[i-1][j] +
                              T[i][j+1] + T[i][j-1]);

// reset dt
dt = 0.0;

// compute the largest change and copy T_new to T
#pragma omp target map(dt)
#pragma omp teams distribute parallel for collapse(2) reduction(max:dt)
for(i = 1; i <= GRIDX; i++){
    for(j = 1; j <= GRIDY; j++){
      dt = MAX( fabs(T_new[i][j]-T[i][j]), dt);
      T[i][j] = T_new[i][j];
    }
}

Important notes

  1. In the case of the second loop nest we are also specifying that there is a reduction on the dt variable, by adding the reduction(max:dt) clause.
  2. We are also explicitly specifying that the variable dt needs to be mapped between the host and device data environments. This will be discussed in the next step of the tutorial; for now we should just keep in mind that in OpenMP scalar variables that are not explicitly mapped are implicitly mapped as firstprivate (see the sketch after this list).
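
As a quick illustration of why map(dt) matters, consider the following sketch (again not part of the Laplace code, and assuming a compiler built with offload support):

#include <stdio.h>

int main(void)
{
    double dt = 0.0;

    // Scalars are implicitly mapped firstprivate in a target
    // region: this update affects the device copy only.
    #pragma omp target
    { dt = 1.0; }
    printf("without map: dt = %f\n", dt);   // prints 0.000000

    // map(dt) defaults to map(tofrom: dt), so the device value
    // is copied back to the host when the region ends.
    #pragma omp target map(dt)
    { dt = 1.0; }
    printf("with map:    dt = %f\n", dt);   // prints 1.000000

    return 0;
}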

Comments and further analysis

Note that we were not required to change the structure of the code to achieve GPU parallelisation. Although the Laplace example used in this tutorial gives us space to explore various OpenMP directives and options, it is still a very simple program. In the general case, GPU parallelisation might require restructuring the code.

KEY COMMENTS

  1. This is usually not the last step of GPU programming with directives. A deeper analysis of data transfers will be done in the next step. It is also important not to rely solely on automatic parallelisation, but to understand how different parameters (such as block and vector sizes) might impact the final performance, as sketched below.
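
For instance, OpenMP provides the num_teams and thread_limit clauses to set such limits explicitly. The variant below is illustrative only; the values shown are assumptions, not tuned for any particular GPU:

// Illustrative only: at most 256 teams with at most 128 threads
// each; good values depend on the hardware and the problem size.
#pragma omp target
#pragma omp teams distribute parallel for collapse(2) \
        num_teams(256) thread_limit(128)
for(i = 1; i <= GRIDX; i++)
    for(j = 1; j <= GRIDY; j++)
        T_new[i][j] = 0.25 * (T[i+1][j] + T[i-1][j] +
                              T[i][j+1] + T[i][j-1]);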

Key Points

  • We have explored OpenMP loop constructs for GPU parallelisation