Data management

Overview

Teaching: 15 min
Exercises: 15 min
Questions
  • How do I use OpenACC and OpenMP data mapping directives?

Objectives
  • Perform basic profiling of GPU events

  • Apply data transfer OpenACC and OpenMP directives to improve the performance of the code

  • Understand differences between memory models

Data management

Where to start?

This episode starts in the 4_data/ directory. Decide whether you want to work on OpenACC, OpenMP, or both, and follow the instructions below.

Non-optimal memory management (e.g. excessive memory transfers) can heavily impact the performance of any GPU-accelerated code. It is therefore very important to understand how memory is mapped and copied between host and device.

When using the PGI compiler for OpenACC, this can be achieved with the -Minfo=accel compiler option. The information about memory transfers will be printed to stdout.

pgcc -O3 -acc -Minfo=accel -c -o laplace_acc.o laplace_acc.c
main:
     43, Generating implicit copyin(T[:][:]) [if not already present]
         Generating implicit copyout(T_new[1:2048][1:2048]) [if not already present]
     44, Loop is parallelizable
     45, Loop is parallelizable
         Generating Tesla code
         44, #pragma acc loop gang, vector(4) /* blockIdx.y threadIdx.y */
         45, #pragma acc loop gang, vector(32) /* blockIdx.x threadIdx.x */
     53, Generating implicit copyin(T_new[1:2048][1:2048]) [if not already present]
         Generating implicit copy(T[1:2048][1:2048]) [if not already present]
     54, Loop is parallelizable
     55, Loop is parallelizable
         Generating Tesla code
         54, #pragma acc loop gang, vector(4) /* blockIdx.y threadIdx.y */
         55, #pragma acc loop gang, vector(32) /* blockIdx.x threadIdx.x */
         56, Generating implicit reduction(max:dt)

As can be seen from the above report, the arrays T and T_new are copied in and out between host and device multiple times. This copying occurs in every iteration of the algorithm.
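
The report refers to two loop nests which, in the widely used Laplace example this episode builds on, look roughly like the sketch below (declarations omitted; the exact code in laplace_acc.c may differ in detail):

/* main computational kernel: each interior point becomes the average
   of its four neighbours */
#pragma acc kernels
for(i = 1; i <= GRIDX; i++)
    for(j = 1; j <= GRIDY; j++)
        T_new[i][j] = 0.25 * (T[i+1][j] + T[i-1][j] + T[i][j+1] + T[i][j-1]);

dt = 0.0;

/* compute the largest change and copy T_new back to T */
#pragma acc kernels
for(i = 1; i <= GRIDX; i++)
    for(j = 1; j <= GRIDY; j++){
        dt = MAX( fabs(T_new[i][j]-T[i][j]), dt);
        T[i][j] = T_new[i][j];
    }

Because there are no explicit data clauses, the compiler generates the implicit copyin/copyout/copy transfers listed above every time each kernels region executes.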

Note

We should acknowledge the usefulness of the -Minfo=accel compiler feedback option of the PGI compiler for OpenACC. GCC and Clang do not provide similar functionality for OpenMP.

The impact of memory transfers on the current performance of the GPU kernels can also be measured with a profiler such as nvprof, by running:

bash-4.2$ srun -u -n 1 nvprof ./laplace_mp 4000

As can be seen from the report generated below for the OpenMP version of the code, memory transfers represent more than 98% of the runtime (HtoD stands for Host to Device, DtoH stands for Device to Host).

==228979== Profiling application: ./laplace_mp 4000
==228979== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   51.25%  60.3494s     11236  5.3711ms  1.1520us  9.4626ms  [CUDA memcpy HtoD]
                   47.73%  56.2051s     11237  5.0018ms  1.4080us  10.967ms  [CUDA memcpy DtoH]
                    0.84%  987.39ms      2247  439.43us  430.30us  463.33us  __omp_offloading_47c4f666_4f0059e6_main_l56
                    0.18%  217.23ms      2247  96.677us  95.583us  98.144us  __omp_offloading_47c4f666_4f0059e6_main_l45

Analysing data transfers

As we have seen, memory transfers can take a significant amount of time if scheduled improperly. In the case of the Laplace example, the T and T_new arrays are copied multiple times in every iteration of the algorithm. More precisely, in each iteration of the algorithm we have (see the compiler report above):

  • an implicit copyin of T and copyout of T_new for the first loop nest,

  • an implicit copyin of T_new and a copy (one transfer in, one out) of T for the second loop nest.

This gives us 5 data transfers of a 33.5 MB buffer per iteration and roughly 11,000 data transfers for the entire run. However, if we analyse the data accesses in the implementation, we can clearly see that there is no need for this: we do not need any results on the host until after the while loop exits. We will fix this by using OpenACC and OpenMP compiler directives to indicate when and which data transfers should occur.

In both cases this is fairly simple. For OpenACC we place the acc data directive right before the while loop:

#pragma acc data copy(T), create(T_new)
while ( dt > MAX_TEMP_ERROR && iteration <= max_iterations ) {

We can achieve the same for OpenMP with the omp target data directive placed right before the while loop:

#pragma omp target data map(tofrom:T) map(alloc:T_new)
while ( dt > MAX_TEMP_ERROR && iteration <= max_iterations ) {
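
Putting the pieces together, the overall structure of the OpenMP version now looks roughly like the following sketch (loop bodies as in the standard Laplace stencil; the exact code in the 4_data/ sources may differ in detail):

/* T and T_new stay resident on the device for the whole lifetime of the loop */
#pragma omp target data map(tofrom:T) map(alloc:T_new)
while ( dt > MAX_TEMP_ERROR && iteration <= max_iterations ) {

    /* main computational kernel: no array transfers, data is already present */
    #pragma omp target
    #pragma omp teams distribute parallel for collapse(2)
    for(i = 1; i <= GRIDX; i++)
        for(j = 1; j <= GRIDY; j++)
            T_new[i][j] = 0.25 * (T[i+1][j] + T[i-1][j] + T[i][j+1] + T[i][j-1]);

    dt = 0.0;

    /* reduction kernel: only the scalar dt moves between host and device
       (see "Key differences" below for why it is mapped explicitly) */
    #pragma omp target map(dt)
    #pragma omp teams distribute parallel for collapse(2) reduction(max:dt)
    for(i = 1; i <= GRIDX; i++)
        for(j = 1; j <= GRIDY; j++){
            dt = MAX( fabs(T_new[i][j]-T[i][j]), dt);
            T[i][j] = T_new[i][j];
        }

    iteration++;
}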

There is in fact a one-to-one mapping between the OpenACC and OpenMP data transfer clauses.

OpenACC clause    OpenMP clause
copyin(A)         map(to:A)
copyout(A)        map(from:A)
copy(A)           map(tofrom:A)
create(A)         map(alloc:A)
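
As a small illustration of the first two rows (using hypothetical arrays A and B, not part of the Laplace code), an input-only array and a result-only array would be mapped like this:

/* OpenACC: A is copied to the device on entry, B is copied back on exit */
#pragma acc data copyin(A) copyout(B)
{
    /* ... offloaded compute regions using A and B ... */
}

/* OpenMP: the equivalent mapping expressed with map clauses */
#pragma omp target data map(to:A) map(from:B)
{
    /* ... offloaded compute regions using A and B ... */
}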

Let's run the nvprof profiler again on the OpenMP version.

bash-4.2$ srun -u -n 1 nvprof ./laplace_mp 4000
==301161== Profiling application: ./laplace_mp 4000
==301161== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   80.95%  1.00671s      2247  448.03us  434.78us  458.08us  __omp_offloading_47c4f666_6901a48d_main_l56
                   16.88%  209.86ms      2247  93.396us  92.415us  95.551us  __omp_offloading_47c4f666_6901a48d_main_l45
                    1.36%  16.877ms      2250  7.5000us  1.2160us  13.957ms  [CUDA memcpy DtoH]
                    0.81%  10.120ms      2249  4.4990us  1.2790us  7.1050ms  [CUDA memcpy HtoD]

What we notice is that the code runs much faster now, and as can be seen from the profiler output the memory transfers take only a small fraction of the runtime. The GPU kernels now represent around 98% of the runtime.

We have successfully and significantly reduced the total number of memory transfers of the large T and T_new arrays: from 11,000 transfers to only 2 transfers per run.

Key differences

Although we claim to have significantly reduced the number of data transfers, the nvprof report still indicates around 2250 x 2 data transfers. Those transfers are related to the use of dt in the second loop nest: this scalar variable needs to be copied in and out in every iteration of the algorithm. As mentioned before, there is a small difference in how the dt variable is handled in the OpenACC and OpenMP versions of the code. In the case of OpenMP we need to be more prescriptive and specify the type of data transfer for the dt variable. This is related to differences in how scalar variables are treated in the kernels and target constructs.

Default scalar mapping

Note

Scalar variables are treated slightly differently in OpenACC and OpenMP GPU regions.

In OpenMP a scalar variable that is not explicitly mapped is implicitly mapped as firstprivate, although this behaviour can be changed with the defaultmap(tofrom:scalar) clause.

In OpenACC a scalar variable that is not explicitly mapped (copied) will be treated:

  • as if it appeared in a firstprivate clause when used in a parallel construct,

  • as if it appeared in a copy clause when used in a kernels construct.

This is why in the OpenMP implementation we need to explicitly map the dt variable, which occurs in the reduction clause.

// compute the largest change and copy T_new to T
#pragma omp target map(dt)
#pragma omp teams distribute parallel for collapse(2) reduction(max:dt)
for(i = 1; i <= GRIDX; i++){
    for(j = 1; j <= GRIDY; j++){
      dt = MAX( fabs(T_new[i][j]-T[i][j]), dt);
      T[i][j] = T_new[i][j];
    }
}
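
As an alternative (a sketch only, not the version used in this episode), the implicit firstprivate treatment of scalars could instead be overridden for the whole construct with the defaultmap clause mentioned above:

// same loop nest, relying on defaultmap instead of an explicit map(dt)
#pragma omp target defaultmap(tofrom:scalar)
#pragma omp teams distribute parallel for collapse(2) reduction(max:dt)
for(i = 1; i <= GRIDX; i++){
    for(j = 1; j <= GRIDY; j++){
      dt = MAX( fabs(T_new[i][j]-T[i][j]), dt);
      T[i][j] = T_new[i][j];
    }
}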

Please be aware that a similar data mapping would need to be provided explicitly if we decided to implement the OpenACC version of the code with the more prescriptive parallel construct instead of the kernels construct, as sketched below.
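
For example, a prescriptive OpenACC variant of the second loop nest could look roughly like the sketch below; here the reduction clause must be written by hand and dt is mapped explicitly with copy (an illustration only, not the version used in this episode):

// compute the largest change and copy T_new to T, prescriptive variant
#pragma acc parallel loop collapse(2) reduction(max:dt) copy(dt)
for(i = 1; i <= GRIDX; i++){
    for(j = 1; j <= GRIDY; j++){
      dt = MAX( fabs(T_new[i][j]-T[i][j]), dt);
      T[i][j] = T_new[i][j];
    }
}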

Key Points

  • We have successfully and significantly reduced the total number of memory transfers

  • We have significantly increased the performance of both GPU implementations