GPGPU¶
Acceleration of applications on graphics processing unit (GPU)
Documentation¶
- CUDA

The CUDA API, which is based on the C/C++ language, allows the acceleration of computational codes on NVIDIA GPUs: https://developer.nvidia.com/cuda-zone

On the Myria file system, the directories /soft/cuda_<version>/cuda/doc and /soft/cuda_<version>/cuda/doc/html contain the CUDA 8.0 and 9.1 documentation.
- CUDA FORTRAN

The FORTRAN version of the CUDA API was originally published by PGI.
The user documentation is available in PDF format from the publisher's website: https://www.pgroup.com/resources/docs/19.7/pdf/pgi19cudaforug.pdf
The latest version is published by NVIDIA: https://docs.nvidia.com/hpc-sdk/compilers/cuda-fortran-prog-guide/index.html
- OpenACC directives

The OpenACC directives (http://www.openacc-standard.org/) are a standard for directive-based GPU programming. The OpenACC 2.6 specification is supported by the PGI 18.4 compiler.

Development is simpler than with the CUDA API used directly; the latter sometimes allows a finer exploitation of the GPU architecture. In other cases, OpenACC can be efficient while reducing the development-time-to-performance ratio. OpenACC can also be useful as a first step in a porting effort: some of the conclusions drawn can later be implemented with CUDA (a minimal OpenACC sketch is given after this list).
- OpenMP target directives

The OpenMP target directives are supported in particular by the NVIDIA HPC SDK compiler, from version 20.11 onwards.
On Myria, the course material /soft/formations/OpenMP-4to5-ATOS-CRIANN-2020/OpenMP4to5-final.pdf includes a final part, "Data Management for Devices", about these directives.
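As an illustration of the directive-based approach, here is a minimal OpenACC sketch in FORTRAN (the subroutine and variable names are illustrative and not taken from a Myria example):

subroutine saxpy_acc(n, a, x, y)
   implicit none
   integer, intent(in) :: n
   real, intent(in) :: a, x(n)
   real, intent(inout) :: y(n)
   integer :: i

   ! The compiler generates the GPU kernel and the host/GPU data transfers
   ! from the directive and its data clauses
   !$acc parallel loop copyin(x) copy(y)
   do i = 1, n
      y(i) = y(i) + a*x(i)
   end do
end subroutine saxpy_acc

Built with the -acc option of the PGI or NVIDIA compilers, the loop runs on the GPU; without the option, the directive is ignored and the same source still compiles and runs on the CPU.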
Software environment¶
The following commands activate the desired version of CUDA, the PGI compiler (for OpenACC or CUDA FORTRAN) or the NVIDIA compiler (HPC SDK, for OpenACC, CUDA FORTRAN or OpenMP target) on Myria. For these tools, more recent versions than those mentioned below may be available (see module avail).
These commands must be executed on one of the front-end nodes for compilation, and included in the commands of a submission script for execution on GPU resources.
- CUDA

Makefile templates: /soft/makefiles/SERIAL_GPU_CODES (CUDA and OpenCL examples), /soft/makefiles/MPI_CUDA_CODES/MakeIntelMPI_CUDA_C++
- CUDA with MPI library (Open MPI 3.0.1)

login@Myria-1:~ module purge
login@Myria-1:~ module load cuda/9.1 mpi/openmpi-3.0.1_cuda-9.1
load cuda 9.1 environment
load compilers/gnu/5.5.0 environment
load openmpi-3.0.1_cuda-9.1 environment

Makefile template for CUDA + Open MPI 3.0.1 code: /soft/makefiles/MPI_CUDA_CODES/MakeOpenMPI_CUDA_C++
- OpenACC or CUDA FORTRAN

Makefile template for CUDA FORTRAN: /soft/makefiles/MPI_CUDA_CODES/MakeIntelMPI_CUDA_Fortran
Makefile template for OpenACC: /soft/makefiles/MPI_OpenACC_CODES/MakeIntelMPI_PGI_OpenACC
- OpenACC with MPI library (Open MPI 3.0.1)

login@Myria-1:~ module purge
login@Myria-1:~ module load pgi/18.4 mpi/pgi-18.4_openmpi-3.0.1
load pgi/18.4 environment (OpenACC and CUDA FORTRAN support)
load pgi-18.4_openmpi-3.0.1 environment

Makefile template for OpenACC + Open MPI 3.0.1 code: /soft/makefiles/MPI_OpenACC_CODES/MakeOpenMPI_PGI_OpenACC
- OpenACC, CUDA FORTRAN or OpenMP target with MPI library (Open MPI 4.0.5)

login@Myria-1:~ module purge
login@Myria-1:~ module load -s mpi/nvhpc-21.7_openmpi-4.0.5-cuda-11.1
login@Myria-1:~ module list
Currently Loaded Modulefiles:
 1) nvidia-compilers/21.7   2) cuda/11.1   3) mpi/nvhpc-21.7_openmpi-4.0.5-cuda-11.1

Makefile template for CUDA FORTRAN (NVIDIA compiler options) + Open MPI 4.0.5 code: /soft/sample_codes/cuda_fortran/P2DJ_CUF/P2DJ_CUF_Directive_Kernel/Makefile_CudaFortran
Makefile template for OpenMP target + Open MPI 4.0.5 code: /soft/sample_codes/openmp_target/P2DJ_OpenMP-target/Makefile
(Illustrative compile lines with this environment are given below.)
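For reference, here are illustrative compile lines with the NVIDIA environment above (the source file and executable names are hypothetical; the Makefile templates mentioned above remain the reference):

login@Myria-1:~ mpif90 -cuda -gpu=cc70 code_cuf.f90 -o code_cuf        # CUDA FORTRAN
login@Myria-1:~ mpif90 -acc -gpu=cc70 code_acc.f90 -o code_acc         # OpenACC
login@Myria-1:~ mpif90 -mp=gpu -gpu=cc70 code_omp.f90 -o code_omp      # OpenMP target

The -gpu=cc70 option targets the compute capability of the V100 GPUs.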
Job submission¶
Myria GPU compute nodes accept the partitions (submission classes) gpu_all, gpu_court, gpu_k80, gpu_p100 and gpu_v100.
One or more GPU compute nodes can be dedicated to a user, on demand, for development purposes (when high availability of the GPU resource is needed): contact support@criann.fr.
To run GPU-accelerated code, simply add the directives:
# Partition (gpu_k80, gpu_p100 or gpu_v100)
#SBATCH --partition gpu_p100
# GPUs per compute node
#SBATCH --gres gpu:2
and the environment activation commands (see above) in one of the script templates job_serial.sl, job_OpenMP.sl or job_MPI(_OpenMP).sl available in /soft/slurm/criann_modeles_scripts. The file /soft/slurm/criann_modeles_scripts/job_MPI_OpenMP_GPU.sl provides an example for MPI / OpenMP / CUDA code.
Note: for GPU-accelerated MPI code, use the #SBATCH --nodes and #SBATCH --ntasks-per-node directives (instead of #SBATCH --ntasks).
Indeed, if each MPI process of the application should address a different GPU, it is enough to set #SBATCH --ntasks-per-node 4 in the gpu_k80 partition (these servers have 4 GPUs, provided by 2 Kepler K80 cards) or #SBATCH --ntasks-per-node 2 in the gpu_p100 partition (these servers have 2 Pascal P100 GPUs).
Example:
##
# Partition (gpu_k80, gpu_p100 or gpu_v100)
#SBATCH --partition gpu_p100
#
# GPUs per compute node
#SBATCH --gres gpu:2
#
# Compute nodes number
#SBATCH --nodes 3
#
# MPI tasks per compute node
#SBATCH --ntasks-per-node 2
#
# Threads per MPI task (if needed)
#SBATCH --cpus-per-task 8
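
For instance, a complete submission script could look like the following minimal sketch (the executable name ./my_gpu_app and the wall time are illustrative; adapt the module load line to the chosen environment, and see the templates in /soft/slurm/criann_modeles_scripts for the recommended launch command):

#!/bin/bash
# Partition and GPUs per compute node (P100 nodes have 2 GPUs)
#SBATCH --partition gpu_p100
#SBATCH --gres gpu:2
# 3 nodes, one MPI task per GPU, 8 threads per task
#SBATCH --nodes 3
#SBATCH --ntasks-per-node 2
#SBATCH --cpus-per-task 8
#SBATCH --time 02:00:00

module purge
module load -s mpi/nvhpc-21.7_openmpi-4.0.5-cuda-11.1

# Launch the 6 MPI tasks on the allocated nodes
srun ./my_gpu_app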
V100-SXM2 architecture¶
Five servers are each equipped with four V100-SXM2 GPUs (32 GB of memory each), two Skylake CPUs (16 cores each, at 2.1 GHz) and 187 GB of DDR4 RAM.
The four GPUs of a compute node are interconnected in pairs by NVLink2 links with 100 GB/s of bandwidth (50 GB/s in each direction).
These machines each have two Omni-Path network interfaces and are accessed through the gpu_v100 partition with Slurm.
CUDA-aware MPI¶
The configuration allows the execution of code programmed with a CUDA-aware MPI approach (calling MPI functions on data allocated in GPU memory).
This type of code can use NVLink for intra-node MPI exchanges.
Inter-node (CUDA-aware) MPI exchanges also benefit from direct GPU-memory-to-GPU-memory access over the fast network (GPUDirect RDMA technology, supported by Omni-Path).
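
As a minimal illustration in CUDA FORTRAN (the subroutine and buffer names are hypothetical), an array resident in GPU memory can be passed directly to an MPI call:

subroutine exchange_halo(send_dev, recv_dev, n, peer)
   use cudafor
   use MPI
   implicit none
   integer, intent(in) :: n, peer
   real, device, intent(in)    :: send_dev(n)   ! buffers resident in GPU memory
   real, device, intent(inout) :: recv_dev(n)
   integer :: ierr, status(MPI_STATUS_SIZE)

   ! The device arrays are given directly to MPI: the CUDA-aware Open MPI
   ! uses NVLink intra-node and GPUDirect RDMA inter-node
   call MPI_Sendrecv(send_dev, n, MPI_REAL, peer, 0, &
                     recv_dev, n, MPI_REAL, peer, 0, &
                     MPI_COMM_WORLD, status, ierr)
end subroutine exchange_halo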
Environments¶
- Open MPI 4.0.5, CUDA 11.1 and gcc 7.3.0
- or: Open MPI 4.0.2, CUDA 10.1 and icc 17.1
- or: Open MPI 4.0.5, CUDA 11.1 and PGI 20.7 (for OpenACC (C, C++, FORTRAN) or CUDA FORTRAN)
- or: Open MPI 4.0.5, CUDA 11.1 and nvhpc 21.7 (for OpenACC (C, C++, FORTRAN), CUDA FORTRAN or the OpenMP target directives for GPU)
Programming¶
In a CUDA + MPI approach, an MPI task must be assigned to a GPU by the appropriate function: cudaSetDevice() for CUDA, or acc_set_device_num() with OpenACC.
For CUDA-aware MPI with Omni-Path (Myria's interconnection network), this process/GPU association must be done before the call to MPI_Init().
The programming elements to perform this association are provided by IDRIS (whose batch system, Slurm, is also Myria's), for CUDA (C) or OpenACC:
http://www.idris.fr/eng/jean-zay/gpu/jean-zay-gpu-mpi-cuda-aware-gpudirect-eng.html
When using CUDA FORTRAN, this association must be done by the following code:
subroutine init_device()
   USE cudafor
   USE MPI
   implicit none
   character(len=6) :: local_rank_env
   integer :: local_rank_env_status, local_rank, istat

   ! Local rank of the process on its node, provided by Slurm
   call get_environment_variable (name="SLURM_LOCALID", &
        value=local_rank_env, status=local_rank_env_status)

   if (local_rank_env_status == 0) then
      read(local_rank_env, *) local_rank
      ! Associate this MPI process with one GPU of the node
      istat = cudaSetDevice(local_rank)
   else
      print *, "Slurm batch system must be used"
      STOP 1
   end if
end subroutine init_device
When using the OpenMP target directives with the NVIDIA compiler (mpi/nvhpc module indicated above), this association must be done by the following code:
subroutine init_device()
   USE cudafor
   USE omp_lib
   USE MPI
   implicit none
   character(len=6) :: local_rank_env
   integer :: local_rank_env_status, local_rank, istat

   ! Local rank of the process on its node, provided by Slurm
   call get_environment_variable (name="SLURM_LOCALID", &
        value=local_rank_env, status=local_rank_env_status)

   if (local_rank_env_status == 0) then
      read(local_rank_env, *) local_rank
      ! Associate this MPI process with one GPU of the node,
      ! for both the CUDA runtime and the OpenMP target regions
      istat = cudaSetDevice(local_rank)
      call omp_set_default_device(local_rank)
   else
      print *, "Slurm batch system must be used"
      STOP 1
   end if
end subroutine init_device
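
In both cases, init_device() is called before MPI initialization. For completeness, here is a minimal sketch of the calling sequence in the main program (the program name is illustrative):

program gpu_mpi_code
   use MPI
   implicit none
   integer :: ierr

   ! GPU selection first (based on SLURM_LOCALID), then MPI initialization
   call init_device()
   call MPI_Init(ierr)

   ! ... computations and (CUDA-aware) MPI communications ...

   call MPI_Finalize(ierr)
end program gpu_mpi_code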
Examples of programs¶
CUDA FORTRAN¶
Example files on Myria: /soft/sample_codes/cuda_fortran/P2DJ_CUF/
This code example uses the CUDA FORTRAN API. It allocates some data in non-paged host memory (keyword PINNED in the FORTRAN statement declaring the array). These data are then transferred in a performance-optimized way between the GPU and the host by the cudaMemcpyAsync() function. Data allocations in GPU memory are done with the DEVICE keyword in the FORTRAN statement declaring the array. This program calls MPI from host to host. The code is provided in two variants:
- Writing CUDA kernels in classical form (analogous to CUDA C): /soft/sample_codes/cuda_fortran/P2DJ_CUF/P2DJ_CUF_Classical_Kernel/

- Writing CUDA kernels with NVIDIA's !$cuf kernel do directive: /soft/sample_codes/cuda_fortran/P2DJ_CUF/P2DJ_CUF_Directive_Kernel/

This second approach combines the ease of use of kernel directives with CUDA features for optimized data transfers between host and GPU. The example uses the asynchronous form of the !$cuf kernel do directive (which brings the CUDA stream concept into play); a minimal sketch of these features is given below.
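
To illustrate these features outside of the full example, here is a minimal CUDA FORTRAN sketch (the subroutine, array and stream names are illustrative) combining pinned host memory, asynchronous transfers and a directive-generated kernel:

subroutine scale_on_gpu(n)
   use cudafor
   implicit none
   integer, intent(in) :: n
   real, allocatable, pinned :: a_host(:)   ! non-paged (pinned) host memory
   real, allocatable, device :: a_dev(:)    ! GPU memory
   integer(kind=cuda_stream_kind) :: stream
   integer :: i, istat

   allocate(a_host(n), a_dev(n))
   istat = cudaStreamCreate(stream)
   a_host = 1.0

   ! Asynchronous host-to-GPU transfer on the stream
   istat = cudaMemcpyAsync(a_dev, a_host, n, cudaMemcpyHostToDevice, stream)

   ! The kernel is generated from the loop and launched on the same stream
   !$cuf kernel do <<< *, *, 0, stream >>>
   do i = 1, n
      a_dev(i) = 2.0*a_dev(i)
   end do

   ! Asynchronous GPU-to-host transfer, then wait for completion
   istat = cudaMemcpyAsync(a_host, a_dev, n, cudaMemcpyDeviceToHost, stream)
   istat = cudaStreamSynchronize(stream)

   deallocate(a_host, a_dev)
end subroutine scale_on_gpu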
OpenMP target directives¶
Example files on Myria: /soft/sample_codes/openmp_target/P2DJ_OpenMP-target
This code example uses the OpenMP target directives in FORTRAN (!$omp target); a minimal sketch of such a directive is given below.
It performs MPI communications from GPU to GPU (CUDA-aware MPI).
The association between MPI task and GPU is done before the call to MPI_Init(), by the init_device() subroutine (shown above).