GPGPU¶
Acceleration of applications on graphics processing unit (GPU)
Documentation¶
- CUDA

The CUDA API, which is based on the C/C++ language, allows the acceleration of computational codes on NVIDIA GPUs: https://developer.nvidia.com/cuda-zone

On the Myria file system, the directories /soft/cuda_<version>/cuda/doc and /soft/cuda_<version>/cuda/doc/html contain the CUDA 8.0 and 9.1 documentation.
- CUDA FORTRAN

The FORTRAN version of the CUDA API was originally published by PGI.
The user documentation is available in PDF format from the publisher's website: https://www.pgroup.com/resources/docs/19.7/pdf/pgi19cudaforug.pdf
The latest version is published by NVIDIA: https://docs.nvidia.com/hpc-sdk/compilers/cuda-fortran-prog-guide/index.html
- OpenACC directives

The OpenACC directives (http://www.openacc-standard.org/) are a standard for directive-based GPU programming. The OpenACC 2.6 specification is supported by the PGI 18.4 compiler.

Development is simpler than with the CUDA API used directly; the latter sometimes allows a finer exploitation of the GPU architecture. In other cases, OpenACC can be efficient while reducing the development-time-to-performance ratio. OpenACC can also be useful as a first step in a porting effort: some of the conclusions drawn can later be implemented with CUDA (a minimal OpenACC sketch is given after this list).
- OpenMP target directives

The OpenMP target directives are supported in particular by the NVIDIA HPC SDK compiler, from version 20.11 onwards.
On Myria, the course material /soft/formations/OpenMP-4to5-ATOS-CRIANN-2020/OpenMP4to5-final.pdf includes a final part, "Data Management for Devices", about these directives.
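As an illustration of the directive-based approach, here is a minimal OpenACC sketch in FORTRAN (the subroutine and variable names are illustrative and not taken from a Myria example):

subroutine saxpy_acc(n, a, x, y)
   implicit none
   integer, intent(in) :: n
   real, intent(in) :: a, x(n)
   real, intent(inout) :: y(n)
   integer :: i

   ! The compiler generates the GPU kernel and the host/GPU data transfers
   ! from the directive and its data clauses
   !$acc parallel loop copyin(x) copy(y)
   do i = 1, n
      y(i) = y(i) + a*x(i)
   end do
end subroutine saxpy_acc

Built with the -acc option of the PGI or NVIDIA compilers, the loop runs on the GPU; without the option, the directive is ignored and the same source still compiles and runs on the CPU.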
Software environment¶
The following commands activate the desired version of CUDA, the PGI compiler (for OpenACC or CUDA FORTRAN) or the NVIDIA compiler (HPC SDK, for OpenACC, CUDA FORTRAN or OpenMP target) on Myria. For these tools, more recent versions than those mentioned below may be available (see module avail).
These commands must be executed on one of the front-end nodes for compilation, and included in the commands of a submission script for execution on GPU resources.
- CUDA

Makefile templates: /soft/makefiles/SERIAL_GPU_CODES (CUDA and OpenCL examples), /soft/makefiles/MPI_CUDA_CODES/MakeIntelMPI_CUDA_C++
- CUDA with MPI library (Open MPI 3.0.1)

login@Myria-1:~ module purge
login@Myria-1:~ module load cuda/9.1 mpi/openmpi-3.0.1_cuda-9.1
load cuda 9.1 environment
load compilers/gnu/5.5.0 environment
load openmpi-3.0.1_cuda-9.1 environment

Makefile template for CUDA + Open MPI 3.0.1 code: /soft/makefiles/MPI_CUDA_CODES/MakeOpenMPI_CUDA_C++
- OpenACC or CUDA FORTRAN

Makefile template for CUDA FORTRAN: /soft/makefiles/MPI_CUDA_CODES/MakeIntelMPI_CUDA_Fortran
Makefile template for OpenACC: /soft/makefiles/MPI_OpenACC_CODES/MakeIntelMPI_PGI_OpenACC
- OpenACC with MPI library (Open MPI 3.0.1)

login@Myria-1:~ module purge
login@Myria-1:~ module load pgi/18.4 mpi/pgi-18.4_openmpi-3.0.1
load pgi/18.4 environment (OpenACC and CUDA FORTRAN support)
load pgi-18.4_openmpi-3.0.1 environment

Makefile template for OpenACC + Open MPI 3.0.1 code: /soft/makefiles/MPI_OpenACC_CODES/MakeOpenMPI_PGI_OpenACC
- OpenACC, CUDA FORTRAN or OpenMP target with MPI library (Open MPI 4.0.5)

login@Myria-1:~ module purge
login@Myria-1:~ module load -s mpi/nvhpc-21.7_openmpi-4.0.5-cuda-11.1
login@Myria-1:~ module list
Currently Loaded Modulefiles:
 1) nvidia-compilers/21.7   2) cuda/11.1   3) mpi/nvhpc-21.7_openmpi-4.0.5-cuda-11.1

Makefile template for CUDA FORTRAN (NVIDIA compiler options) + Open MPI 4.0.5 code: /soft/sample_codes/cuda_fortran/P2DJ_CUF/P2DJ_CUF_Directive_Kernel/Makefile_CudaFortran
Makefile template for OpenMP target + Open MPI 4.0.5 code: /soft/sample_codes/openmp_target/P2DJ_OpenMP-target/Makefile
(Illustrative compile lines with this environment are given below.)
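For reference, here are illustrative compile lines with the NVIDIA environment above (the source file and executable names are hypothetical; the Makefile templates mentioned above remain the reference):

login@Myria-1:~ mpif90 -cuda -gpu=cc70 code_cuf.f90 -o code_cuf        # CUDA FORTRAN
login@Myria-1:~ mpif90 -acc -gpu=cc70 code_acc.f90 -o code_acc         # OpenACC
login@Myria-1:~ mpif90 -mp=gpu -gpu=cc70 code_omp.f90 -o code_omp      # OpenMP target

The -gpu=cc70 option targets the compute capability of the V100 GPUs.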
Job submission¶
Myria GPU compute nodes accept the partitions (submission classes) gpu_all, gpu_court, gpu_k80, gpu_p100 and gpu_v100.
One or more GPU compute nodes can be dedicated to a user, on demand, for development purposes (when high availability of the GPU resource is needed): contact support@criann.fr.
To run GPU-accelerated code, simply add the directives:
# Partition (gpu_k80, gpu_p100 or gpu_v100)
#SBATCH --partition gpu_p100
# GPUs per compute node
#SBATCH --gres gpu:2
and the environment activation commands (see above) in one of the script templates job_serial.sl, job_OpenMP.sl or job_MPI(_OpenMP).sl available in /soft/slurm/criann_modeles_scripts. The file /soft/slurm/criann_modeles_scripts/job_MPI_OpenMP_GPU.sl provides an example for MPI / OpenMP / CUDA code.
Note: for GPU-accelerated MPI code, use the #SBATCH --nodes and #SBATCH --ntasks-per-node directives (instead of #SBATCH --ntasks).
Indeed, if each MPI process of the application should address a different GPU, it is enough to set #SBATCH --ntasks-per-node 4 in the gpu_k80 partition (these servers have 4 GPUs, provided by 2 Kepler K80 cards) or #SBATCH --ntasks-per-node 2 in the gpu_p100 partition (these servers have 2 Pascal P100 GPUs).
Example:
##
# Partition (gpu_k80, gpu_p100 or gpu_v100)
#SBATCH --partition gpu_p100
#
# GPUs per compute node
#SBATCH --gres gpu:2
#
# Compute nodes number
#SBATCH --nodes 3
#
# MPI tasks per compute node
#SBATCH --ntasks-per-node 2
#
# Threads per MPI task (if needed)
#SBATCH --cpus-per-task 8
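
For instance, a complete submission script could look like the following minimal sketch (the executable name ./my_gpu_app and the wall time are illustrative; adapt the module load line to the chosen environment, and see the templates in /soft/slurm/criann_modeles_scripts for the recommended launch command):

#!/bin/bash
# Partition and GPUs per compute node (P100 nodes have 2 GPUs)
#SBATCH --partition gpu_p100
#SBATCH --gres gpu:2
# 3 nodes, one MPI task per GPU, 8 threads per task
#SBATCH --nodes 3
#SBATCH --ntasks-per-node 2
#SBATCH --cpus-per-task 8
#SBATCH --time 02:00:00

module purge
module load -s mpi/nvhpc-21.7_openmpi-4.0.5-cuda-11.1

# Launch the 6 MPI tasks on the allocated nodes
srun ./my_gpu_app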
V100-SXM2 architecture¶
Five servers are each equipped with four V100-SXM2 GPUs (32 GB of memory each), two Skylake CPUs (16 cores each, at 2.1 GHz) and 187 GB of DDR4 RAM.
The four GPUs of a compute node are interconnected in pairs by NVLink2 links with 100 GB/s of bandwidth (50 GB/s in each direction).
These machines each have two Omni-Path network interfaces and are accessed through the gpu_v100 partition with Slurm.
CUDA-aware MPI¶
The configuration allows the execution of code programmed with a CUDA-aware MPI approach (calling MPI functions on data allocated in GPU memory).
This type of code can use NVLink for intra-node MPI exchanges.
Inter-node (CUDA-aware) MPI exchanges also benefit from direct GPU-memory-to-GPU-memory access over the fast network (GPUDirect RDMA technology, supported by Omni-Path).
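
As a minimal illustration in CUDA FORTRAN (the subroutine and buffer names are hypothetical), an array resident in GPU memory can be passed directly to an MPI call:

subroutine exchange_halo(send_dev, recv_dev, n, peer)
   use cudafor
   use MPI
   implicit none
   integer, intent(in) :: n, peer
   real, device, intent(in)    :: send_dev(n)   ! buffers resident in GPU memory
   real, device, intent(inout) :: recv_dev(n)
   integer :: ierr, status(MPI_STATUS_SIZE)

   ! The device arrays are given directly to MPI: the CUDA-aware Open MPI
   ! uses NVLink intra-node and GPUDirect RDMA inter-node
   call MPI_Sendrecv(send_dev, n, MPI_REAL, peer, 0, &
                     recv_dev, n, MPI_REAL, peer, 0, &
                     MPI_COMM_WORLD, status, ierr)
end subroutine exchange_halo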
Environments¶
- Open MPI 4.0.5, CUDA 11.1 and gcc 7.3.0
- or: Open MPI 4.0.2, CUDA 10.1 and icc 17.1
- or: Open MPI 4.0.5, CUDA 11.1 and PGI 20.7 (for OpenACC (C, C++, FORTRAN) or CUDA FORTRAN)
- or: Open MPI 4.0.5, CUDA 11.1 and nvhpc 21.7 (for OpenACC (C, C++, FORTRAN), CUDA FORTRAN or the OpenMP target directives for GPU)
Programming¶
In a CUDA + MPI approach, an MPI task must be assigned to a GPU by the appropriate function: cudaSetDevice() for CUDA, or acc_set_device_num() with OpenACC.
For CUDA-aware MPI with Omni-Path (Myria's interconnection network), this process/GPU association must be done before the call to MPI_Init().
The programming elements to perform this association are provided by IDRIS (whose batch system, Slurm, is also Myria's), for CUDA (C) or OpenACC:
http://www.idris.fr/eng/jean-zay/gpu/jean-zay-gpu-mpi-cuda-aware-gpudirect-eng.html
When using CUDA FORTRAN, this association must be done by the following code:
subroutine init_device()
   USE cudafor
   USE MPI
   implicit none
   character(len=6) :: local_rank_env
   integer :: local_rank_env_status, local_rank, istat

   ! Local rank of the process on its node, provided by Slurm
   call get_environment_variable (name="SLURM_LOCALID", &
        value=local_rank_env, status=local_rank_env_status)

   if (local_rank_env_status == 0) then
      read(local_rank_env, *) local_rank
      ! Associate this MPI process with one GPU of the node
      istat = cudaSetDevice(local_rank)
   else
      print *, "Slurm batch system must be used"
      STOP 1
   end if
end subroutine init_device
When using the OpenMP target directives with the NVIDIA compiler (mpi/nvhpc module indicated above), this association must be done by the following code:
subroutine init_device()
   USE cudafor
   USE omp_lib
   USE MPI
   implicit none
   character(len=6) :: local_rank_env
   integer :: local_rank_env_status, local_rank, istat

   ! Local rank of the process on its node, provided by Slurm
   call get_environment_variable (name="SLURM_LOCALID", &
        value=local_rank_env, status=local_rank_env_status)

   if (local_rank_env_status == 0) then
      read(local_rank_env, *) local_rank
      ! Associate this MPI process with one GPU of the node,
      ! for both the CUDA runtime and the OpenMP target regions
      istat = cudaSetDevice(local_rank)
      call omp_set_default_device(local_rank)
   else
      print *, "Slurm batch system must be used"
      STOP 1
   end if
end subroutine init_device
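
In both cases, init_device() is called before MPI initialization. For completeness, here is a minimal sketch of the calling sequence in the main program (the program name is illustrative):

program gpu_mpi_code
   use MPI
   implicit none
   integer :: ierr

   ! GPU selection first (based on SLURM_LOCALID), then MPI initialization
   call init_device()
   call MPI_Init(ierr)

   ! ... computations and (CUDA-aware) MPI communications ...

   call MPI_Finalize(ierr)
end program gpu_mpi_code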
Examples of programs¶
CUDA FORTRAN¶
Example files on Myria: /soft/sample_codes/cuda_fortran/P2DJ_CUF/
This code example uses the CUDA FORTRAN API. It allocates some data in non-paged host memory (keyword PINNED in the FORTRAN statement declaring the array). These data are then transferred in a performance-optimized way between the GPU and the host by the cudaMemcpyAsync() function. Data allocations in GPU memory are done with the DEVICE keyword in the FORTRAN statement declaring the array. This program calls MPI from host to host. The code is provided in two variants:
- Writing CUDA kernels in classical form (analogous to CUDA C): /soft/sample_codes/cuda_fortran/P2DJ_CUF/P2DJ_CUF_Classical_Kernel/

- Writing CUDA kernels with NVIDIA's !$cuf kernel do directive: /soft/sample_codes/cuda_fortran/P2DJ_CUF/P2DJ_CUF_Directive_Kernel/

This second approach combines the ease of use of kernel directives with CUDA features for optimized data transfers between host and GPU. The example uses the asynchronous form of the !$cuf kernel do directive (which brings the CUDA stream concept into play); a minimal sketch of these features is given below.
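
To illustrate these features outside of the full example, here is a minimal CUDA FORTRAN sketch (the subroutine, array and stream names are illustrative) combining pinned host memory, asynchronous transfers and a directive-generated kernel:

subroutine scale_on_gpu(n)
   use cudafor
   implicit none
   integer, intent(in) :: n
   real, allocatable, pinned :: a_host(:)   ! non-paged (pinned) host memory
   real, allocatable, device :: a_dev(:)    ! GPU memory
   integer(kind=cuda_stream_kind) :: stream
   integer :: i, istat

   allocate(a_host(n), a_dev(n))
   istat = cudaStreamCreate(stream)
   a_host = 1.0

   ! Asynchronous host-to-GPU transfer on the stream
   istat = cudaMemcpyAsync(a_dev, a_host, n, cudaMemcpyHostToDevice, stream)

   ! The kernel is generated from the loop and launched on the same stream
   !$cuf kernel do <<< *, *, 0, stream >>>
   do i = 1, n
      a_dev(i) = 2.0*a_dev(i)
   end do

   ! Asynchronous GPU-to-host transfer, then wait for completion
   istat = cudaMemcpyAsync(a_host, a_dev, n, cudaMemcpyDeviceToHost, stream)
   istat = cudaStreamSynchronize(stream)

   deallocate(a_host, a_dev)
end subroutine scale_on_gpu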
OpenMP target directives¶
Example files on Myria: /soft/sample_codes/openmp_target/P2DJ_OpenMP-target
This code example uses the OpenMP target directives in FORTRAN (!$omp target); a minimal sketch of such a directive is given below.
It performs MPI communications from GPU to GPU (CUDA-aware MPI).
The association between MPI task and GPU is done before the call to MPI_Init(), by the init_device() subroutine (shown above).