KNL¶
Architecture¶
The KNL is a processor with a large number of cores (64 to 72) at a moderate frequency (1.3 to 1.5 GHz). An SMT (Simultaneous Multithreading) mode allows each core to run up to four hardware threads if the user wishes.
This architecture is suitable for parallel applications (OpenMP or multithreaded, MPI or MPI+OpenMP in particular) that scale to at least a few dozen processes or threads. MPI+OpenMP hybridization can improve performance compared to a pure MPI approach.
Memory¶
The KNL processor has an on-board fast memory area (MCDRAM) with a size of 16 GB.
A KNL server can be configured in three different modes for the use of this memory by an application.
- cache mode
  In this mode, the MCDRAM memory is used by the system as a large (16 GB) third-level cache ("L3 cache"): data from the RAM (DDR4) of a KNL server are transparently cached in this fast memory area close to the processor.
- flat mode
  In this mode, the MCDRAM memory is an addressable memory space, in which an application can allocate all or part of its data (the rest remaining allocated in the server RAM). These allocations can be specified in the program or by execution commands (see the section "Exploitation of the fast MCDRAM memory").
- hybrid mode
  In hybrid mode, a quarter or half (4 or 8 GB) of the MCDRAM is L3 cache and the rest (12 or 8 GB, respectively) is addressable memory space.
Clustering modes¶
The main clustering modes, i.e. the ways of interconnecting the cores within a KNL, are the following. NUMA (Non-Uniform Memory Access), mentioned below, refers to an architecture in which the cores access some areas of the memory more quickly than others.
- Quadrant mode
  In quadrant mode, the KNL cores are organized into four groups, each located near a memory controller. A KNL in quadrant mode is seen by the system as a single NUMA entity.
- SNC modes
  In SNC (Sub-NUMA Clustering) modes, the KNL is subdivided into two (SNC-2 mode) or four (SNC-4 mode) NUMA entities.
Configuration on Myria¶
Myria includes ten KNL compute nodes, each with a KNL model 7210: 64 cores (4 threads per core), a 1.3 GHz CPU frequency, and 96 GB of DDR4 RAM at 2133 MHz per node.
These machines (as well as Myria's general purpose Broadwell processor machines) are interconnected by a high-speed (100 Gbps), low latency Omni-Path network. The Omni-Path controller is integrated into the KNL chip.
These ten KNL nodes are configured (statically) in quadrant clustering mode and in cache memory mode for the embedded 16 GB MCDRAM.
This configuration can be changed on multiple nodes (e.g., to test flat mode for MCDRAM) upon request to support@criann.fr.
Exploitation of the fast MCDRAM memory¶
In the case of the cache mode for MCDRAM, the operating system uses this memory as a 16 GB L3 cache.
In the case of flat mode for MCDRAM memory, the allocation of data in this memory can be specified in the following two ways (at runtime with numactl or at the application program level).
numactl command¶
When the amount of memory used by an application does not exceed 16 GB, the following command can be used to specify at runtime an allocation of all the data in the MCDRAM:
- in the case of the quadrant clustering mode:
$runcmd numactl --membind=1 ./a.out
- in the case of the SNC-2 clustering mode:
$runcmd numactl --membind=2,3 ./a.out
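The NUMA node numbers passed to --membind can be checked on the compute node itself. In flat mode, the MCDRAM is expected to appear as NUMA nodes that have memory but no cores (numactl --hardware lists them with an empty cpus line). The sketch below reads the standard Linux sysfs tree and prints the numbers of such nodes; the function name cpuless_numa_nodes and its optional directory argument are illustrative and not part of any Myria tool.

```shell
#!/bin/bash
# Sketch only: cpuless_numa_nodes is a hypothetical helper.
# It prints the IDs of NUMA nodes that have no CPUs attached, which on a
# KNL in flat mode should correspond to the MCDRAM nodes.
# The optional argument exists only so the function can be tried on a mock tree.
cpuless_numa_nodes() {
    local sysfs_root="${1:-/sys/devices/system/node}"
    local dir id
    for dir in "$sysfs_root"/node[0-9]*; do
        # Skip the unexpanded pattern and nodes without a readable cpulist
        [ -d "$dir" ] || continue
        [ -r "$dir/cpulist" ] || continue
        # cpulist contains only whitespace for a node without cores
        if [ -z "$(tr -d '[:space:]' < "$dir/cpulist")" ]; then
            id="${dir##*/node}"
            echo "$id"
        fi
    done
}

# Inspect the machine the script runs on
cpuless_numa_nodes
```

On a node in quadrant and flat modes, this should agree with the node number used in the --membind command above; in cache mode no CPU-less node is expected.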
If an application uses more than 16 GB of memory, in the case of a KNL in quadrant clustering mode, the --preferred option of numactl is available to preferentially allocate 16 GB of data in the MCDRAM (the rest being allocated in the server DDR4 RAM):
- in the case of the quadrant clustering mode:
$runcmd numactl --preferred=1 ./a.out
The $runcmd variable above designates an mpirun or srun command for the execution of codes using the MPI library (see the job_KNL.sl file in /soft/slurm/criann_modeles_scripts/ on Myria).
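The rules above can be gathered in a small shell helper. This is only a sketch for a KNL in flat mode: the function name mcdram_numactl and its two arguments (clustering mode, memory footprint in GB) are illustrative, and the NUMA node numbers are the ones quoted in this section. The case of more than 16 GB in SNC-2 mode is not described here, so the helper reports it as unsupported rather than guessing.

```shell
#!/bin/bash
# Sketch only: mcdram_numactl is a hypothetical helper, not a Myria tool.
# Usage: mcdram_numactl <quadrant|snc2> <memory footprint in GB>
# Prints the numactl prefix to use on a KNL in flat MCDRAM mode.
mcdram_numactl() {
    local clustering="$1" mem_gb="$2"
    case "$clustering" in
        quadrant)
            if [ "$mem_gb" -le 16 ]; then
                # Everything fits in the 16 GB of MCDRAM
                echo "numactl --membind=1"
            else
                # Fill the MCDRAM first, then fall back to DDR4
                echo "numactl --preferred=1"
            fi
            ;;
        snc2)
            if [ "$mem_gb" -le 16 ]; then
                echo "numactl --membind=2,3"
            else
                echo "more than 16 GB in SNC-2 mode is not covered here" >&2
                return 1
            fi
            ;;
        *)
            echo "unknown clustering mode: $clustering" >&2
            return 1
            ;;
    esac
}

# Example (commented out, requires a Slurm allocation):
# runcmd="srun"
# $runcmd $(mcdram_numactl quadrant 12) ./a.out
```

Keeping the choice in one function makes it easy to switch between --membind and --preferred when the memory footprint of a case changes.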
memkind library¶
Alternatively, the allocation in MCDRAM can be specified for selected data at the program level of an application.
The memkind library provides fast memory allocation functions. It is used differently in FORTRAN and in C.
Case of FORTRAN¶
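The compiler directive !DEC$ ATTRIBUTES FASTMEM :: VAR can be applied to an allocatable variable VAR, before its allocation statement, to force the allocation to be done in the MCDRAM.
For example, for an array declared as REAL, ALLOCATABLE :: A(:), the directive !DEC$ ATTRIBUTES FASTMEM :: A makes a later ALLOCATE(A(N)) use the fast memory.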
Case of C¶
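Fast memory allocation functions are available through the hbwmalloc interface of memkind (header hbwmalloc.h).
For example, hbw_malloc() and hbw_free() are used in place of malloc() and free() for the data to be placed in the MCDRAM.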
Compilation with memkind¶
An application using the memkind directives (FORTRAN) or functions (C) described above can be compiled on Myria with the Intel compiler: apply -lmemkind as a linker option or type module load memkind before compiling.
Compiler options¶
The optimization option for the KNL architecture is -xMIC-AVX512.
The following sets of optimization options can be used:
- for an executable optimized for KNL (but not executable on a general purpose Broadwell processor):
-O3 -xMIC-AVX512
- for a KNL and Broadwell optimized executable:
-O3 -axCORE-AVX2,MIC-AVX512
Submitting jobs¶
On Myria, the /soft/slurm/criann_modeles_scripts directory includes the generic template job_KNL.sl to submit a job on KNL with Slurm.
The job_KNL_quad_cache.sl template corresponds to the default configuration: quadrant mode (clustering) and cache mode (MCDRAM).
The knl partition must be specified, and the --constraint directive is used to specify the clustering and MCDRAM modes. For example, for quadrant and cache modes:
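A minimal sketch of such a job header follows. The feature names quad and cache are assumptions based on Slurm's usual naming for KNL node features; the job_KNL.sl templates on Myria remain the reference for the exact values.

```shell
#!/bin/bash
# Sketch of a Slurm job header for KNL (feature names are assumed, see above)

# Partition dedicated to the KNL nodes
#SBATCH --partition knl

# Clustering mode and MCDRAM mode: quadrant + cache (default configuration)
#SBATCH --constraint=quad,cache
```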
or for quadrant and flat modes (configuration to be requested from support@criann.fr):
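As a sketch under the same caveat, the header for quadrant and flat modes would differ only in the constraint. The feature names quad and flat are assumed from Slurm's usual KNL naming and should be checked against the templates on Myria. In this configuration the run line of the job would typically use numactl as described in the section on the fast MCDRAM memory.

```shell
#!/bin/bash
# Sketch of a Slurm job header for KNL in flat mode (feature names are assumed)

# Partition dedicated to the KNL nodes
#SBATCH --partition knl

# Clustering mode and MCDRAM mode: quadrant + flat (on request only)
#SBATCH --constraint=quad,flat
```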
Good practices¶
The PRACE project has published a best practice guide for performance on KNL: https://prace-ri.eu/training-support/best-practice-guides/best-practice-guide-knights-landing/