The KNL is a processor with a large number of cores (64 to 72) running at a moderate frequency (1.3 to 1.5 GHz). An SMT (Simultaneous Multi-Threading) mode allows each core to run four threads if the user wishes.
This architecture suits parallel applications (OpenMP or multithreaded, MPI, or MPI+OpenMP in particular) that scale to at least a few dozen processes. MPI+OpenMP hybridization can improve performance compared with a pure MPI approach.
The KNL processor has an on-board fast memory area (MCDRAM) with a size of 16 GB.
A KNL server can be configured in three different modes for the use of this memory by an application.
In cache mode, the MCDRAM is used by the system as a large (16 GB) third-level cache ("L3 cache"): data from the DDR4 RAM of a KNL server are copied into this fast memory area, close to the processor, as they are accessed.
In flat mode, the MCDRAM is an addressable memory space in which an application can allocate all or part of its data (the rest remaining allocated in the server RAM). These allocations can be specified in the program or through execution commands (see the section "Exploitation of the fast MCDRAM memory").
In hybrid mode, a quarter or half (4 or 8 GB) of the MCDRAM is used as L3 cache, and the rest (12 or 8 GB, respectively) is addressable memory space.
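The hybrid split is simple arithmetic on the 16 GB total: a quarter used as cache leaves 12 GB addressable, and half used as cache leaves 8 GB. The sketch below only illustrates that arithmetic; the helper name `hybrid_split` and its interface are hypothetical and not part of any KNL tool.

```c
/* Total MCDRAM capacity of a KNL, in GB (from the text above). */
#define MCDRAM_TOTAL_GB 16

/*
 * Hypothetical helper: cache_divisor is 4 for "a quarter as cache"
 * or 2 for "half as cache". Returns the cache size in GB and stores
 * the remaining addressable (flat) size in *flat_gb.
 * Returns -1 for any other divisor, leaving *flat_gb untouched.
 */
int hybrid_split(int cache_divisor, int *flat_gb)
{
    if (cache_divisor != 2 && cache_divisor != 4)
        return -1;

    int cache_gb = MCDRAM_TOTAL_GB / cache_divisor;
    *flat_gb = MCDRAM_TOTAL_GB - cache_gb;
    return cache_gb;
}
```

Calling `hybrid_split(4, &flat)` yields 4 GB of cache with 12 GB addressable, and `hybrid_split(2, &flat)` yields 8 GB of each.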
The main clustering modes, i.e. the ways of interconnecting the cores within a KNL, are the following. The NUMA (Non-Uniform Memory Access) entities mentioned below refer to an architecture in which cores access some areas of memory more quickly than others.
In quadrant mode, the KNL cores are organized into four groups, each located near a memory controller. A KNL in quadrant mode is seen by the system as a single NUMA entity.
In SNC (Sub-NUMA Clustering) modes, the KNL is subdivided into two (SNC-2 mode) or four (SNC-4 mode) NUMA entities.
Configuration on Myria
Myria consists of ten compute nodes, each with a KNL model 7210, i.e., 64 cores (4 threads per core), 1.3 GHz CPU frequency with 96 GB of DDR4 RAM at 2133 MHz per node.
These machines (as well as Myria's general-purpose Broadwell nodes) are interconnected by a high-speed (100 Gbit/s), low-latency Omni-Path network. The Omni-Path controller is integrated into the KNL chip.
These ten KNL nodes are configured (statically) in quadrant clustering mode and in cache memory mode for the embedded 16 GB MCDRAM.
This configuration can be changed on multiple nodes (e.g., to test flat mode for MCDRAM) upon request to email@example.com.
Exploitation of the fast MCDRAM memory
In the case of the cache mode for MCDRAM, the operating system uses this memory as a 16 GB L3 cache.
In flat mode, the allocation of data in the MCDRAM can be specified in the following two ways: at runtime with numactl, or at the application program level.
When the amount of memory used by an application does not exceed 16 GB, the following command can be used to specify at runtime an allocation of all the data in the MCDRAM:
- in the case of the quadrant clustering mode:
$runcmd numactl --membind=1 ./a.out
- in the case of the SNC-2 clustering mode:
$runcmd numactl --membind=2,3 ./a.out
If an application uses more than 16 GB of memory on a KNL in quadrant clustering mode, the --preferred option of numactl can be used to allocate up to 16 GB of data preferentially in the MCDRAM (the rest being allocated in the server's DDR4 RAM):
- in the case of the quadrant clustering mode:
$runcmd numactl --preferred=1 ./a.out
The $runcmd variable above designates an srun command for running codes that use the MPI library (see the job_KNL.sl file in /soft/slurm/criann_modeles_scripts/ on Myria).
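The choice between the numactl options described above can be summarized as a small decision function. This is only an illustrative sketch: the helper `numactl_prefix`, its mode strings ("quadrant", "snc2") and its return convention are hypothetical, and it covers only the cases stated in this section (it reports failure for anything else, such as SNC-2 with more than 16 GB).

```c
#include <stdio.h>
#include <string.h>

/* MCDRAM capacity in GB, as stated above. */
#define MCDRAM_GB 16.0

/*
 * Hypothetical helper: writes into buf the numactl prefix suggested by
 * this section for a flat-mode KNL, given the clustering mode and the
 * application's memory footprint in GB.
 * Returns 0 on success, -1 if the case is not covered by the text.
 */
int numactl_prefix(const char *clustering, double footprint_gb,
                   char *buf, size_t len)
{
    if (strcmp(clustering, "quadrant") == 0) {
        if (footprint_gb <= MCDRAM_GB)
            snprintf(buf, len, "numactl --membind=1");   /* everything in MCDRAM */
        else
            snprintf(buf, len, "numactl --preferred=1"); /* MCDRAM first, then DDR4 */
        return 0;
    }
    if (strcmp(clustering, "snc2") == 0 && footprint_gb <= MCDRAM_GB) {
        snprintf(buf, len, "numactl --membind=2,3");     /* both MCDRAM NUMA nodes */
        return 0;
    }
    return -1; /* case not described in this section */
}
```

For example, a 40 GB application in quadrant mode would get `numactl --preferred=1`, which would then be placed between `$runcmd` and the executable name.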
Alternatively, allocation in the MCDRAM can be requested for specific data at the program level of an application.
The memkind library provides fast memory allocation functions. It is used differently in FORTRAN and in C.
Case of FORTRAN
The compiler directive !DEC$ ATTRIBUTES FASTMEM :: VAR can be placed before an allocation statement to force the allocation of VAR in the MCDRAM.
For example:
Case of C
Fast memory allocation functions are available.
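As an illustration, memkind's `hbwmalloc.h` interface provides `hbw_malloc` and `hbw_free` as counterparts of `malloc` and `free`. The sketch below is a minimal example, not a reference implementation: the `USE_MEMKIND` macro, the `FAST_MALLOC`/`FAST_FREE` wrappers and the function `sum_in_fast_memory` are hypothetical names introduced here so that the same source also builds on a machine without memkind.

```c
#include <stdlib.h>

#ifdef USE_MEMKIND              /* assumption: set by the user, e.g. -DUSE_MEMKIND */
#include <hbwmalloc.h>          /* memkind high-bandwidth memory interface */
#define FAST_MALLOC(n) hbw_malloc(n)
#define FAST_FREE(p)   hbw_free(p)
#else                           /* fallback: ordinary heap allocation */
#define FAST_MALLOC(n) malloc(n)
#define FAST_FREE(p)   free(p)
#endif

/*
 * Allocates n doubles (in fast memory when built with USE_MEMKIND),
 * fills them with 1.0 and returns their sum.
 * Returns -1.0 if the allocation fails.
 */
double sum_in_fast_memory(size_t n)
{
    double *a = FAST_MALLOC(n * sizeof *a);
    if (a == NULL)
        return -1.0;

    for (size_t i = 0; i < n; i++)
        a[i] = 1.0;

    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];

    FAST_FREE(a);               /* memory from hbw_malloc must go to hbw_free */
    return sum;
}
```

Only the arrays that dominate memory bandwidth need to go through the fast allocator; the remaining data can stay in ordinary `malloc` allocations in DDR4.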
An application using the memkind directives (FORTRAN) or functions (C) described above can be compiled on Myria with the Intel compiler: pass -lmemkind as a linker option or run module load memkind before compiling.
The optimization option for the KNL architecture is
The following sets of optimization options can be used:
- for an executable optimized for KNL (but not executable on a general purpose Broadwell processor):
- for a KNL and Broadwell optimized executable:
On Myria, the /soft/slurm/criann_modeles_scripts directory includes the generic template job_KNL.sl to submit a job on KNL with Slurm.
The job_KNL_quad_cache.sl template corresponds to the default configuration: quadrant mode (clustering) and cache mode (MCDRAM).
The knl partition must be specified, and the --constraint directive selects the clustering and MCDRAM modes. For example, for quadrant and cache modes:
or for quadrant and flat modes (configuration to be requested from firstname.lastname@example.org):
The PRACE project has published a best practice guide for performance on KNL: https://prace-ri.eu/training-support/best-practice-guides/best-practice-guide-knights-landing/