The KNL is a processor with a large number of cores (64 to 72) running at a moderate frequency (1.3 to 1.5 GHz). An SMT (Simultaneous Multithreading) mode lets each core run up to four hardware threads if the user wishes.

This architecture suits parallel applications (OpenMP or multithreaded, MPI, or MPI+OpenMP in particular) that scale to at least a few dozen processes or threads. Hybrid MPI+OpenMP can improve performance compared with a pure MPI approach.


The KNL processor has an on-board fast memory area (MCDRAM) with a size of 16 GB.

A KNL server can be configured in three different modes for the use of this memory by an application.

  • cache mode

    In this mode, the MCDRAM memory is used by the system as a large (16 GB) third-level cache ("L3 cache"): data from the DDR4 RAM of a KNL server are cached transparently in this fast memory area close to the processor.

  • flat mode

    In this mode, the MCDRAM memory is an addressable memory space in which an application can allocate all or part of its data (the rest remaining allocated in the server RAM). These allocations can be specified in the program or through runtime commands (see the section "Exploitation of the fast MCDRAM memory").

  • hybrid mode

    In hybrid mode, a quarter or half (4 or 8 GB) of the MCDRAM is used as L3 cache; the rest (12 or 8 GB, respectively) is addressable memory space.


Clustering modes

The main clustering modes, i.e. the ways in which the cores within a KNL are interconnected, are the following. NUMA (Non-Uniform Memory Access), mentioned below, refers to an architecture in which the cores access some areas of memory faster than others.

  • Quadrant mode

    In quadrant mode, the KNL cores are organized into four groups, each located near a memory controller. A KNL in quadrant mode is seen by the system as a single NUMA entity.

  • SNC modes

    In SNC (Sub-NUMA Clustering) modes, the KNL is subdivided into two (SNC-2 mode) or four (SNC-4 mode) NUMA entities.

Configuration on Myria

Myria includes ten KNL compute nodes, each with a KNL model 7210, i.e. 64 cores (4 threads per core) at a CPU frequency of 1.3 GHz, with 96 GB of DDR4 RAM at 2133 MHz per node.
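As a rough illustration of how this node geometry translates into a hybrid MPI+OpenMP layout, the arithmetic can be sketched in shell. Only the 64 cores and 4 threads per core come from the text above; the number of MPI ranks per node and the SMT level are arbitrary assumptions chosen for the example, not site recommendations.

```shell
#!/bin/sh
# Node geometry taken from the text: KNL 7210 on Myria.
cores_per_node=64
max_threads_per_core=4

# Illustrative choices (assumptions, to be tuned per application).
ranks_per_node=16
smt_level=2

# Cores given to each MPI rank, then OpenMP threads per rank.
cores_per_rank=$((cores_per_node / ranks_per_node))
threads_per_rank=$((cores_per_rank * smt_level))
logical_cpus_used=$((ranks_per_node * threads_per_rank))
logical_cpus_total=$((cores_per_node * max_threads_per_core))

echo "cores per MPI rank:      $cores_per_rank"
echo "OpenMP threads per rank: $threads_per_rank"
echo "logical CPUs in use:     $logical_cpus_used of $logical_cpus_total"
```

With these values, each rank owns 4 cores and runs 8 OpenMP threads, so 128 of the 256 logical CPUs of a node are used.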

These machines (as well as Myria's general purpose Broadwell processor machines) are interconnected by a high-speed (100 Gbps), low latency Omni-Path network. The Omni-Path controller is integrated into the KNL chip.

These ten KNL nodes are configured (statically) in quadrant clustering mode and in cache memory mode for the embedded 16 GB MCDRAM.

This configuration can be changed on multiple nodes (e.g., to test flat mode for MCDRAM) upon request to

Exploitation of the fast MCDRAM memory

In the case of the cache mode for MCDRAM, the operating system uses this memory as a 16 GB L3 cache.

In the case of flat mode for MCDRAM memory, the allocation of data in this memory can be specified in the following two ways (at runtime with numactl or at the application program level).

numactl command

When the amount of memory used by an application does not exceed 16 GB, the following command can be used to specify at runtime an allocation of all the data in the MCDRAM:

  • in the case of the quadrant clustering mode: $runcmd numactl --membind=1 ./a.out
  • in the case of SNC-2 clustering mode: $runcmd numactl --membind=2,3 ./a.out

If an application uses more than 16 GB of memory on a KNL in quadrant clustering mode, the --preferred option of numactl can be used to allocate data preferentially in the MCDRAM, up to its 16 GB capacity (the rest being allocated in the server's DDR4 RAM):

  • in the case of the quadrant clustering mode: $runcmd numactl --preferred=1 ./a.out

The $runcmd variable above designates an mpirun or srun command for running codes that use the MPI library (see the file in /soft/slurm/criann_modeles_scripts/ on Myria).
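The selection rules above can be summarized in a small shell helper. This is only a sketch: the function name mcdram_numactl_opts and its arguments are hypothetical, and the NUMA node numbers are simply those quoted in this section for flat mode. It prints the numactl options and does not call numactl itself.

```shell
#!/bin/sh
# Hypothetical helper: print the numactl options for a KNL node in flat mode.
#   $1: clustering mode, "quad" or "snc2"
#   $2: memory footprint of the application in GB (integer)
mcdram_numactl_opts() {
    case "$1" in
        quad)
            if [ "$2" -le 16 ]; then
                # Everything fits in the 16 GB MCDRAM: bind to it.
                echo "--membind=1"
            else
                # Prefer MCDRAM, overflow goes to DDR4.
                echo "--preferred=1"
            fi
            ;;
        snc2)
            if [ "$2" -le 16 ]; then
                echo "--membind=2,3"
            else
                # This section lists no command for that case.
                echo "no option listed for SNC-2 above 16 GB" >&2
                return 1
            fi
            ;;
        *)
            echo "unknown clustering mode: $1" >&2
            return 1
            ;;
    esac
}

# Example: quadrant mode with a 40 GB footprint.
opts=$(mcdram_numactl_opts quad 40)
echo "numactl $opts ./a.out"
```

The printed command line would then be prefixed with $runcmd as in the examples above.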

memkind library

Alternatively, allocation in MCDRAM can be requested for specific data at the program level of an application.

The memkind library provides fast memory allocation functions. It is used differently in FORTRAN and in C.


Case of FORTRAN

The compiler directive !DEC$ ATTRIBUTES FASTMEM :: VAR can be placed before an allocation statement to force the allocation of VAR to be done in the MCDRAM.

For example:


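! Declaration assumed for this example

REAL, ALLOCATABLE :: A(:), B(:)

! FASTMEM attribute

!DEC$ ATTRIBUTES FASTMEM :: A

! A is allocated in MCDRAM

ALLOCATE (A(1:1024))

! B is allocated in DDR4

ALLOCATE (B(1:1024))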

Case of C

Fast memory allocation functions are available.

For example:

#include <hbwmalloc.h> /* hbwmalloc interface */

/* ... */

const int n = 1<<10;

double* A = (double*) hbw_malloc (sizeof(double)*n); /* Allocation to HBM (high bandwidth memory, MCDRAM) */

/* ... */

hbw_free (A); /* Deallocate with hbw_free */

Compilation with memkind

An application using the memkind directives (FORTRAN) or functions (C) described above can be compiled on Myria with the Intel compiler: pass -lmemkind as a linker option or run module load memkind before compiling.

Compiler options

The optimization option for the KNL architecture is -xMIC-AVX512.

The following sets of optimization options can be used:

  • for an executable optimized for KNL (but not executable on a general purpose Broadwell processor): -O3 -xMIC-AVX512
  • for a KNL and Broadwell optimized executable: -O3 -axCORE-AVX2,MIC-AVX512

Submitting jobs

On Myria, the /soft/slurm/criann_modeles_scripts directory includes the generic template for submitting a job on KNL with Slurm.

The template corresponds to the default configuration: quadrant mode (clustering) and cache mode (MCDRAM).

The knl partition must be specified, and the --constraint directive lets you specify the clustering and MCDRAM modes. For example, for quadrant and cache modes:

#SBATCH --constraint quad,cache

or for quadrant and flat modes (configuration to be requested from

#SBATCH --constraint quad,flat
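To show where these directives sit in a job script, here is a minimal sketch. It is not the site template from /soft/slurm/criann_modeles_scripts: only the knl partition and the --constraint values come from this page, while the node, task and CPU counts and the executable name ./a.out are placeholder assumptions. The launch line falls back to a dry-run message when srun is not available, so the script can be read and checked outside Slurm.

```shell
#!/bin/bash
#SBATCH --partition knl
#SBATCH --constraint quad,cache
# Placeholder resource requests (assumptions, adapt to your application).
#SBATCH --nodes 1
#SBATCH --ntasks-per-node 16
#SBATCH --cpus-per-task 4

# One OpenMP thread per CPU allocated to each task;
# the value 4 is only a fallback when the script runs outside Slurm.
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-4}"

# Hypothetical executable name; replace with your own binary.
exe=./a.out

if command -v srun >/dev/null 2>&1; then
    srun "$exe"
else
    # Not in a Slurm environment: only show what would be launched.
    echo "would run: srun $exe (OMP_NUM_THREADS=$OMP_NUM_THREADS)"
fi
```

Switching the constraint line to quad,flat would select flat mode on nodes where that configuration has been set up.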

Good practices

The PRACE project has published a best practice guide for performance on KNL:

Last update: November 25, 2022 14:05:21