In this tutorial we will write and compile a simple SIMD kernel to become familiar with the basics of NSIMD. We will also see different aspects of SIMD programming:
aligned vs. unaligned data access
basic SIMD arithmetic
SIMD loops
SIMD branching
architecture selection at runtime
SIMD programming means using the CPU SIMD registers to performs operations on several data at once. A SIMD vector should be viewed as a set of bits which are interpreted by the operators that operate on them. Taking a 128-bits wide SIMD register, it can be interpreted as:
16 signed/unsigned chars
8 signed/unsigned shorts
4 signed/unsigned ints
4 floats
2 signed/unsigned longs
2 doubles
as shown in the picture below.
We will explain the rewriting of the following kernel which uppercases ASCII letters only.
template <typename T>
void uppercase_scalar(T *dst, const T *src, int n) {
for (int i = 0; i < n; i++) {
if (src[i] >= 'a' && src[i] <= 'z') {
dst[i] = src[i] + ('A' - 'a');
} else {
dst[i] = src[i];
}
}
}
Here is the corresponding SIMD version. Explanations to follow.
template <typename T>
void uppercase_simd(T *dst, const T *src, int n) {
using namespace nsimd;
typedef pack<T> p_t;
typedef packl<T> pl_t;
int l = len<p_t>();
int i;
for (i = 0; i + l <= n; i += l) {
p_t text = loadu<p_t>(src + i);
pl_t mask = text >= 'a' && text <= 'z';
p_t then_pack = text + ('A' - 'a');
p_t TEXT = if_else(mask, then_pack, text);
storeu(dst + i, TEXT);
}
pl_t mask = mask_for_loop_tail<pl_t>(i, n);
p_t text = maskz_loadu(mask, src + i);
p_t TEXT = if_else(text >= 'a' && text <= 'z', text + ('A' - 'a'), text);
mask_storeu(mask, dst + i, TEXT);
}
All APIs of NSIMD core is available with this include:
#include <nsimd/nsimd-all.hpp>
For ease of programming with use the NSIMD namespace inside the
uppercase_simd
function.
using namespace nsimd;
A nsimd::pack<T>
can be considered analogous to a SIMD register (on your or
any other machine). Operations performed on packs - from elementary operations
such as addition to complicated functions such as nsimd::rsqrt11(x)
- will be
performed using SIMD registers and operations if supported by your hardware. As
shown below, data must be manually loaded into and stored from these registers.
Again, for ease of programming we typedef a pack of T's.
typedef pack<T> p_t;
NSIMD provides another type of pack called nsimd::packl
which handles vectors
of booleans.
typedef packl<T> pl_t;
This distinction between pack's and packl's is necessary ffor two reasons:
On recent hardware, SIMD vectors of booleans are handled by dedicated registers.
Pack and Packl must have different semantics as arithmetic operators on booleans have no sense as well as logical operators on Pack's.
One way to construct a nsimd::pack<T>
is to simply declare
(default-construct) it. Such a pack may not be zero-initialized and thus may
contain arbitrary values.
Another way to construct a nsimd::pack<T>
is to fill it with a single value.
This so-called splatting constructor takes one scalar value and replicates it
in all elements of the pack.
But most common usage to construct a nsimd::pack<T>
is by using the copy
constructor from loading functions.
p_t text = loadu<p_t>(src + i);
Alignement of a given pointer ptr
to memory to some value A
means that
ptr % A == 0
. On older hardware loading data from unaligned memory can
result in performance penalty. On recent hardware it is hard to exhibit a
difference. NSIMD provides two versions of "load":
loada
for loading data from aligned memory
loadu
for loading data from unaligned momery
Note that using loada
on unaligned pointer may result in segfaults. As
recent hardware have good support for unaligned memory we use loadu
.
p_t text = loadu<p_t>(src + i);
To ensure that data allocated by std::vector
is aligned, NSIMD provide
a C++ allocator.
std::vector<T, nsimd::allocator<T> > data;
When loading data from memory you must ensure that there is sufficient data in
the block of memory you load from to fill a nsimd::pack<T>
. For example, on
an AVX
capable machine, a SIMD vector of float
(32 bits) contains 8
elements. Therefore, there must be at least 8 floats in the memory block you
load data from otherwise loading may result in segfaults. More on this below.
Once initialized, nsimd::pack<T>
instances can be used to perform arithmetic.
Usual operations are provided by NSIMD such:
addition
substraction
multiplication
division
square root
bitwise and/or/xor
...
pl_t mask = text >= 'a' && text <= 'z';
p_t then_pack = text + ('A' - 'a');
C++ operators are also overloaded for pack's and packl's as well as between pack's and scalars or packl's and booleans.
NSIMD provide the if_else
operator which fill the output, lane by lane,
according to the lane value of its first argument:
if it is true, the output lane will be filled with the second argument's lane
if it is false, the output lane will be filled with the third argument's lane
Therefore the branching:
if (src[i] >= 'a' && src[i] <= 'z') {
dst[i] = src[i] + ('A' - 'a');
} else {
dst[i] = src[i];
}
will be rewritten as
pl_t mask = text >= 'a' && text <= 'z';
p_t then_pack = text + ('A' - 'a');
p_t TEXT = if_else(mask, then_pack, text);
or as a one liner
p_t TEXT = if_else(text >= 'a' && text <= 'z', text + ('A' - 'a'), text);
A SIMD loop is similar to its scalar counterpart except that instead of going through data one element at a time it goes 4 by 4 or 8 by 8 elements at a time. More precisely SIMD loops generally goes from steps equal to pack's length. Therefore the scalar loop
for (int i = 0; i < n; i++) {
is rewritten as
int l = len<p_t>();
int i;
for (i = 0; i + l <= n; i += l) {
Note that going step by step will only cover most of the data except maybe the
tail of data in case that the number of elements is not a multiple of the
Pack's length. Therefore to perform computations on the tail one has to
load data from only n
elements where n < len<p_t>()
. One can use
maskz_loadu
which will load data only on lanes that are marked as true by
another argument to the function.
p_t text = maskz_loadu(mask, src + i);
The mask can be computed manually but NSIMD provides a function for it.
pl_t mask = mask_for_loop_tail<pl_t>(i, n);
Then the computation on the tail is exactly the same as within the loop. Put together it gives for the tail:
pl_t mask = mask_for_loop_tail<pl_t>(i, n);
p_t text = maskz_loadu(mask, src + i);
p_t TEXT = if_else(text >= 'a' && text <= 'z', text + ('A' - 'a'), text);
mask_storeu(mask, dst + i, TEXT);
Then the entire loop reads as follows.
int i;
for (i = 0; i + l <= n; i += l) {
p_t text = loadu<p_t>(src + i);
pl_t mask = text >= 'a' && text <= 'z';
p_t then_pack = text + ('A' - 'a');
p_t TEXT = if_else(mask, then_pack, text);
storeu(dst + i, TEXT);
}
pl_t mask = mask_for_loop_tail<pl_t>(i, n);
p_t text = maskz_loadu(mask, src + i);
p_t TEXT = if_else(text >= 'a' && text <= 'z', text + ('A' - 'a'), text);
mask_storeu(mask, dst + i, TEXT);
Here is the complete listing of the code.
#include <nsimd/nsimd-all.hpp>
#include <string>
#include <vector>
#include <iostream>
template <typename T>
void uppercase_scalar(T *dst, const T *src, int n) {
for (int i = 0; i < n; i++) {
if (src[i] >= 'a' && src[i] <= 'z') {
dst[i] = src[i] + ('A' - 'a');
} else {
dst[i] = src[i];
}
}
}
template <typename T>
void uppercase_simd(T *dst, const T *src, int n) {
using namespace nsimd;
typedef pack<T> p_t;
typedef packl<T> pl_t;
int l = len<p_t>();
int i;
for (i = 0; i + l <= n; i += l) {
p_t text = loadu<p_t>(src + i);
pl_t mask = text >= 'a' && text <= 'z';
p_t then_pack = text + ('A' - 'a');
p_t TEXT = if_else(mask, then_pack, text);
storeu(dst + i, TEXT);
}
pl_t mask = mask_for_loop_tail<pl_t>(i, n);
p_t text = maskz_loadu(mask, src + i);
p_t TEXT = if_else(text >= 'a' && text <= 'z', text + ('A' - 'a'), text);
mask_storeu(mask, dst + i, TEXT);
}
int main(int argc, char **argv) {
std::string input;
for (int i = 1; i < argc; i++) {
input += std::string(argv[i]);
if (i < argc - 1) {
input += std::string(" ");
}
}
std::cout << "Orignal text : " << input << std::endl;
std::vector<i8> dst_scalar(input.size() + 1);
uppercase_scalar(&dst_scalar[0], (i8 *)input.c_str(), (int)input.size());
std::cout << "Scalar uppercase text: " << &dst_scalar[0] << std::endl;
std::vector<i8> dst_simd(input.size() + 1);
uppercase_simd(&dst_simd[0], (i8 *)input.c_str(), (int)input.size());
std::cout << "NSIMD uppercase text : " << &dst_simd[0] << std::endl;
return 0;
}
The compilation of a program using nsimd
is like any other library.
c++ -O3 -DAVX2 -mavx2 -L/path/to/lib -lnsimd_avx2 -I/path/to/include tutorial.cpp
When compiling with NSIMD, you have to decide at compile time the targeted
SIMD extensions, AVX2 in the example above. It is therefore necessary to
give -mavx2
to the compiler for it to emit AVX2 instructions. To tell NSIMD
that AVX2 has to be used the -DAVX2
has to be passed to the compiler. For
an exhaustive list of defines controlling compilation see defines.md. There
is a .so file for each SIMD extension, it is therefore necessary to link
against the proper .so file.
It is sometimes necessary to have several versions of a given algorithm for
different SIMD extensions. This is rather to do with NSIMD. Basically the
idea is to write the algorithm in a generic manner using pack's as shown above.
It is then sufficient to compile the same soure file for different SIMD
extensions and then link the resulting object files altogether. Suppose that
a file named uppercase.cpp
contains the following code:
template <typename T>
void uppercase_simd(T *dst, const T *src, int n) {
using namespace nsimd;
typedef pack<T> p_t;
typedef packl<T> pl_t;
int l = len<p_t>();
int i;
for (i = 0; i + l <= n; i += l) {
p_t text = loadu<p_t>(src + i);
pl_t mask = text >= 'a' && text <= 'z';
p_t then_pack = text + ('A' - 'a');
p_t TEXT = if_else(mask, then_pack, text);
storeu(dst + i, TEXT);
}
pl_t mask = mask_for_loop_tail<pl_t>(i, n);
p_t text = maskz_loadu(mask, src + i);
p_t TEXT = if_else(text >= 'a' && text <= 'z', text + ('A' - 'a'), text);
mask_storeu(mask, dst + i, TEXT);
}
This would give the following in a Makefile.
all: uppercase
uppercase_sse2.o: uppercase.cpp
c++ -O3 -DSSE2 -msse2 -c $? -o $@
uppercase_sse42.o: uppercase.cpp
c++ -O3 -DSSE42 -msse4.2 -c $? -o $@
uppercase_avx.o: uppercase.cpp
c++ -O3 -DAVX -mavx -c $? -o $@
uppercase_avx2.o: uppercase.cpp
c++ -O3 -DAVX2 -mavx2 -c $? -o $@
uppercase: uppercase_sse2.o \
uppercase_sse42.o \
uppercase_avx.o \
uppercase_avx2.o
main.cpp
c++ $? -lnsimd_avx2 -o $@
Note that libnsimd_avx2
contains all the functions for SSE 2, SSE 4.2, AVX
and AVX2. This is a consequence of the retrocompatiblity of Intel SIMD
extensions. The situation is the same on ARM where libnsimd_sve.so
will
contain functions for AARCH64.
There is a small caveat. The symbol name corresponding to the uppercase_simd
function will be same for all the object files which will result in error
when linking together all objects. To avoid this situation one can use
function overloading as follows:
template <typename T>
void uppercase_simd(NSIMD_SIMD, T *dst, const T *src, int n) {
// ...
}
The macro NSIMD_SIMD
will be expanded to a type containing the information on
the SIMD extension currently requested by the user. This techniques is called
tag dispatching and does not require any modification of the algorithm
inside the function. Finally in main
one has to do dispatching by using
either cpuid
of by another mean.
int main() {
// what follows is pseudo-code
switch(cpuid()) {
case cpuid_sse2:
uppercase(nsimd::sse2, dst, src, n);
break;
case cpuid_sse42:
uppercase(nsimd::sse42, dst, src, n);
break;
case cpuid_avx:
uppercase(nsimd::avx, dst, src, n);
break;
case cpuid_avx2:
uppercase(nsimd::avx2, dst, src, n);
break;
}
return 0;
}