
Introduction

Single instruction, multiple data (SIMD) instructions, also known as multimedia extensions, have been available for many years. They are designed to significantly accelerate code execution; however, they require expertise to use correctly, depend on non-uniform compiler support, and rely on low-level intrinsics or vendor-specific libraries.

nsimd is a library which aims to simplify the error-prone process of developing applications that exploit the potential of SIMD instruction sets. nsimd is designed to seamlessly integrate into existing projects so that you can quickly and easily start developing high-performance, portable and future-proof software.

Why use nsimd?

nsimd standardizes and simplifies the use of SIMD instructions across hardware by not relying on verbose, low-level SIMD intrinsics. Furthermore, the portability of nsimd eliminates the need to rewrite cumbersome code for each revision of each target architecture, accounting for each architecture's vendor-provided API as well as architecture-dependent implementation details. This greatly reduces the design complexity and maintenance of SIMD code, significantly decreasing the time required to develop, test and deploy software as well as decreasing the scope for introducing bugs.

nsimd allows you to focus on the important part of your work: the development of new features and functionality. We take care of all of the architecture and compiler specific details and we provide updates when new architectures are released by manufacturers. All you have to do is re-compile your code every time you target a new architecture.

Inside nsimd

nsimd is a vectorization library that abstracts SIMD programming. It was designed to exploit the maximum power of processors at a low development cost.

To achieve maximum performance, nsimd mainly relies on the inline optimization pass of the compiler. Therefore, using nsimd with any mainstream compiler such as GCC, Clang, MSVC, XL C/C++ or ICC gives you a zero-cost SIMD abstraction library.

To allow inlining, a lot of code is placed in header files. Small functions such as addition, multiplication, square root, etc., are all implemented in header files, whereas bigger functions such as I/O are placed in source files that are compiled into a .so/.dll library.

nsimd provides C89, C++98, C++11 and C++14 APIs. All APIs allow writing generic code. For the C API this is achieved through a thin layer of macros; for the C++ APIs it is achieved using templates and function overloading. The C++ API is split in two parts. The first is a C-like API with only function calls and direct type definitions for SIMD types, while the second provides operator overloading and higher-level type definitions that allow unrolling. The C++11 and C++14 APIs add, for instance, templated type definitions and templated constants.

Binary compatibility is guaranteed by the fact that only a C ABI is exposed. The C++ API only wraps the C calls.

nsimd Philosophy

The library aims to provide a portable, zero-cost abstraction over SIMD vendor intrinsics, regardless of the underlying SIMD vector length.

nsimd was designed to follow these guidelines as closely as possible:

You may wrap intrinsics that require compile-time knowledge of the underlying vector length, but this should be done with caution.

Wrapping intrinsics that do not exist for all types is difficult and may require casting or emulation. For instance, 8-bit integer vector multiplication does not exist in SSE2. We can either process each pair of integers individually or cast the 8-bit vectors to 16-bit vectors, do the multiplication and cast the result back to an 8-bit vector. In the second case, chaining operations will generate many unwanted casts.
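
As a sketch of that second approach (plain SSE2 intrinsics here, not nsimd itself; the helper name is ours), the sixteen 8-bit products can be computed by widening each half to 16 bits and narrowing back:

#include <emmintrin.h>

// Sketch: emulate 8-bit integer vector multiplication on SSE2 by widening
// to 16 bits. The name mul_i8_sse2 is illustrative, not part of nsimd.
static __m128i mul_i8_sse2(__m128i a, __m128i b) {
  // Sign-extend the low and high halves to 16-bit lanes.
  __m128i a_lo = _mm_srai_epi16(_mm_unpacklo_epi8(a, a), 8);
  __m128i a_hi = _mm_srai_epi16(_mm_unpackhi_epi8(a, a), 8);
  __m128i b_lo = _mm_srai_epi16(_mm_unpacklo_epi8(b, b), 8);
  __m128i b_hi = _mm_srai_epi16(_mm_unpackhi_epi8(b, b), 8);
  // 16-bit multiplication does exist in SSE2.
  __m128i r_lo = _mm_mullo_epi16(a_lo, b_lo);
  __m128i r_hi = _mm_mullo_epi16(a_hi, b_hi);
  // Keep the low byte of each 16-bit product and pack back to 8 bits.
  __m128i mask = _mm_set1_epi16(0x00ff);
  return _mm_packus_epi16(_mm_and_si128(r_lo, mask),
                          _mm_and_si128(r_hi, mask));
}

Chaining several 8-bit operations written this way quickly accumulates such widen/narrow pairs, which is exactly the cost mentioned above.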

To avoid hiding important details from the user, overloads of operators involving scalars and SIMD vectors are not provided by default. They can be included explicitly to emphasize the fact that an expression like scalar + vector might incur an optimization penalty.
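
For instance, with the advanced C++ API used further down, the broadcast of the scalar can be written out explicitly. The following is only a sketch: it assumes nsimd::pack<f32> can be constructed from a scalar (check the API reference) and reuses the in0 and i of the examples below.

// Sketch only: make the broadcast of the scalar visible instead of relying
// on an implicit scalar + vector overload. Assumes nsimd::pack<f32> has a
// broadcasting constructor; in0 and i are as in the examples below.
nsimd::pack<f32> v = nsimd::loada<nsimd::pack<f32> >(&in0[i]);
nsimd::pack<f32> two = nsimd::pack<f32>(2.0f);  // explicit splat, done once
nsimd::pack<f32> r = v + two;                   // rather than 2.0f + v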

The use of nsimd::pack may not be portable to ARM SVE and therefore must be included manually. ARM SVE registers can only be stored in sizeless structs (__sizeless_struct). This feature (as of 2019/04/05) is only supported by the ARM compiler. We do not know whether other compilers will use the same keyword or paradigm to support SVE intrinsics.
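
For reference, this is roughly what that keyword looks like. The sketch below is an assumption based on the 2019-era Arm compiler SVE support mentioned above and the svfloat32_t type from the SVE ACLE; it does not compile elsewhere.

#include <arm_sve.h>

// Sketch, Arm compiler only (as of the date above): SVE vectors are sizeless
// types, so a struct holding one must itself be sizeless.
__sizeless_struct pack_f32 {
  svfloat32_t v;  // sizeof() cannot be applied; no static or heap storage
};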

A Short Example Using nsimd

Let's take a simple case where we calculate the sum of two vectors of 32-bit floats:

for (size_t i = 0; i < N; ++i) {
  out[i] = in0[i] + in1[i];
}

Each element of the result vector is independent of every other element; therefore this function is easy to vectorize, since there is latent data parallelism that can be exploited. This simple loop may be vectorized for an x86 processor using Intel intrinsic functions. For example, the following code vectorizes this loop for an SSE-enabled processor:

size_t len_sse = 4;  // an SSE register holds 4 single-precision floats
for (size_t i = 0; i < N; i += len_sse) {
  __m128 v0_sse = _mm_load_ps(&in0[i]);  // aligned loads
  __m128 v1_sse = _mm_load_ps(&in1[i]);
  __m128 r_sse = _mm_add_ps(v0_sse, v1_sse);
  _mm_store_ps(&out[i], r_sse);          // aligned store
}

Looks difficult? How about we vectorize it for the next generation of Intel processors, equipped with AVX instructions:

size_t len_avx = 8;  // an AVX register holds 8 single-precision floats
for (size_t i = 0; i < N; i += len_avx) {
  __m256 v0_avx = _mm256_load_ps(&in0[i]);  // aligned loads
  __m256 v1_avx = _mm256_load_ps(&in1[i]);
  __m256 r_avx = _mm256_add_ps(v0_avx, v1_avx);
  _mm256_store_ps(&out[i], r_avx);          // aligned store
}

Both of these processors are manufactured by Intel yet two different versions of the code are required to get the best performance possible from each processor. This is quickly getting complicated and annoying.

Now, look at how the code can become simpler with nsimd.

nsimd C++11 version without the advanced API:

size_t len = size_t(nsimd::len(f32()));
for (size_t i = 0; i < N; i += len) {
  // auto is nsimd::simd_vector<f32>
  auto v0 = nsimd::loada(&in0[i], f32());
  auto v1 = nsimd::loada(&in1[i], f32());
  auto r = nsimd::add(v0, v1, f32());
  nsimd::storea(&out[i], r, f32());
}

nsimd C++11 version using the advanced API (not recommended for portability with ARM SVE):

size_t len = size_t(nsimd::len(f32()));
for (size_t i = 0; i < N; i += len) {
  // auto is nsimd::pack<f32>
  auto v0 = nsimd::loada<nsimd::pack<f32> >(&in0[i]);
  auto v1 = nsimd::loada<nsimd::pack<f32> >(&in1[i]);
  auto r = v0 + v1;
  nsimd::storea(&out[i], r);
}

nsimd C++98 version without the advanced API:

size_t len = size_t(nsimd::len(f32()));
typedef nsimd::simd_traits<f32, nsimd::NSIMD_SIMD>::simd_vector vec_t;
for (size_t i = 0; i < N; i += len) {
  vec_t v0 = nsimd::loada(&in0[i], f32());
  vec_t v1 = nsimd::loada(&in1[i], f32());
  vec_t r = nsimd::add(v0, v1, f32());
  nsimd::storea(&out[i], r, f32());
}

nsimd C++98 version using the advanced API (not recommended for portability with ARM SVE):

size_t len = size_t(nsimd::len(f32()));
for (size_t i = 0; i < N; i += len) {
  nsimd::pack<f32> v0 = nsimd::loada<nsimd::pack<f32> >(&in0[i]);
  nsimd::pack<f32> v1 = nsimd::loada<nsimd::pack<f32> >(&in1[i]);
  nsimd::pack<f32> r = v0 + v1;
  nsimd::storea(&out[i], r);
}

nsimd C (C89, C99, C11) version:

size_t len = (size_t)vlen(f32);
size_t i;
for (i = 0; i < N; i += len) {
  vec(f32) v0 = vloada(&in0[i], f32);
  vec(f32) v1 = vloada(&in1[i], f32);
  vec(f32) r = vadd(v0, v1, f32);
  vstorea(&out[i], r, f32);
}
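
All of the loops above, intrinsics and nsimd alike, assume that N is a multiple of the SIMD length. When it is not, a plain scalar epilogue can handle the remaining elements. The sketch below reuses the C++11 base API and the in0, in1, out and N of the previous examples:

size_t len = size_t(nsimd::len(f32()));
size_t i = 0;
// Vector body, as in the examples above, while a full SIMD width remains.
for (; i + len <= N; i += len) {
  auto v0 = nsimd::loada(&in0[i], f32());
  auto v1 = nsimd::loada(&in1[i], f32());
  nsimd::storea(&out[i], nsimd::add(v0, v1, f32()), f32());
}
// Scalar tail for the last N % len elements.
for (; i < N; ++i) {
  out[i] = in0[i] + in1[i];
}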


Compilers and Hardware Supported by nsimd

nsimd includes support for some Intel and ARM processors. Support for IBM processors is ongoing and will be available soon.

Architecture Extensions
Intel SSE2, SSE4.2, AVX, AVX2, AVX-512 (KNL and SKYLAKE)
ARM Aarch64, NEON (ARMv7), SVE

nsimd is tested with GCC, Clang and MSVC. As a C89 and a C++98 API are provided, other compilers should work fine. Old compiler versions should work as long as they support the targeted SIMD extension. For instance, nsimd can compile SSE4.2 code with MSVC 2010.

nsimd requires a C or a C++ compiler and is currently tested daily with the following compilers on the following hardware:

Compiler Version Architecture Extensions
GCC 8.3.0 Intel SSE2, SSE4.2, AVX, AVX2, AVX-512 (KNL and SKYLAKE)
Clang 7.0.1 Intel SSE2, SSE4.2, AVX, AVX2, AVX-512 (KNL and SKYLAKE)
GCC 8.3.0 ARM Aarch64, NEON (ARMv7), SVE
Clang 7.0.1 ARM Aarch64, NEON (ARMv7), SVE
Microsoft Visual Studio 2017 Intel SSE4.2
Intel C++ Compiler 19.0.4.243 Intel SSE2, SSE4.2, AVX, AVX2, AVX-512 (SKYLAKE)

Contributing

The wrapping of intrinsics and the writing of test and benchmark files are tedious, repetitive tasks. Most of them are generated using Python scripts that can be found in the egg directory.

Please see contribute for more details.