NSIMD documentation

Overview

NSIMD scalar types

Their names follow the pattern Sxx, where

S is i for signed integers, u for unsigned integers or f for floating-point numbers.
xx is the number of bits used to represent the number.

Full list of scalar types:

f64
f32
f16
i64
i32
i16
i8
u64
u32
u16
u8
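The naming scheme maps directly onto fixed-width types. The aliases below are a rough illustration only and are not NSIMD's own definitions. The helper bits() is a hypothetical name, and f16 is left out because standard C++ has no portable 16-bit floating-point type before C++23.

```cpp
#include <cassert>
#include <climits>
#include <cstdint>

// Illustrative aliases only; NSIMD ships its own definitions of these names.
typedef std::int8_t i8;
typedef std::int16_t i16;
typedef std::int32_t i32;
typedef std::int64_t i64;
typedef std::uint8_t u8;
typedef std::uint16_t u16;
typedef std::uint32_t u32;
typedef std::uint64_t u64;
typedef float f32;   // assumes a 32-bit float, true on mainstream targets
typedef double f64;  // assumes a 64-bit double, true on mainstream targets

// Hypothetical helper: number of bits used to represent a value of type T.
template <typename T> constexpr int bits() {
  return static_cast<int>(sizeof(T) * CHAR_BIT);
}

// The "xx" part of each name is the bit width.
static_assert(bits<i8>() == 8 && bits<u8>() == 8, "8-bit integers");
static_assert(bits<i32>() == 32 && bits<f32>() == 32, "32-bit types");
static_assert(bits<i64>() == 64 && bits<f64>() == 64, "64-bit types");
```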

NSIMD generic SIMD vector types

In NSIMD, we call a platform an architecture, e.g. Intel, ARM, POWERPC. We call a SIMD extension a set of low-level functions and types provided by hardware vendors to access SIMD units. Examples include SSE2, SSE42, AVX, ... At compile time, each generic SIMD vector type represents a SIMD register of the target, for example __m128 for Intel SSE, __m512 for Intel AVX-512 or svfloat32_t for Arm SVE.

Their names follow these patterns:

C base API: vSCALAR where SCALAR is one of the scalar types listed above.
C advanced API: nsimd_pack_SCALAR where SCALAR is one of the scalar types listed above.
C++ advanced API: nsimd::pack<SCALAR> where SCALAR is one of the scalar types listed above.

Full list of SIMD vector types:

Base type	C base API	C advanced API	C++ advanced API
`f64`	`vf64`	`nsimd_pack_f64`	`nsimd::pack<f64>`
`f32`	`vf32`	`nsimd_pack_f32`	`nsimd::pack<f32>`
`f16`	`vf16`	`nsimd_pack_f16`	`nsimd::pack<f16>`
`i64`	`vi64`	`nsimd_pack_i64`	`nsimd::pack<i64>`
`i32`	`vi32`	`nsimd_pack_i32`	`nsimd::pack<i32>`
`i16`	`vi16`	`nsimd_pack_i16`	`nsimd::pack<i16>`
`i8`	`vi8`	`nsimd_pack_i8`	`nsimd::pack<i8>`
`u64`	`vu64`	`nsimd_pack_u64`	`nsimd::pack<u64>`
`u32`	`vu32`	`nsimd_pack_u32`	`nsimd::pack<u32>`
`u16`	`vu16`	`nsimd_pack_u16`	`nsimd::pack<u16>`
`u8`	`vu8`	`nsimd_pack_u8`	`nsimd::pack<u8>`

C/C++ base APIs

These come automatically when you include nsimd/nsimd.h. You do not need to include a separate header file to get a particular function. Here is a list of supported platforms and their corresponding SIMD extensions.

Platform arm
- neon128
- aarch64
- sve
- sve128
- sve256
- sve512
- sve1024
- sve2048
Platform x86
- sse2
- sse42
- avx
- avx2
- avx512_knl
- avx512_skylake
Platform ppc
- vmx
- vsx
Platform cpu
- cpu

Each SIMD extension has its own set of SIMD types and functions. Types follow the pattern nsimd_SIMDEXT_vSCALAR, where

SIMDEXT is the SIMD extension.
SCALAR is one of the scalar types listed above.

There are also logical types associated with each SIMD vector type. These types are used, for example, to represent the result of a comparison of SIMD vectors. They are usually bit masks. Their names follow the pattern nsimd_SIMDEXT_vlSCALAR, where

SIMDEXT is the SIMD extension.
SCALAR is one of the scalar types listed above.
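To make the "bit mask" remark concrete, here is a minimal scalar sketch of one common representation: each lane of a logical vector is either all ones (true) or all zeros (false), which lets a selection be written with bitwise operations only. The names vec4, mask4, eq and blend are hypothetical and are not NSIMD identifiers. Some extensions use other representations, such as dedicated mask registers.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical 4-lane vector of 32-bit integers and its logical counterpart.
struct vec4 { std::uint32_t lane[4]; };
struct mask4 { std::uint32_t lane[4]; };  // each lane: all ones or all zeros

// Lane-wise equality: a true lane gets every bit set.
inline mask4 eq(const vec4 &a, const vec4 &b) {
  mask4 r;
  for (int i = 0; i < 4; ++i) {
    r.lane[i] = (a.lane[i] == b.lane[i]) ? 0xFFFFFFFFu : 0u;
  }
  return r;
}

// Lane-wise selection: take a where the mask is true, b elsewhere.
inline vec4 blend(const mask4 &m, const vec4 &a, const vec4 &b) {
  vec4 r;
  for (int i = 0; i < 4; ++i) {
    r.lane[i] = (m.lane[i] & a.lane[i]) | (~m.lane[i] & b.lane[i]);
  }
  return r;
}
```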

Note 1: Platform cpu is a 128-bit SIMD emulation used as a fallback when no SIMD extension has been specified or none is supported on a given compilation target.

Note 2: since every SIMD extension name is unique across platforms, there is no need to put the name of the platform in each identifier.
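As an illustration of what a 128-bit emulation implies for Note 1, the sketch below derives the number of lanes from the scalar width: 16 bytes hold 2 f64, 4 f32 or 16 i8 elements. The names emu_vec and emu_add are hypothetical and say nothing about how NSIMD actually implements its cpu fallback.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical emulated register: 128 bits (16 bytes) of lanes of type T.
template <typename T> struct emu_vec {
  enum { len = 16 / sizeof(T) };  // lanes per emulated register
  T lane[len];
};

// Lane-wise addition written as a plain scalar loop.
template <typename T>
emu_vec<T> emu_add(const emu_vec<T> &a, const emu_vec<T> &b) {
  emu_vec<T> r;
  for (int i = 0; i < static_cast<int>(emu_vec<T>::len); ++i) {
    r.lane[i] = static_cast<T>(a.lane[i] + b.lane[i]);
  }
  return r;
}
```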

Function names follow the pattern: nsimd_FUNCNAME_SIMDEXT_SCALAR where

SIMDEXT is the SIMD extension.
FUNCNAME is the name of a function e.g. add or sub.
SCALAR is one of the scalar types listed above.

Generic identifier

In the base C API, genericity is achieved using macros.

vec(SCALAR) is a type to represent a SIMD vector containing SCALAR elements. SCALAR must be one of the scalar types listed above.
vecl(SCALAR) is a type to represent a SIMD vector of logicals for SCALAR elements. SCALAR must be one of the scalar types listed above.
vec_a(SCALAR, SIMDEXT) is a type to represent a SIMD vector containing SCALAR elements for the SIMD extension SIMDEXT. SCALAR must be one of the scalar types listed above and SIMDEXT must be a valid SIMD extension.
vecl_a(SCALAR, SIMDEXT) is a type to represent a SIMD vector of logicals for SCALAR elements for the SIMD extension SIMDEXT. SCALAR must be one of the scalar types listed above and SIMDEXT must be a valid SIMD extension.
vFUNCNAME takes as input the above types to access the operator FUNCNAME e.g. vadd, vsub.
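One plausible way to obtain this kind of macro-based genericity is preprocessor token pasting, sketched below. Everything prefixed with demo_ or DEMO_ is hypothetical, including the mock type, the mock function and the order of the parts in the function name. This is not NSIMD's actual header code.

```cpp
#include <cassert>

// Two-level concatenation so that macro arguments are expanded before pasting.
#define DEMO_CAT4_(a, b, c, d) a##b##c##d
#define DEMO_CAT4(a, b, c, d) DEMO_CAT4_(a, b, c, d)

// Hypothetical "detected" extension; a real build would derive it from
// compiler flags.
#define DEMO_SIMD cpu

// Mock extension-specific type and function (names are assumptions).
struct demo_cpu_vf32 { float v[4]; };
inline demo_cpu_vf32 demo_add_cpu_f32(demo_cpu_vf32 a, demo_cpu_vf32 b) {
  demo_cpu_vf32 r;
  for (int i = 0; i < 4; ++i) r.v[i] = a.v[i] + b.v[i];
  return r;
}

// Generic identifiers built by pasting the extension and scalar names.
#define demo_vec(T) DEMO_CAT4(demo_, DEMO_SIMD, _v, T)
#define demo_vec_a(T, simd_ext) DEMO_CAT4(demo_, simd_ext, _v, T)
#define demo_vadd(a, b, T) DEMO_CAT4(demo_add_, DEMO_SIMD, _, T)(a, b)
```

With these definitions, demo_vec(f32) and demo_vec_a(f32, cpu) both expand to demo_cpu_vf32, and demo_vadd(a, b, f32) expands to demo_add_cpu_f32(a, b).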

In C++98 and C++03, type traits are available.

nsimd::simd_traits<SCALAR, SIMDEXT>::vector is the SIMD vector type for SIMD extension SIMDEXT containing SCALAR elements. SIMDEXT is one of the SIMD extensions listed above, SCALAR is one of the scalar types listed above.
nsimd::simd_traits<SCALAR, SIMDEXT>::vectorl is the SIMD vector of logicals type for SIMD extension SIMDEXT containing SCALAR elements. SIMDEXT is one of the SIMD extensions listed above, SCALAR is one of the scalar types listed above.

In C++11 and beyond, type traits are still available but typedefs are also provided.

nsimd::vector<SCALAR, SIMDEXT> is a typedef to nsimd::simd_traits<SCALAR, SIMDEXT>::vector.
nsimd::vectorl<SCALAR, SIMDEXT> is a typedef to nsimd::simd_traits<SCALAR, SIMDEXT>::vectorl.
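The relationship between the traits class and the C++11 typedefs can be pictured with the following self-contained sketch. The namespace demo, the tags cpu and wide, and the mock register types are all hypothetical. Only the overall shape (a traits template specialized per scalar and extension, plus an alias template on top of it) mirrors the description above.

```cpp
#include <cassert>
#include <type_traits>

namespace demo {

// Hypothetical extension tags and mock register types.
struct cpu {};
struct wide {};
struct cpu_vf32 { float v[4]; };
struct wide_vf32 { float v[8]; };

// Primary template left undefined: unsupported combinations do not compile.
template <typename T, typename SimdExt> struct simd_traits;

// C++98-style traits: one specialization per (scalar, extension) pair.
template <> struct simd_traits<float, cpu> { typedef cpu_vf32 vector; };
template <> struct simd_traits<float, wide> { typedef wide_vf32 vector; };

// C++11 convenience alias on top of the traits.
template <typename T, typename SimdExt>
using vector = typename simd_traits<T, SimdExt>::vector;

// Number of T elements held by the selected register type.
template <typename T, typename SimdExt> constexpr int lanes() {
  return static_cast<int>(sizeof(vector<T, SimdExt>) / sizeof(T));
}

}  // namespace demo
```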

The C++20 API does not introduce different types for SIMD registers, nor any other way to access the SIMD types. It only provides concepts in place of the usual typenames. For more information, see concepts.md.

Note that all macros and functions available in plain C are still available in C++.

List of operators provided by the base APIs

In the documentation we use the terms "function" and "operator" interchangeably. For each operator FUNCNAME a C function (also available in C++) named nsimd_FUNCNAME_SIMDEXT_SCALAR is available for each SCALAR type unless specified otherwise.

For each FUNCNAME, a C macro (also available in C++) named vFUNCNAME is available and takes as its last argument a SCALAR type.

For each FUNCNAME, a C macro (also available in C++) named vFUNCNAME_a is available and takes as its last two arguments a SCALAR type and a SIMDEXT.

For each FUNCNAME, a C++ function in namespace nsimd named FUNCNAME is available. It takes the SCALAR type as its last argument and can optionally take the SIMDEXT as an additional final argument.

For example, for the addition of two SIMD vectors a and b here are the possibilities:

c = nsimd_add_avx_f32(a, b); // use AVX
c = nsimd::add(a, b, f32()); // use detected SIMDEXT
c = nsimd::add(a, b, f32(), avx()); // force AVX even if detected SIMDEXT is not AVX
c = vadd(a, b, f32); // use detected SIMDEXT
c = vadd_e(a, b, f32, avx); // force AVX even if detected SIMDEXT is not AVX
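The C++ calls above pass default-constructed values such as f32() and avx() purely to select an overload. A minimal sketch of that tag-dispatch technique follows. The namespace demo, the tags cpu and wide, the alias detected and the mock register types are assumptions made for illustration and do not reproduce NSIMD's implementation.

```cpp
#include <cassert>

namespace demo {

typedef float f32;     // scalar type; f32() is only used to pick an overload
struct cpu {};         // hypothetical extension tags
struct wide {};
typedef cpu detected;  // pretend that cpu is the detected extension

struct cpu_vf32 { float v[4]; };
struct wide_vf32 { float v[8]; };

// One overload per extension; the trailing tag arguments carry no data.
inline cpu_vf32 add(cpu_vf32 a, cpu_vf32 b, f32, cpu) {
  cpu_vf32 r;
  for (int i = 0; i < 4; ++i) r.v[i] = a.v[i] + b.v[i];
  return r;
}
inline wide_vf32 add(wide_vf32 a, wide_vf32 b, f32, wide) {
  wide_vf32 r;
  for (int i = 0; i < 8; ++i) r.v[i] = a.v[i] + b.v[i];
  return r;
}

// Without an extension tag, forward to the detected extension.
template <typename V> V add(V a, V b, f32 t) {
  return add(a, b, t, detected());
}

}  // namespace demo
```

Because the tags are empty types, a compiler can typically optimize them away, so this style of dispatch costs nothing at run time in optimized builds.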

Here is a list of available FUNCNAME.

int len();
vSCALAR set1(SCALAR a0);
vlSCALAR set1l(int a0);
vSCALAR loadu(SCALAR const* a0);
vSCALAR masko_loadu1(vlSCALAR a0, SCALAR const* a1, vSCALAR a2);
vSCALAR maskz_loadu1(vlSCALAR a0, SCALAR const* a1);
vSCALARx2 load2u(SCALAR const* a0);
vSCALARx3 load3u(SCALAR const* a0);
vSCALARx4 load4u(SCALAR const* a0);
vSCALAR loada(SCALAR const* a0);
vSCALAR masko_loada1(vlSCALAR a0, SCALAR const* a1, vSCALAR a2);
vSCALAR maskz_loada1(vlSCALAR a0, SCALAR const* a1);
vSCALARx2 load2a(SCALAR const* a0);
vSCALARx3 load3a(SCALAR const* a0);
vSCALARx4 load4a(SCALAR const* a0);
vlSCALAR loadlu(SCALAR const* a0);
vlSCALAR loadla(SCALAR const* a0);
void storeu(SCALAR* a0, vSCALAR a1);
void mask_storeu1(vlSCALAR a0, SCALAR* a1, vSCALAR a2);
void store2u(SCALAR* a0, vSCALAR a1, vSCALAR a2);
void store3u(SCALAR* a0, vSCALAR a1, vSCALAR a2, vSCALAR a3);
void store4u(SCALAR* a0, vSCALAR a1, vSCALAR a2, vSCALAR a3, vSCALAR a4);
void storea(SCALAR* a0, vSCALAR a1);
void mask_storea1(vlSCALAR a0, SCALAR* a1, vSCALAR a2);
void store2a(SCALAR* a0, vSCALAR a1, vSCALAR a2);
void store3a(SCALAR* a0, vSCALAR a1, vSCALAR a2, vSCALAR a3);
void store4a(SCALAR* a0, vSCALAR a1, vSCALAR a2, vSCALAR a3, vSCALAR a4);
vSCALAR gather(SCALAR const* a0, viSCALAR a1);
Only available for f64, f32, f16, i16, u16, u32, i32, i64, u64
vSCALAR gather_linear(SCALAR const* a0, int a1);
void scatter(SCALAR* a0, viSCALAR a1, vSCALAR a2);
Only available for f64, f32, f16, i16, u16, u32, i32, i64, u64
void scatter_linear(SCALAR* a0, int a1, vSCALAR a2);
void storelu(SCALAR* a0, vlSCALAR a1);
void storela(SCALAR* a0, vlSCALAR a1);
vSCALAR orb(vSCALAR a0, vSCALAR a1);
vSCALAR andb(vSCALAR a0, vSCALAR a1);
vSCALAR andnotb(vSCALAR a0, vSCALAR a1);
vSCALAR notb(vSCALAR a0);
vSCALAR xorb(vSCALAR a0, vSCALAR a1);
vlSCALAR orl(vlSCALAR a0, vlSCALAR a1);
vlSCALAR andl(vlSCALAR a0, vlSCALAR a1);
vlSCALAR andnotl(vlSCALAR a0, vlSCALAR a1);
vlSCALAR xorl(vlSCALAR a0, vlSCALAR a1);
vlSCALAR notl(vlSCALAR a0);
vSCALAR add(vSCALAR a0, vSCALAR a1);
vSCALAR sub(vSCALAR a0, vSCALAR a1);
SCALAR addv(vSCALAR a0);
Only available for f64, f32, f16
vSCALAR mul(vSCALAR a0, vSCALAR a1);
vSCALAR div(vSCALAR a0, vSCALAR a1);
vSCALAR neg(vSCALAR a0);
vSCALAR min(vSCALAR a0, vSCALAR a1);
vSCALAR max(vSCALAR a0, vSCALAR a1);
vSCALAR shr(vSCALAR a0, int a1);
Only available for i64, i32, i16, i8, u64, u32, u16, u8
vSCALAR shl(vSCALAR a0, int a1);
Only available for i64, i32, i16, i8, u64, u32, u16, u8
vSCALAR shra(vSCALAR a0, int a1);
Only available for i64, i32, i16, i8, u64, u32, u16, u8
vlSCALAR eq(vSCALAR a0, vSCALAR a1);
vlSCALAR ne(vSCALAR a0, vSCALAR a1);
vlSCALAR gt(vSCALAR a0, vSCALAR a1);
vlSCALAR ge(vSCALAR a0, vSCALAR a1);
vlSCALAR lt(vSCALAR a0, vSCALAR a1);
vlSCALAR le(vSCALAR a0, vSCALAR a1);
vSCALAR if_else1(vlSCALAR a0, vSCALAR a1, vSCALAR a2);
vSCALAR abs(vSCALAR a0);
vSCALAR fma(vSCALAR a0, vSCALAR a1, vSCALAR a2);
vSCALAR fnma(vSCALAR a0, vSCALAR a1, vSCALAR a2);
vSCALAR fms(vSCALAR a0, vSCALAR a1, vSCALAR a2);
vSCALAR fnms(vSCALAR a0, vSCALAR a1, vSCALAR a2);
vSCALAR ceil(vSCALAR a0);
vSCALAR floor(vSCALAR a0);
vSCALAR trunc(vSCALAR a0);
vSCALAR round_to_even(vSCALAR a0);
int all(vlSCALAR a0);
int any(vlSCALAR a0);
int nbtrue(vlSCALAR a0);
vSCALAR reinterpret(vSCALAR a0);
vlSCALAR reinterpretl(vlSCALAR a0);
vSCALAR cvt(vSCALAR a0);
vSCALARx2 upcvt(vSCALAR a0);
Only available for i8, u8, i16, u16, f16, i32, u32, f32
vSCALAR downcvt(vSCALAR a0, vSCALAR a1);
Only available for i16, u16, f16, i32, u32, f32, i64, u64, f64
vSCALAR rec(vSCALAR a0);
Only available for f64, f32, f16
vSCALAR rec11(vSCALAR a0);
Only available for f64, f32, f16
vSCALAR rec8(vSCALAR a0);
Only available for f64, f32, f16
vSCALAR sqrt(vSCALAR a0);
Only available for f64, f32, f16
vSCALAR rsqrt11(vSCALAR a0);
Only available for f64, f32, f16
vSCALAR rsqrt8(vSCALAR a0);
Only available for f64, f32, f16
vSCALAR ziplo(vSCALAR a0, vSCALAR a1);
vSCALAR ziphi(vSCALAR a0, vSCALAR a1);
vSCALAR unziplo(vSCALAR a0, vSCALAR a1);
vSCALAR unziphi(vSCALAR a0, vSCALAR a1);
vSCALARx2 zip(vSCALAR a0, vSCALAR a1);
vSCALARx2 unzip(vSCALAR a0, vSCALAR a1);
vSCALAR to_mask(vlSCALAR a0);
vlSCALAR to_logical(vSCALAR a0);
vSCALAR iota();
vlSCALAR mask_for_loop_tail(int a0, int a1);
vSCALAR adds(vSCALAR a0, vSCALAR a1);
vSCALAR subs(vSCALAR a0, vSCALAR a1);
vSCALAR sin_u35(vSCALAR a0);
Only available for f64, f32, f16
vSCALAR cos_u35(vSCALAR a0);
Only available for f64, f32, f16
vSCALAR tan_u35(vSCALAR a0);
Only available for f64, f32, f16
vSCALAR asin_u35(vSCALAR a0);
Only available for f64, f32, f16
vSCALAR acos_u35(vSCALAR a0);
Only available for f64, f32, f16
vSCALAR atan_u35(vSCALAR a0);
Only available for f64, f32, f16
vSCALAR atan2_u35(vSCALAR a0, vSCALAR a1);
Only available for f64, f32, f16
vSCALAR log_u35(vSCALAR a0);
Only available for f64, f32, f16
vSCALAR cbrt_u35(vSCALAR a0);
Only available for f64, f32, f16
vSCALAR sin_u10(vSCALAR a0);
Only available for f64, f32, f16
vSCALAR cos_u10(vSCALAR a0);
Only available for f64, f32, f16
vSCALAR tan_u10(vSCALAR a0);
Only available for f64, f32, f16
vSCALAR asin_u10(vSCALAR a0);
Only available for f64, f32, f16
vSCALAR acos_u10(vSCALAR a0);
Only available for f64, f32, f16
vSCALAR atan_u10(vSCALAR a0);
Only available for f64, f32, f16
vSCALAR atan2_u10(vSCALAR a0, vSCALAR a1);
Only available for f64, f32, f16
vSCALAR log_u10(vSCALAR a0);
Only available for f64, f32, f16
vSCALAR cbrt_u10(vSCALAR a0);
Only available for f64, f32, f16
vSCALAR exp_u10(vSCALAR a0);
Only available for f64, f32, f16
vSCALAR pow_u10(vSCALAR a0, vSCALAR a1);
Only available for f64, f32, f16
vSCALAR sinh_u10(vSCALAR a0);
Only available for f64, f32, f16
vSCALAR cosh_u10(vSCALAR a0);
Only available for f64, f32, f16
vSCALAR tanh_u10(vSCALAR a0);
Only available for f64, f32, f16
vSCALAR sinh_u35(vSCALAR a0);
Only available for f64, f32, f16
vSCALAR cosh_u35(vSCALAR a0);
Only available for f64, f32, f16
vSCALAR tanh_u35(vSCALAR a0);
Only available for f64, f32, f16
vSCALAR asinh_u10(vSCALAR a0);
Only available for f64, f32, f16
vSCALAR acosh_u10(vSCALAR a0);
Only available for f64, f32, f16
vSCALAR atanh_u10(vSCALAR a0);
Only available for f64, f32, f16
vSCALAR exp2_u10(vSCALAR a0);
Only available for f64, f32, f16
vSCALAR exp2_u35(vSCALAR a0);
Only available for f64, f32, f16
vSCALAR exp10_u10(vSCALAR a0);
Only available for f64, f32, f16
vSCALAR exp10_u35(vSCALAR a0);
Only available for f64, f32, f16
vSCALAR expm1_u10(vSCALAR a0);
Only available for f64, f32, f16
vSCALAR log10_u10(vSCALAR a0);
Only available for f64, f32, f16
vSCALAR log2_u10(vSCALAR a0);
Only available for f64, f32, f16
vSCALAR log2_u35(vSCALAR a0);
Only available for f64, f32, f16
vSCALAR log1p_u10(vSCALAR a0);
Only available for f64, f32, f16
vSCALAR sinpi_u05(vSCALAR a0);
Only available for f64, f32, f16
vSCALAR cospi_u05(vSCALAR a0);
Only available for f64, f32, f16
vSCALAR hypot_u05(vSCALAR a0, vSCALAR a1);
Only available for f64, f32, f16
vSCALAR hypot_u35(vSCALAR a0, vSCALAR a1);
Only available for f64, f32, f16
vSCALAR remainder(vSCALAR a0, vSCALAR a1);
Only available for f64, f32, f16
vSCALAR fmod(vSCALAR a0, vSCALAR a1);
Only available for f64, f32, f16
vSCALAR lgamma_u10(vSCALAR a0);
Only available for f64, f32, f16
vSCALAR tgamma_u10(vSCALAR a0);
Only available for f64, f32, f16
vSCALAR erf_u10(vSCALAR a0);
Only available for f64, f32, f16
vSCALAR erfc_u15(vSCALAR a0);
Only available for f64, f32, f16
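Operators such as len, loadu, add and storeu are typically combined in a loop that processes one register's worth of elements per iteration and finishes with a scalar tail. The sketch below shows that loop shape using mock, scalar-emulated stand-ins in a hypothetical namespace demo. It does not call NSIMD, and add_arrays is an illustrative name.

```cpp
#include <cassert>
#include <cstddef>

namespace demo {

struct vf32 { float v[4]; };  // hypothetical 4-lane register

inline int len() { return 4; }  // number of lanes per register
inline vf32 loadu(const float *p) {
  vf32 r;
  for (int i = 0; i < 4; ++i) r.v[i] = p[i];
  return r;
}
inline vf32 add(vf32 a, vf32 b) {
  vf32 r;
  for (int i = 0; i < 4; ++i) r.v[i] = a.v[i] + b.v[i];
  return r;
}
inline void storeu(float *p, vf32 a) {
  for (int i = 0; i < 4; ++i) p[i] = a.v[i];
}

// Computes dst[i] = x[i] + y[i] for i in [0, n).
inline void add_arrays(float *dst, const float *x, const float *y,
                       std::size_t n) {
  const std::size_t step = static_cast<std::size_t>(len());
  std::size_t i = 0;
  // Main loop: one full register per iteration.
  for (; i + step <= n; i += step) {
    storeu(dst + i, add(loadu(x + i), loadu(y + i)));
  }
  // Scalar tail for the remaining n % step elements.
  for (; i < n; ++i) dst[i] = x[i] + y[i];
}

}  // namespace demo
```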

C advanced API (only available in C11)

The C advanced API takes advantage of the C11 _Generic keyword to provide function overloading. Unlike the base API described above, there is no need to pass the base type or the SIMD extension as arguments. That information is carried by the types provided by this API.

nsimd_pack_SCALAR_SIMDEXT represents a SIMD vector containing SCALAR elements of SIMD extension SIMDEXT.
nsimd_packl_SCALAR_SIMDEXT represents a SIMD vector of logicals for SCALAR elements of SIMD extension SIMDEXT.

There are versions of the above types without SIMDEXT, for which the targeted SIMD extension is chosen automatically.

nsimd_pack_SCALAR represents a SIMD vector containing SCALAR elements.
nsimd_packl_SCALAR represents a SIMD vector of logicals for SCALAR elements.

Generic types are also available:

nsimd_pack(SCALAR) is a type to represent a SIMD vector containing SCALAR elements. SCALAR must be one of the scalar types listed above.
nsimd_packl(SCALAR) is a type to represent a SIMD vector of logicals for SCALAR elements. SCALAR must be one of the scalar types listed above.
nsimd_pack_a(SCALAR, SIMDEXT) is a type to represent a SIMD vector containing SCALAR elements for the SIMD extension SIMDEXT. SCALAR must be one of the scalar types listed above and SIMDEXT must be a valid SIMD extension.
nsimd_packl_a(SCALAR, SIMDEXT) is a type to represent a SIMD vector of logicals for SCALAR elements for the SIMD extension SIMDEXT. SCALAR must be one of the scalar types listed above and SIMDEXT must be a valid SIMD extension.

Finally, operators follow the naming pattern nsimd_FUNCNAME, e.g. nsimd_add, nsimd_sub.

C++ advanced API

The C++ advanced API is called advanced not because it requires C++11 or above but because it relies on the particular way ARM implements SVE in its compiler. We do not know whether GCC (and possibly MSVC in the distant future) will use the same approach. In any case, the current implementation allows us to put SVE SIMD vectors inside structs that behave like standard structs. If you want to be sure to write portable code, do not use this API. Two new types are available.

nsimd::pack<SCALAR, N, SIMDEXT> represents N SIMD vectors containing SCALAR elements of SIMD extension SIMDEXT. You can specify only the first template argument. The second defaults to 1 while the third defaults to the detected SIMDEXT.
nsimd::packl<SCALAR, N, SIMDEXT> represents N SIMD vectors of logical type containing SCALAR elements of SIMD extension SIMDEXT. You can specify only the first template argument. The second defaults to 1 while the third defaults to the detected SIMDEXT.

Use N > 1 when declaring packs to have an unroll of N. This is particularly useful on ARM.

Functions that take packs do not take any other argument unless specified otherwise, e.g. the load family of functions. It is impossible to determine the kind of pack (unroll and SIMDEXT) from the type of a pointer. Therefore, in this case, the last argument must be a pack, and the function returns a pack of that same type. Some functions are also available as C++ operators. Functions follow the naming pattern nsimd::FUNCNAME.
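The sketch below illustrates two points about packs: an unroll factor N bundles N registers into one object, and a load has to be told which pack type to produce because a bare pointer carries no such information. The namespace demo, the 128-bit register assumption and the explicit template argument on loadu are choices made for this illustration only. The way NSIMD itself expects the pack type to be supplied may differ.

```cpp
#include <cassert>

namespace demo {

// Hypothetical pack: N registers of 128 bits each, stored as a flat array.
template <typename T, int N = 1> struct pack {
  enum { lanes_per_register = 16 / sizeof(T) };
  T v[N * lanes_per_register];
  static int len() { return N * static_cast<int>(lanes_per_register); }
};

// The pack type cannot be deduced from the pointer, so it is named explicitly.
template <typename Pack, typename T> Pack loadu(const T *p) {
  Pack r;
  for (int i = 0; i < Pack::len(); ++i) r.v[i] = p[i];
  return r;
}

// Operator form of addition; it covers all N registers of the pack.
template <typename T, int N>
pack<T, N> operator+(const pack<T, N> &a, const pack<T, N> &b) {
  pack<T, N> r;
  for (int i = 0; i < pack<T, N>::len(); ++i) {
    r.v[i] = static_cast<T>(a.v[i] + b.v[i]);
  }
  return r;
}

}  // namespace demo
```

With N = 2 and f32 elements, a single operator+ call in this sketch processes 8 values, which is the effect of unrolling the equivalent N = 1 loop twice.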