Can I use an nsimd::pack as a std::vector? No, these are two very different objects. An nsimd::pack represents a SIMD register whereas a std::vector represents a chunk of memory. You should separate concerns: use std::vector to store data in your structs or classes; nsimd::pack should only be used in computation kernels and nowhere else, especially not in structs or classes.
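As an illustration, here is a minimal sketch of that separation, assuming the advanced C++ API (loadu/storeu/set1/len as used elsewhere in this FAQ), the nsimd-all.hpp umbrella header, and a made-up kernel that doubles a buffer. The data lives in std::vector; packs only appear inside the loop:

```c++
#include <nsimd/nsimd-all.hpp>
#include <vector>

// Hypothetical kernel: out[i] = 2 * in[i]. Storage is plain std::vector,
// nsimd::pack only exists inside the computation loop. For brevity we
// assume in.size() is a multiple of the pack length.
void times_two(std::vector<float> const &in, std::vector<float> &out) {
  typedef nsimd::pack<float> pf;
  for (int i = 0; i < (int)in.size(); i += nsimd::len(pf())) {
    pf v = nsimd::loadu<pf>(&in[i]);                   // unaligned load from the vector
    nsimd::storeu(&out[i], v * nsimd::set1<pf>(2.0f)); // compute, store back
  }
}
```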
There are several reasons which can reduce the speed-up:

- Have you enabled compiler optimizations? You must enable all compiler optimizations (like -O3).
- Have you compiled in 64 bit mode? There is a significant performance increase on architectures supporting 64 bit binaries.
- Is your code trivially vectorizable? Modern compilers can vectorize trivial code segments automatically. If you benchmark a trivial scalar code against a vectorized code, the compiler may vectorize the scalar code, giving it performance similar to the vectorized version.
- Some architectures do not provide certain functionalities. For example, AVX2 chips do not provide a way to convert long to double, so using nsimd::cvt<f64> will produce an emulation for-loop in the resulting binary. To know which intrinsics are used by NSIMD you can consult wrapped_intrinsics.md.
The most common cause of segfaults in SIMD code is accessing non-aligned memory. For best performance, all memory should be aligned. NSIMD includes an aligned memory allocation function and an aligned memory allocator to help you with this. Please refer to tutorials.md for details on how to ensure that your memory is correctly aligned.

Another common cause is reading or writing data beyond the allocated memory. Do not forget that loading data into a SIMD vector loads, for example, 16 bytes (4 floats for a 128-bit register) from memory. If such a read starts at the last couple of elements of the allocated memory, it reads past the end and a segfault will be generated.
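For instance, a hedged sketch of getting aligned storage; the names nsimd::allocator, nsimd::aligned_alloc and nsimd::aligned_free used below are assumptions, so check tutorials.md for the exact API:

```c++
#include <nsimd/nsimd-all.hpp>
#include <vector>

void example() {
  // A std::vector whose storage comes from NSIMD's aligned allocator
  // (allocator name assumed, see tutorials.md), so aligned loads/stores are safe.
  std::vector<float, nsimd::allocator<float> > buf(1024);

  // Or allocate and free an aligned buffer directly (function names assumed):
  float *ptr = (float *)nsimd::aligned_alloc(1024 * sizeof(float));
  // ... use ptr with nsimd::loada / nsimd::storea ...
  nsimd::aligned_free(ptr);
}
```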
Not all SSE instructions have an equivalent AVX instruction. As a consequence NSIMD uses two SSE operations to emulate the equivalent AVX operation. Also, the cycles required for certain instructions are not equal on both architectures: for example, sqrt on SSE requires 13-14 cycles whereas sqrt on AVX requires 21-28 cycles. Please refer here for more information.
Very few integer operations are supported on AVX; AVX2 is required for most integer operations. If an NSIMD function is called on an integer AVX register, this register will be split into two SSE registers and the equivalent instruction called on both registers. In this case, no speed-up will be observed compared with SSE code. The same is true on POWER7, where double is not supported.
Have you compiled in release mode, with full optimization options?
Have you used a 64 bit compiler?
There are many SIMD related bugs across all compilers, and some compilers generate less than optimal code in some cases. Is it possible to update to a more recent compiler?
We provide workarounds for several compiler bugs; however, we may have missed some. You may also have found a bug in nsimd. Please report it through an issue on our GitHub with a minimal code example. We respond quickly to bug reports and do our best to patch them as quickly as possible.
If you require a certain intrinsic, you may search inside of NSIMD for it and then call the relevant function or look at wrapped_intrinsics.md.
In rare cases, the intrinsic may not be included in NSIMD, as we only wrap intrinsics where they make sense semantically. If a certain intrinsic does not fit this model, it may be excluded. In this case, you may call it yourself; however, note that this will not be portable.
To use a particular intrinsic, say _mm_avg_epu8, you can write the following:
```c++
nsimd::pack<u8> a, b, result;
result = nsimd::pack<u8>(
    _mm_avg_epu8(a.native_register(), b.native_register()));
```
Use nsimd::to_mask and nsimd::to_logical.
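For instance, a small sketch with the advanced C++ API; the direction of each conversion shown here is an assumption, so check the API reference:

```c++
nsimd::pack<float> a, b;                       // assume these hold data
nsimd::packl<float> l = a < b;                 // logical pack: one boolean per lane
nsimd::pack<float> m = nsimd::to_mask(l);      // assumed: logical -> all-ones/all-zeros mask
nsimd::packl<float> l2 = nsimd::to_logical(m); // assumed: mask -> logical pack
```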
General shuffles are not provided by NSIMD; see issue 8 on our GitHub. For now we only provide some length-agnostic shuffles such as zip and unzip; see the API in the Shuffle section.
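As an illustration, a sketch of the kind of interleaving shuffles meant here; the operator names ziplo/ziphi/unziplo/unziphi used below are assumptions, so check the Shuffle section for the exact spelling:

```c++
nsimd::pack<float> a, b; // assume these hold data
// Interleave the lower and upper halves of a and b (names assumed):
nsimd::pack<float> lo = nsimd::ziplo(a, b);
nsimd::pack<float> hi = nsimd::ziphi(a, b);
// Undo the interleaving: even-indexed and odd-indexed elements (names assumed):
nsimd::pack<float> even = nsimd::unziplo(lo, hi);
nsimd::pack<float> odd = nsimd::unziphi(lo, hi);
```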
No. You are welcome to contribute to NSIMD and add them as an NSIMD module. You should use expression templates instead. Strictly conforming STL algorithms do not provide means to control, for example, the unroll factor or the number of threads per block when compiling for GPUs.
Yes, we provide masked loads and stores; see the API at the "Loads & stores" section. We also provide nsimd::mask_for_loop_tail, which computes the mask for ending loops. Note however that using these is not recommended, as on most architectures there is no corresponding intrinsic, which results in slow code. It is recommended to finish loops using a scalar implementation, as sketched below.
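A minimal sketch of that pattern, assuming the advanced C++ API, the nsimd-all.hpp umbrella header, and a made-up element-wise kernel:

```c++
#include <nsimd/nsimd-all.hpp>

// Made-up kernel: out[i] = in[i] + 1, with in/out aligned buffers of n floats.
void add_one(const float *in, float *out, int n) {
  typedef nsimd::pack<float> pf;
  int step = nsimd::len(pf());
  int i = 0;
  // Vector body: only full packs, so it never reads or writes past the buffers.
  for (; i + step <= n; i += step)
    nsimd::storea(&out[i], nsimd::loada<pf>(&in[i]) + nsimd::set1<pf>(1.0f));
  // Scalar tail: the last few elements that do not fill a whole pack.
  for (; i < n; i++)
    out[i] = in[i] + 1.0f;
}
```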
Yes, we provide gathers and scatters; see the API at the "Loads & stores" section. Note also that most architectures do not provide such intrinsics, so this could result in slow code.
Autodetecting the SIMD extension is compiler/compiler version/CPU/system dependent, which means a lot of code for a (most likely buggy) feature that can be an inconvenience sometimes. Moreover, some compilers do not permit this feature; for example cf. https://www.boost.org/doc/libs/1_71_0/doc/html/predef/reference.html and https://msdn.microsoft.com/en-us/library/b0084kay.aspx. Thus a "manual" system is always necessary.
This is because of C++ overload resolution and our choice not to use complicated C++ tricks. Taking the example of if_else, suppose that we had called it "if_else" without the "1". When working with packs, one wants to be able to use if_else in this manner:
```c++
int main() {
  using namespace nsimd;
  typedef pack<int> pi;
  typedef pack<float> pf;
  int n;
  int *a, *b;     // suppose both point to n ints
  float *fa, *fb; // suppose both point to n floats
  for (int i = 0; i < n; i += len(pi())) {
    packl<int> cond = (loada<pi>(&a[i]) < loada<pi>(&b[i]));
    storea(&fb[i], if_else(cond, loada<pf>(&fb[i]), set1<pf>(0.0f)));
  }
  return 0;
}
```
But this causes a compiler error: the overload of if_else is ambiguous. Sure, one could use many C++-ish techniques to tackle this problem, but we chose not to, as the goal is to make the life of the compiler as easy as possible. So, as we want to favor the C++ advanced API because it is the most human readable, users of the C and C++ base APIs have to use if_else1.