Expression templates are a C++ template metaprogramming technique that lets the programmer write high-level expressions while still obtaining loop fusion. Take the following example.
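#include <cstddef>
#include <vector>

// Element-wise sum: reads a and b, allocates and writes a new vector.
std::vector<float> operator+(std::vector<float> const &a,
                             std::vector<float> const &b) {
  std::vector<float> ret(a.size());
  for (size_t i = 0; i < a.size(); i++) {
    ret[i] = a[i] + b[i];
  }
  return ret;
}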
int main() {
  std::vector<float> a, b, c, d, sum;
  ...
  sum = a + b + c + d;
  ...
  return 0;
}
The expression a + b + c + d involves three calls to operator+. Each call reads two vectors and writes one, so at least nine memory passes are necessary. This can be optimized as follows.
int main() {
  std::vector<float> a, b, c, d, sum;
  ...
  // Single fused loop: every input is read once and sum is written once.
  for (size_t i = 0; i < a.size(); i++) {
    sum[i] = a[i] + b[i] + c[i] + d[i];
  }
  ...
  return 0;
}
The rewritten loop requires only five memory passes (four reads and one write), which is of course better, but as humans we prefer to write a + b + c + d. Expression templates solve exactly this problem: the programmer writes a + b + c + d and the compiler sees the loop written above.
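The mechanism can be illustrated with a deliberately small sketch that does not depend on NSIMD. The names toy, Leaf, Add, evaluate and sum4 are hypothetical and exist only for this illustration. They are not part of the NSIMD API, and the real engine additionally vectorizes the loop.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

namespace toy {

// Hypothetical leaf node: a non-owning view over an existing vector.
struct Leaf {
  const std::vector<float> &v;
  explicit Leaf(const std::vector<float> &v_) : v(v_) {}
  float operator[](std::size_t i) const { return v[i]; }
  std::size_t size() const { return v.size(); }
};

// Hypothetical inner node: the lazy sum of two sub-expressions.
// Nothing is computed until operator[] is called.
template <typename L, typename R> struct Add {
  L lhs;
  R rhs;
  Add(const L &l, const R &r) : lhs(l), rhs(r) {}
  float operator[](std::size_t i) const { return lhs[i] + rhs[i]; }
  std::size_t size() const { return lhs.size(); }
};

// Builds a node of the expression tree instead of a temporary vector.
// Found through argument-dependent lookup for types of this namespace.
template <typename L, typename R>
Add<L, R> operator+(const L &l, const R &r) {
  return Add<L, R>(l, r);
}

// The only loop: after inlining, expr[i] becomes a[i] + b[i] + c[i] + d[i].
template <typename E>
void evaluate(std::vector<float> &out, const E &expr) {
  out.resize(expr.size());
  for (std::size_t i = 0; i < expr.size(); i++) {
    out[i] = expr[i];
  }
}

// Convenience wrapper (assumes the four inputs have the same size).
inline std::vector<float> sum4(const std::vector<float> &a,
                               const std::vector<float> &b,
                               const std::vector<float> &c,
                               const std::vector<float> &d) {
  std::vector<float> out;
  evaluate(out, Leaf(a) + Leaf(b) + Leaf(c) + Leaf(d));
  return out;
}

} // namespace toy
```

Calling toy::sum4(a, b, c, d) builds a value of type Add<Add<Add<Leaf, Leaf>, Leaf>, Leaf> and traverses the inputs once, with no intermediate vectors.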
This module provides expression templates on top of NSIMD core. As a consequence, the loops that the compiler deduces from the high-level expressions are optimized using SIMD instructions. NVIDIA and AMD GPUs are also supported, through CUDA and ROCm/HIP. The API for expression templates in NSIMD is C++98 compatible and works with any container, as its only requirement is that the data be contiguous.
All inputs to an expression must be declared using tet1d::in, while the output must be declared using tet1d::out.
int main() {
  std::vector<float> a, b, c;
  ...
  // c must already hold enough elements: tet1d::out takes a raw pointer.
  tet1d::out(&c[0]) = tet1d::in(&a[0], a.size()) + tet1d::in(&b[0], b.size());
  ...
  return 0;
}
template <typename T, typename I> inline node in(const T *data, I sz);
Constructs an input for expression templates starting at address data and containing sz elements. The return type of this function, node, can be named with the help of the TET1D_IN(T) macro, where T is the underlying type of data (ints, floats, doubles...).
template <typename T> node out(T *data);
Constructs an output for expression templates starting at address data. Note that memory must be allocated by the user before passing it to the expression template engine. The output type can be named with the TET1D_OUT(T) macro, where T is the underlying type (ints, floats, doubles...).
Note that it is possible to pass parameters to the expression template engine to specify the number of threads per block for GPUs or the SIMD extension to use...
template <typename T, typename Pack> node out(T *data, int threads_per_block, void *stream);
Constructs an output for expression templates starting at address data. Note that memory must be allocated by the user before passing it to the expression template engine. The Pack parameter is useful when compiling for CPUs. Its type is nsimd::pack<...>, which lets the developer specify all details about the NSIMD packs that will be used by the expression template engine. The threads_per_block and stream arguments are used only when compiling for GPUs. Their meaning is contained in their names. The output type can be named with the TET1D_OUT_EX(T, N, SimdExt) macro, where T is the underlying type (ints, floats, doubles...), N is the unroll factor and SimdExt is the SIMD extension.
Moreover, a MATLAB-like syntax is provided. One can select a subrange of a given input. Indexes are interpreted as in Python: -1 represents the last element. The constant tet1d::end = -1 allows one to write portable code.
int main() {
  std::vector<float> a, b, c;
  ...
  TET1D_IN(float) va = tet1d::in(&a[0], a.size());
  TET1D_IN(float) vb = tet1d::in(&b[0], b.size());
  // tet1d::out takes a raw pointer to memory already allocated by the user.
  tet1d::out(&c[0]) = va(10, tet1d::end - 10) + vb;
  ...
  return 0;
}
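The Python-like index convention used above can be sketched as follows. The helper resolve_index and the constant toy_end are hypothetical names introduced for this illustration and are not part of NSIMD. The sketch only shows how a negative index maps to a position. It does not check bounds, and it does not model how the engine treats the upper bound of a subrange, which the text above does not specify.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for tet1d::end, which is defined as -1.
const int toy_end = -1;

// Hypothetical helper: map a Python-style index to a position in a
// sequence of n elements. Non-negative indices are returned unchanged,
// negative ones count from the back (-1 is the last element).
// No bounds checking is performed in this sketch.
inline std::size_t resolve_index(int i, std::size_t n) {
  if (i >= 0) {
    return static_cast<std::size_t>(i);
  }
  return n - static_cast<std::size_t>(-i);
}
```

Under this convention, for an input of 100 elements, tet1d::end - 10 is the index -11, which designates position 89.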
One can also specify which elements of the output must be rewritten with the following syntax.
int main() {
  std::vector<float> a, b, c;
  ...
  TET1D_IN(float) va = tet1d::in(&a[0], a.size());
  TET1D_IN(float) vb = tet1d::in(&b[0], b.size());
  TET1D_OUT(float) vc = tet1d::out(&c[0]);
  vc(va >= 10 && va < 20) = vb;
  ...
  return 0;
}
In the example above, element i of vc is written only if va[i] >= 10 and va[i] < 20. The expression appearing in the parentheses can contain arbitrary expression templates as long as the underlying type is bool.
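For reference, the masked assignment described above behaves like the following scalar loop. This is a plain C++ sketch of the semantics only. The function name masked_copy is hypothetical, the three vectors are assumed to have the same size, and the real engine evaluates the expression with SIMD instructions or on a GPU rather than with this loop.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Scalar reference for: vc(va >= 10 && va < 20) = vb;
// c[i] is overwritten only where the mask holds, other elements of c
// keep their previous value. Assumes a, b and c have the same size.
inline void masked_copy(std::vector<float> &c, const std::vector<float> &a,
                        const std::vector<float> &b) {
  for (std::size_t i = 0; i < c.size(); i++) {
    if (a[i] >= 10.0f && a[i] < 20.0f) {
      c[i] = b[i];
    }
  }
}
```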
Warning about using auto
Using auto can lead to surprising results. We advise you never to use auto when dealing with expression templates. Indeed, using auto gives the variable an obscure type representing the computation tree of the expression template. This implies that you won't be able to get data from this variable, i.e. get the .data member for example. Likewise, neither this variable nor its type can be used in template arguments where you need it.
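The pitfall can be reproduced with a toy expression template that is independent of NSIMD. The names toy_auto, Vec, Sum and demo are hypothetical and serve only this illustration, which requires C++11 for auto and static_assert and models a single addition only.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <vector>

namespace toy_auto {

struct Sum; // expression node, defined below

// Hypothetical container type exposing its storage through .data.
struct Vec {
  std::vector<float> data;
  Vec(std::size_t n, float x) : data(n, x) {}
  Vec(const Sum &s); // evaluates the expression into real storage
};

// Hypothetical node of the computation tree: it has no .data member.
struct Sum {
  const Vec &lhs;
  const Vec &rhs;
  Sum(const Vec &l, const Vec &r) : lhs(l), rhs(r) {}
  float operator[](std::size_t i) const { return lhs.data[i] + rhs.data[i]; }
  std::size_t size() const { return lhs.data.size(); }
};

inline Vec::Vec(const Sum &s) : data(s.size()) {
  for (std::size_t i = 0; i < s.size(); i++) {
    data[i] = s[i];
  }
}

inline Sum operator+(const Vec &a, const Vec &b) { return Sum(a, b); }

inline float demo() {
  Vec a(3, 1.0f), b(3, 2.0f);
  auto lazy = a + b; // deduced as Sum: writing lazy.data would not compile
  static_assert(std::is_same<decltype(lazy), Sum>::value,
                "auto keeps the expression node type");
  Vec eager = a + b; // naming the type forces evaluation
  return eager.data[0] + lazy[0];
}

} // namespace toy_auto
```

Spelling out the intended type, as the TET1D_IN and TET1D_OUT macros allow for NSIMD, avoids ending up with the node type by accident.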