SPMD stands for Single Program Multiple Data. It is a programming paradigm.
It is used by NVIDIA CUDA. Its strengh lies in writing computation kernels.
Basically you concentrate your attention on the kernel itself and not on
how to run it. An example is worth more than a long speech, let's take vector
addition of float
's.
spmd_kernel_1d(add, float *dst, float *a, float *b)
k_store(dst, k_load(a) + k_load(b));
spmd_kernel_end
It would be written as follows for CUDA (assuming that the vector lenghts are multiples of block's sizes).
__global__ add(float *dst, float *a, float *b) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
dst[i] = a[i] + b[i];
}
NSIMD's SPMD is a small DSL in standard C++98 that can be used to write computation kernels for GPUs (NVIDIA's and AMD's) and any SIMD units supported by NSIMD. On a more technical side, the DSL keywords are macros that:
translates to C-ish keywords for GPUs and
use masks for CPUs as Intel ISPC (https://ispc.github.io/).
The difference between NSIMD's SPMD is that a single code can be compiled to target GPUs and CPUs whereas:
NVIDIA CUDA only targets NVIDIA GPUs
AMD HIP only targets NVIDIA and AMD GPUs
INTEL ICP only targets Intel SIMD units and ARM NEON
As for CUDA kernels you can write templated and non-templated CUDA kernels. Declaring a kernel function and launching it is straight forward:
spmd_kernel_1d(kernel_name, arguments)
// kernel code
spmd_kernel_end
int main() {
spmd_launch_kernel_1d(kernel_name, bit_width, param,
vector_size, arguments);
return 0;
}
The bit_width
argument indicates the types width in bits that will be
available inside kernels. The param
argument indicates the unroll factor for
CPUs and the number of threads per block for GPUs. The vector_size
argument
indicates the vectors length passed as arguments.
Device functions can also been implemented. They are functions that will only run on the device. As for kernels, they have the same restrictions.
spmd_dev_func(k_float device_func, k_float a, k_float b)
// Device function code
spmd_dev_func_end
spmd_kernel_1d(kernel, arguments)
// ...
spmd_call_dev_func(device_func, a, b);
// ...
spmd_kernel_end
The caveat with spmd_dev_func
is that its first argument must be the return
type followed by the device function name.
It is also possible to write templated kernels. Due to C++ __VA_ARGS__
limitations the number of template argument is limited to one of kind
typename
. If more types or integers are to be passed to device kernels or
functions they have to be boxed inside a struct.
struct mul_t {
spmd_dev_func(static k_float dev_impl, k_float a, k_float b)
return a * b;
spmd_dev_func_end
};
struct add_t {
spmd_dev_func(static k_float dev_impl, k_float a, k_float b)
return a + b;
spmd_dev_func_end
};
// Op is the template argument (typename Op in C++ code)
spmd_tmpl_dev_func(k_float trampoline, Op, k_float a, k_float b)
return Op::template spmd_call_dev_func(dev_impl, a, b);
spmd_dev_func_end
// Op is the template argument (typename Op in C++ code)
spmd_tmpl_kernel_1d(tmpl_kernel, Op, arguments)
// ...
spmd_call_tmpl_dev_func(trampoline, Op, a, b);
// ...
spmd_kernel_end
int main() {
// Kernel call for addition
spmd_launch_tmpl_kernel_1d(tmpl_kernel, add_t, 32, 1, N, arguments);
// Kernel call for multiplication
spmd_launch_tmpl_kernel_1d(tmpl_kernel, mul_t, 32, 1, N, arguments);
return 0;
}
The DSL is of course constraint by C++ syntax and constructs. This implies
some strange syntax and the impossibility to use infix operator =
.
For now (2020/05/16) the NSIMD SPMD DSL does only supports if
's, while-loops
and returns
. It seems that for-loops and do-while-loops cannot be nicely
proposed, i.e. with a nice syntax, the switch-case keywords cannot be
implemented with a good conformence to the semantic of their C++ counterparts.
Goto's also cannot be implemented properly.
The following self-explanatory variable types are available inside kernels and devices functions:
k_int
for signed integers
k_uint
for unsigned integers
k_float
for floatting point numbers
k_bool
for booleans
As explained above the bit-width of the above types are determined by the
launch kernel function. Note that k_float
does not exists for 8-bits types.
Given a pointer, the proper way to load data is to use k_load(ptr)
. For
storing a value to memory k_store
is to be used.
k_store(ptr, value);
k_store(ptr, expression);
As explained above, there is no need to compute the offset to apply to pointers. This is hidden from the programmer.
operator=
)Due to C++ ADL (https://en.cppreference.com/w/cpp/language/adl) and the
need for keeping things simple for the compiler (which does not always mean
simple for the programmer) the use of infix operator =
will not produce
a copmilation error but will give incorrect result. You should use k_set
.
k_set(var, value);
k_set(var, expression);
As written above, k_set
assign value or the result of an expression to a
variable.
You should not use plan C++ if
's or else
's. This will not cause compilation
error but will produce incorrect results at runtime. You should use k_if
,
k_else
, k_elseif
and k_endif
instead. they have the same semantic as
their C++ counterparts.
spmd_kernel_1d(if_elseif_else, float *dst, float *a_ptr)
k_float a, ret;
k_set(a, k_load(a_ptr));
k_if (a > 15.0f)
k_set(ret, 15.0f);
k_elseif ( a > 10.0f)
k_set(ret, 10.0f);
k_elseif ( a > 5.0f)
k_set(ret, 5.0f);
k_else
k_set(ret, 0.0f);
k_endif
k_store(dst, ret);
spmd_kernel_end
You should not use plan C++ while
's, break
's and continue
's. This will
not cause compilation error but will produce incorrect results at runtime.
You should use k_while
, k_break
, k_continue
and k_endif
instead. They
have the same semantic as their C++ counterparts.
spmd_kernel_1d(binpow, float *dst, float *a_ptr, int *p_ptr)
k_float a, ret;
k_set(a, k_load(a_ptr));
k_set(ret, 1.0f);
k_int p;
k_set(p, k_load(p_ptr));
k_while(p > 0)
k_if ((p & 1) != 0)
k_set(ret, ret * a);
k_endif
k_set(a, a * a);
k_set(p, p >> 1);
k_endwhile
k_store(dst, ret);
spmd_kernel_end
Returns cannot be implemented as macros overloading is not possible in a
standard way with an overload taking zero arguments. So returning has to be
done correctly. The k_return
keyword has the same semantic as the C++
return
keyword without arguments and can be used at will for kernels (as
kernels return type is always void
) and for device functions returning
void
.
For device functions returning a value it is recommanded to proceed this way:
Declare a variable, say ret
, to store the return value.
Whereever you need to return, set the variable appropriately with k_set
and return with k_return
.
At the end of the function use return ret;
.
spmd_dev_func(k_int func, k_int a)
k_float ret;
k_if (a == 0)
k_set(ret, 0);
k_return;
k_endif
k_if (a == 1)
k_set(ret, -1);
k_return;
k_endif
k_set(ret, a);
return ret;
spmd_dev_func_end
This paragraph applies mainly when targeting CPUs. Using techniques described below won't affect GPUs.
If you are familiar with the SIMD technique of masking to emulate loops and
if's you may know that k_set
and k_store
are implemented using respectively
nsimd::if_else
and nsimd::maskz_storeu
which may incur performance
penalties. When you know that a simple assignment or store is sufficient
you may use the unmasked variants:
k_unmasked_set
translates into a C++ assignment.
k_unmasked_store
translates into a C++ SIMD store.
Their arguments are exactly the same as k_set
and k_store
. Unmasked
operations can usually be used at the beginning of device functions and also
inside loops, on temporary variables, knowing that the result of the latter
won't be needed later.
You may also use C++ standard keywords and constructs. But be aware that doing so will apply all the same treatment too all SIMD lanes. This can be useful when the operations involved are independant of the processed data as in the example below.
spmd_dev_func(k_float newton_raphson_sqrt, k_float a, k_float x0)
k_float ret;
for (int i = 0; i < 6; i++) {
k_unmasked_set(ret, (ret + ret * a) / 2.0f);
}
return ret;
spmd_dev_func_end