NSIMD documentation

SPMD programming module documentation

Overview

What is SPMD?

SPMD stands for Single Program Multiple Data. It is a programming paradigm. It is used by NVIDIA CUDA. Its strengh lies in writing computation kernels. Basically you concentrate your attention on the kernel itself and not on how to run it. An example is worth more than a long speech, let's take vector addition of float's.

spmd_kernel_1d(add, float *dst, float *a, float *b)
  k_store(dst, k_load(a) + k_load(b));
spmd_kernel_end

It would be written as follows for CUDA (assuming that the vector lenghts are multiples of block's sizes).

__global__ add(float *dst, float *a, float *b) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  dst[i] = a[i] + b[i];
}

NSIMD's SPMD is a small DSL in standard C++98 that can be used to write computation kernels for GPUs (NVIDIA's and AMD's) and any SIMD units supported by NSIMD. On a more technical side, the DSL keywords are macros that:

translates to C-ish keywords for GPUs and
use masks for CPUs as Intel ISPC (https://ispc.github.io/).

The difference between NSIMD's SPMD is that a single code can be compiled to target GPUs and CPUs whereas:

NVIDIA CUDA only targets NVIDIA GPUs
AMD HIP only targets NVIDIA and AMD GPUs
INTEL ICP only targets Intel SIMD units and ARM NEON

Writing kernels and device functions

As for CUDA kernels you can write templated and non-templated CUDA kernels. Declaring a kernel function and launching it is straight forward:

spmd_kernel_1d(kernel_name, arguments)
  // kernel code
spmd_kernel_end

int main() {

  spmd_launch_kernel_1d(kernel_name, bit_width, param,
                        vector_size, arguments);

  return 0;
}

The bit_width argument indicates the types width in bits that will be available inside kernels. The param argument indicates the unroll factor for CPUs and the number of threads per block for GPUs. The vector_size argument indicates the vectors length passed as arguments.

Device functions can also been implemented. They are functions that will only run on the device. As for kernels, they have the same restrictions.

spmd_dev_func(k_float device_func, k_float a, k_float b)
  // Device function code
spmd_dev_func_end

spmd_kernel_1d(kernel, arguments)

  // ...

  spmd_call_dev_func(device_func, a, b);

  // ...

spmd_kernel_end

The caveat with spmd_dev_func is that its first argument must be the return type followed by the device function name.

It is also possible to write templated kernels. Due to C++ __VA_ARGS__ limitations the number of template argument is limited to one of kind typename. If more types or integers are to be passed to device kernels or functions they have to be boxed inside a struct.

struct mul_t {
  spmd_dev_func(static k_float dev_impl, k_float a, k_float b)
    return a * b;
  spmd_dev_func_end
};

struct add_t {
  spmd_dev_func(static k_float dev_impl, k_float a, k_float b)
    return a + b;
  spmd_dev_func_end
};

// Op is the template argument (typename Op in C++ code)
spmd_tmpl_dev_func(k_float trampoline, Op, k_float a, k_float b)
  return Op::template spmd_call_dev_func(dev_impl, a, b);
spmd_dev_func_end

// Op is the template argument (typename Op in C++ code)
spmd_tmpl_kernel_1d(tmpl_kernel, Op, arguments)

  // ...

  spmd_call_tmpl_dev_func(trampoline, Op, a, b);

  // ...

spmd_kernel_end

int main() {

  // Kernel call for addition
  spmd_launch_tmpl_kernel_1d(tmpl_kernel, add_t, 32, 1, N, arguments);

  // Kernel call for multiplication
  spmd_launch_tmpl_kernel_1d(tmpl_kernel, mul_t, 32, 1, N, arguments);

  return 0;
}

The NSIMD SPMD C++ DSL

The DSL is of course constraint by C++ syntax and constructs. This implies some strange syntax and the impossibility to use infix operator =. For now (2020/05/16) the NSIMD SPMD DSL does only supports if's, while-loops and returns. It seems that for-loops and do-while-loops cannot be nicely proposed, i.e. with a nice syntax, the switch-case keywords cannot be implemented with a good conformence to the semantic of their C++ counterparts. Goto's also cannot be implemented properly.

Variables types available in kernels and device functions

The following self-explanatory variable types are available inside kernels and devices functions:

k_int for signed integers
k_uint for unsigned integers
k_float for floatting point numbers
k_bool for booleans

As explained above the bit-width of the above types are determined by the launch kernel function. Note that k_float does not exists for 8-bits types.

Load/store from/to memory

Given a pointer, the proper way to load data is to use k_load(ptr). For storing a value to memory k_store is to be used.

k_store(ptr, value);
k_store(ptr, expression);

As explained above, there is no need to compute the offset to apply to pointers. This is hidden from the programmer.

Assignment operator (`operator=`)

Due to C++ ADL (https://en.cppreference.com/w/cpp/language/adl) and the need for keeping things simple for the compiler (which does not always mean simple for the programmer) the use of infix operator = will not produce a copmilation error but will give incorrect result. You should use k_set.

k_set(var, value);
k_set(var, expression);

As written above, k_set assign value or the result of an expression to a variable.

if, then, else

You should not use plan C++ if's or else's. This will not cause compilation error but will produce incorrect results at runtime. You should use k_if, k_else, k_elseif and k_endif instead. they have the same semantic as their C++ counterparts.

spmd_kernel_1d(if_elseif_else, float *dst, float *a_ptr)

  k_float a, ret;
  k_set(a, k_load(a_ptr));

  k_if (a > 15.0f)

    k_set(ret, 15.0f);

  k_elseif ( a > 10.0f)

    k_set(ret, 10.0f);

  k_elseif ( a > 5.0f)

    k_set(ret, 5.0f);

  k_else

    k_set(ret, 0.0f);

  k_endif

  k_store(dst, ret);

spmd_kernel_end

while loops

You should not use plan C++ while's, break's and continue's. This will not cause compilation error but will produce incorrect results at runtime. You should use k_while, k_break, k_continue and k_endif instead. They have the same semantic as their C++ counterparts.

spmd_kernel_1d(binpow, float *dst, float *a_ptr, int *p_ptr)

  k_float a, ret;
  k_set(a, k_load(a_ptr));
  k_set(ret, 1.0f);
  k_int p;
  k_set(p, k_load(p_ptr));

  k_while(p > 0)

    k_if ((p & 1) != 0)

      k_set(ret, ret * a);

    k_endif

    k_set(a, a * a);
    k_set(p, p >> 1);

  k_endwhile

  k_store(dst, ret);

spmd_kernel_end

Returns

Returns cannot be implemented as macros overloading is not possible in a standard way with an overload taking zero arguments. So returning has to be done correctly. The k_return keyword has the same semantic as the C++ return keyword without arguments and can be used at will for kernels (as kernels return type is always void) and for device functions returning void.

For device functions returning a value it is recommanded to proceed this way:

Declare a variable, say ret, to store the return value.
Whereever you need to return, set the variable appropriately with k_set and return with k_return.
At the end of the function use return ret;.

spmd_dev_func(k_int func, k_int a)

  k_float ret;

  k_if (a == 0)
    k_set(ret, 0);
    k_return;
  k_endif

  k_if (a == 1)
    k_set(ret, -1);
    k_return;
  k_endif

  k_set(ret, a);

  return ret;

spmd_dev_func_end

Advanced techniques and functions

This paragraph applies mainly when targeting CPUs. Using techniques described below won't affect GPUs.

If you are familiar with the SIMD technique of masking to emulate loops and if's you may know that k_set and k_store are implemented using respectively nsimd::if_else and nsimd::maskz_storeu which may incur performance penalties. When you know that a simple assignment or store is sufficient you may use the unmasked variants:

k_unmasked_set translates into a C++ assignment.
k_unmasked_store translates into a C++ SIMD store.

Their arguments are exactly the same as k_set and k_store. Unmasked operations can usually be used at the beginning of device functions and also inside loops, on temporary variables, knowing that the result of the latter won't be needed later.

You may also use C++ standard keywords and constructs. But be aware that doing so will apply all the same treatment too all SIMD lanes. This can be useful when the operations involved are independant of the processed data as in the example below.

spmd_dev_func(k_float newton_raphson_sqrt, k_float a, k_float x0)
  k_float ret;
  for (int i = 0; i < 6; i++) {
    k_unmasked_set(ret, (ret + ret * a) / 2.0f);
  }
  return ret;
spmd_dev_func_end