NSIMD documentation

How to Contribute to nsimd?

You are welcome to contribute to nsimd. This document gives some details on how to add/wrap new intrinsics. When you have finished fixing some bugs or adding some new features, please make a pull request. One of our repository maintainers will then merge or comment on the pull request.

Prerequisites

How Do I Add Support for a New Intrinsic?

Introduction

nsimd currently supports the following architectures:

nsimd currently supports the following types:

As C and C++ do not support float16, nsimd provides its own types to handle them. Therefore, special care has to be taken when implementing intrinsics/operators on architectures that do not natively support them.

We will use the following abuse of language in the rest of this document. The type taken by an intrinsic is of course a SIMD vector, and more precisely a SIMD vector of chars, or a SIMD vector of shorts, or a SIMD vector of ints… Therefore, when we talk about an intrinsic, we will say that it takes type T as arguments when it in fact takes a SIMD vector of T.

Our imaginary intrinsic

We will add support to the library for the following imaginary intrinsic: given a SIMD vector, suppose that this intrinsic, called foo, takes each element x of the vector and computes 1 / (1 - x) + 1 / (1 - x)^2. Moreover, suppose that hardware vendors all propose this intrinsic only for floating point numbers as follows:

The first thing to do is to declare this new intrinsic to the generation system. A lot of work is done by the generation system, such as generating all function signatures for the C and C++ APIs, tests, benchmarks and documentation. Of course the default documentation does not say much, but you can add a better description.
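Before wiring foo into the generation system, here is a quick scalar sanity check of the formula in plain Python (the function name foo_scalar is ours, just to fix ideas):

```python
def foo_scalar(x):
    # foo(x) = 1 / (1 - x) + 1 / (1 - x)^2, defined for x != 1
    return 1.0 / (1.0 - x) + 1.0 / ((1.0 - x) * (1.0 - x))

print(foo_scalar(0.5))  # 1/0.5 + 1/0.25 = 2 + 4 = 6.0
```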

Registering the intrinsic (or operator)

A function or an intrinsic is called an operator in the generation system. Go to the bottom of egg/operators.py and add the following just after the Rsqrt11 class.

class Foo(Operator):
    full_name = 'foo'
    signature = 'v foo v'
    types = common.ftypes
    domain = Domain('R\{1}')
    categories = [DocBasicArithmetic]

This little class will be processed by the generation system so that the operator foo will be available to the end-user of the library in both the C and C++ APIs. Each member of this class controls how the generation is done:

In our case, v foo v means that foo takes one SIMD vector as argument and returns a SIMD vector as output. Several signatures will be generated for this intrinsic according to the types it supports. In our case the intrinsic only supports floating point types.
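The shape of a signature string can be checked with plain Python string splitting (illustrative only; the real parser in the generation system does more, e.g. handling several arguments and other type letters):

```python
signature = 'v foo v'
# 'v' stands for a SIMD vector: foo takes one vector and returns one.
ret_kind, name, *arg_kinds = signature.split()
print(ret_kind, name, arg_kinds)  # v foo ['v']
```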

Many other members are supported by the generation system. We describe them quickly here and will give more details in a later version of this document. Default values are given in square brackets:

Implementing the operator

Now that the operator is registered, all signatures will be generated but the implementations will be missing. Type

python3 egg/hatch.py -lf

and the following files (among many other) should appear:

They each correspond to the implementations of the operator for each supported architecture. When opening one of these files, the implementations in plain C and then in C++ (falling back to the C function) should be there, but all the C implementations are reduced to abort();. This is the default when no implementation is provided. Note that the "cpu" architecture is just a fallback involving no SIMD at all. It is used on architectures not supported by nsimd or when the architecture does not offer any SIMD.

Providing implementations for foo is done by completing the following Python files:

The idea is to produce plain C (not C++) code using Python string formatting. Each of the Python files provides some helper functions to ease the programmer's job as much as possible. Every file provides the same "global" variables, available in every function, and is designed in the same way:

  1. At the bottom of the file is the get_impl function taking the following arguments:

  2. Inside this function lies a Python dictionary that provides functions implementing each operator. The string containing the C code for the implementations can be put here directly but usually the string is returned by a Python function that is written above in the same file.

  3. At the top of the file lie helper functions that help generate code. These are specific to each architecture. Do not hesitate to look at them.
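Putting the three points together, a platform file's structure can be sketched as follows (heavily simplified; the real get_impl functions take more arguments, and the helper and C function names below are illustrative, not the real ones):

```python
# Heavily simplified sketch of a platform_*.py file; names of helpers
# and the exact get_impl signature vary between the real files.

fmtspec = {}  # the "global" values available to every code-producing function

# 3. Helper functions near the top of the file.
def foo1(typ):
    return 'return nsimd_foo_{typ}({in0});'.format(typ=typ, **fmtspec)

# 1. The get_impl function at the bottom of the file.
def get_impl(func, simd_ext, from_typ):
    global fmtspec
    fmtspec = {'in0': 'a0'}
    # 2. One dictionary entry per operator, holding its C implementation.
    impls = {
        'foo': foo1(from_typ),
    }
    return impls[func]

print(get_impl('foo', 'cpu', 'f32'))  # return nsimd_foo_f32(a0);
```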

Let's begin with the cpu implementation. It turns out that there is no SIMD extension in this case; by convention, simd_ext == 'cpu' and this argument can therefore be ignored. So we first add an entry to the impls Python dictionary of the get_impl function:

    impls = {

        ...

        'reverse': reverse1(from_typ),
        'addv': addv(from_typ),
        'foo': foo1(from_typ) # Added at the bottom of the dictionary
    }
    if simd_ext != 'cpu':
        raise ValueError('Unknown SIMD extension "{}"'.format(simd_ext))

    ...

Then, above in the file, we write the Python function foo1 that will provide the C implementation of operator foo:

def foo1(typ):
    return func_body(
           '''ret.v{{i}} = ({typ})1 / (({typ})1 - {in0}.v{{i}}) +
                           ({typ})1 / ((({typ})1 - {in0}.v{{i}}) *
                                       (({typ})1 - {in0}.v{{i}}));'''. \
                                       format(**fmtspec), typ)

First note that the names of the arguments passed to the operator in its C implementation are not known on the Python side. Several other parameters are not known or are cumbersome to find out. Therefore each function has access to the fmtspec Python dictionary that holds some of these values:
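As a purely illustrative example (the exact set of keys is defined per architecture in the platform_*.py files, so do check the file you are editing; the values below are made up), fmtspec is used like this:

```python
# Illustrative values only; the real fmtspec is filled in by the
# generation system before the code-producing functions are called.
fmtspec = {
    'typ': 'f32',  # base type the code is being generated for
    'in0': 'a0',   # name of the first argument in the generated C function
}

c_code = 'ret = ({typ})1 / (({typ})1 - {in0});'.format(**fmtspec)
print(c_code)  # ret = (f32)1 / ((f32)1 - a0);
```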

The CPU extension can emulate 64-bit or 128-bit wide SIMD vectors. Each type is a struct containing as many members as necessary so that sizeof(T) * (number of members) == 64 or 128. To spare the developer from writing two cases (64-bit wide and 128-bit wide), the func_body function is provided as a helper. Note that the index {{i}} is in double curly brackets to go through two Python string formats:

  1. The first pass is done within the foo1 Python function and replaces {typ} and {in0}. In this pass {{i}} is formatted into {i}.

  2. The second pass is done by the func_body function, which unrolls the string the necessary number of times and replaces {i} by the corresponding index. The produced C code looks as if one had written the same statement for each member of the input struct.
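The two passes can be reproduced with plain Python string formatting (the values below are illustrative):

```python
# Template as written in foo1: {typ} and {in0} belong to the first pass,
# {{i}} survives it as {i} for the second pass.
template = 'ret.v{{i}} = ({typ})1 / (({typ})1 - {in0}.v{{i}});'

# First pass (inside foo1): fills {typ} and {in0}.
after_first = template.format(typ='f32', in0='a0')
# after_first == 'ret.v{i} = (f32)1 / ((f32)1 - a0.v{i});'

# Second pass (what func_body does): unroll and fill {i}.
unrolled = '\n'.join(after_first.format(i=i) for i in range(2))
print(unrolled)
```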

Then note that, as plain C (and C++) does not support a native 16-bit wide floating point type, nsimd emulates it with a C struct containing 4 floats (32-bit wide floating point numbers). In some cases, extra care has to be taken to handle this type.

For each SIMD extension one can find a types.h file (for cpu the file can be found in include/nsimd/cpu/cpu/types.h) that declares all SIMD types. If you have any doubt on a given type, do not hesitate to take a look at this file. Note also that this file is auto-generated and is therefore readable only after a first successful python3 egg/hatch.py -Af.

Now that the cpu implementation is written, you should be able to write the implementation of foo for other architectures. Each architecture has its particularities. We will cover them now by providing the Python implementations directly and explaining them in less detail.

Finally, note that clang-format is called by the generation system to autoformat the produced C/C++ code. Therefore, prefer indenting C code strings according to the surrounding Python indentation; do not write C code beginning at column 0 in Python files.

For Intel

def foo1(simd_ext, typ):
    if typ == 'f16':
        return '''nsimd_{simd_ext}_vf16 ret;
                  ret.v1 = {pre}foo_ps({in0}.v1);
                  ret.v2 = {pre}foo_ps({in0}.v2);
                  return ret;'''.format(**fmtspec)
    if simd_ext == 'sse2':
        return emulate_op1('foo', 'sse2', typ)
    if simd_ext in ['avx', 'avx512_knl']:
        return split_opn('foo', simd_ext, typ, 1)
    return 'return {pre}foo{suf}({in0});'.format(**fmtspec)

Here are some notes concerning the Intel implementation:

  1. float16s are emulated with two SIMD vectors of floats.

  2. When the intrinsic is provided by Intel, one can access it easily by constructing its name with {pre} and {suf}. Indeed, all Intel intrinsic names follow a pattern with a prefix indicating the SIMD extension and a suffix indicating the type of data. Like {in0}, {pre} and {suf} are provided and contain the correct values with respect to simd_ext and typ; you do not need to compute them yourself.

  3. When the intrinsic is not provided by Intel then one has to use tricks.

  4. Do not forget to add the foo entry to the impls dictionary in the get_impl Python function.
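To fix ideas, here is roughly how {pre} and {suf} combine into an Intel intrinsic name. The mapping below is illustrative and incomplete (the real tables live in platform_x86.py); the example uses the real add intrinsic, since foo is imaginary:

```python
# Illustrative and incomplete mapping; see platform_x86.py for the real one.
pre = {'sse2': '_mm_', 'avx2': '_mm256_', 'avx512_skylake': '_mm512_'}
suf = {'f32': '_ps', 'f64': '_pd'}

# For a real intrinsic such as add on AVX2/f32 this yields _mm256_add_ps.
call = 'return {pre}add{suf}({in0});'.format(
    pre=pre['avx2'], suf=suf['f32'], in0='a0')
print(call)  # return _mm256_add_ps(a0);
```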

For ARM

def foo1(simd_ext, typ):
    ret = f16f64(simd_ext, typ, 'foo', 'foo', 1)
    if ret != '':
        return ret
    if simd_ext in neon:
        return 'return vfooq_{suf}({in0});'.format(**fmtspec)
    else:
        return 'return svfoo_{suf}_z({svtrue}, {in0});'.format(**fmtspec)

Here are some notes concerning the ARM implementation:

  1. float16s can be natively supported but this is not mandatory.

  2. On 32-bit ARM chips, intrinsics on double almost never exist.

  3. The Python helper function f16f64 hides a lot of details concerning the above two points. If the function returns a non-empty string, then the returned string contains C code handling the case given by the pair (simd_ext, typ). We advise you to look at the generated C code. You will see the nsimd_FP16 macro used. When defined, it indicates that nsimd is compiled with native float16 support. This also affects SIMD types (see nsimd/include/arm/*/types.h).

  4. Do not forget to add the foo entry to the impls dictionary in the get_impl Python function.

For IBM POWERPC

def foo1(simd_ext, typ):
    if has_to_be_emulated(simd_ext, typ):
        return emulation_code('foo', simd_ext, typ, ['v', 'v'])
    else:
        return 'return vec_foo({in0});'.format(**fmtspec)

Here are some notes concerning the PPC implementation:

  1. For VMX, intrinsics on double almost never exist.

  2. The Python helper function has_to_be_emulated returns True when the implementation of foo concerns float16, or doubles for VMX. When this function returns True, you can then use emulation_code.

  3. The emulation_code function returns a generic implementation of an operator. However, this implementation is not suitable for every operator and the programmer has to take care of that.

  4. Do not forget to add the foo entry to the impls dictionary in the get_impl Python function.

The scalar CPU version

def foo1(func, typ):
    expr = '1 / (1 - {in0}) + 1 / ((1 - {in0}) * (1 - {in0}))'
    normal = 'return ({typ})(' + expr + ');'
    if typ == 'f16':
        return \
        '''#ifdef NSIMD_NATIVE_FP16
             {normal}
           #else
             return nsimd_f32_to_f16({normal_fp16});
           #endif'''. \
           format(normal=normal.format(**fmtspec),
                  normal_fp16=expr.format(
                      in0='nsimd_f16_to_f32({in0})'.format(**fmtspec)))
    else:
        return normal.format(**fmtspec)

The only caveat for the CPU scalar implementation is to handle float16 correctly. The easiest way to do this is to have the same implementation as for float32 but replacing each {in0} by nsimd_f16_to_f32({in0}) and converting the float32 result back to a float16.

The GPU versions

The GPU generator Python files cuda.py, rocm.py and oneapi.py are a bit different from the other files, but it is easy to find where to add the relevant pieces of code. Note that the ROCm syntax is fully compatible with CUDA's, so one only needs to modify the cuda.py file; oneapi.py is easy to understand on its own.

The code for float32's, to be added inside the get_impl Python function, is as follows.

return '1 / (1 - {in0}) + 1 / ((1 - {in0}) * (1 - {in0}))'.format(**fmtspec)

The code for float16's for CUDA and ROCm is as follows. It has to be added inside the get_impl_f16 Python function.

arch53_code = '''__half one = __float2half(1.0f);
                 return __hadd(
                               __hdiv(one, __hsub(one, {in0})),
                               __hmul(
                                      __hdiv(one, __hsub(one, {in0})),
                                      __hdiv(one, __hsub(one, {in0}))
                                     )
                              );'''.format(**fmtspec)

As Intel oneAPI natively supports float16's, the code is the same as the one for floats:

return '1 / (1 - {in0}) + 1 / ((1 - {in0}) * (1 - {in0}))'.format(**fmtspec)

Implementing the test for the operator

Now that we have written the implementations of the foo operator, we must write the corresponding tests. All test generation is done by egg/gen_tests.py. Writing tests is simpler: the intrinsic that we just implemented can be tested by an already-written test pattern, namely the gen_test Python function.

Here is how the egg/gen_tests.py is organized:

  1. The entry point is the doit function located at the bottom of the file.

  2. In the doit function, dispatching is done according to the operator that is to be tested. Not all operators can be tested by the same C/C++ code. Reading the different kinds of tests is rather easy and we will not go through all the code in this document.

  3. All Python functions generating test code begin with the following:

        filename = get_filename(opts, op, typ, lang)
        if filename == None:
            return

    This must be the case for newly created functions. The get_filename function checks whether the file should be created with respect to the command line options given to the egg/hatch.py script. Then note that, to output to a file, the Python function open_utf8 must be used, to handle Windows and to automatically put the MIT license at the beginning of generated files.

  4. Tests must be written for C base API, the C++ base API and the C++ advanced API.

If you need to create a new kind of test, then the best way is to copy-paste the Python function that produces the test that most resembles the one you want. Then modify the new function to suit your needs. Here is a quick overview of the Python functions present in the egg/gen_tests.py file:

Not all tests are to be done

As explained in how_tests_are_done.md, doing all tests is not recommended. Take for example the cvt operator. Testing cvt from, say, f32 to i32 is complicated as the result depends on how NaNs and infinities are handled and on the current rounding mode. In turn, these parameters depend on the vendor, the chip, the bugs in the chip, the rounding mode chosen by users or other software...
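The NaN problem can be seen from plain Python with the struct module: reinterpreting an integer bit pattern as a float may yield a NaN, which never compares equal to itself, so exact round-trip tests cannot work:

```python
import struct

bits = 0x7fc00001  # a quiet-NaN bit pattern in IEEE 754 binary32
# Pack the integer bits and unpack them as a 32-bit float.
x = struct.unpack('<f', struct.pack('<I', bits))[0]
print(x != x)  # True: x is NaN
```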

The function should_i_do_the_test gives a hint on whether to implement the test or not. Its code is really simple and you may need to modify it. The listing below is a possible implementation that takes care of the cases described in the previous paragraph.

def should_i_do_the_test(operator, tt='', t=''):
    if operator.name == 'cvt' and t in common.ftypes and tt in common.iutypes:
        # When converting from float to int to float then we may not
        # get the initial result because of roundings. As tests are usually
        # done by going back and forth then both directions get tested in the
        # end
        return False
    if operator.name == 'reinterpret' and t in common.iutypes and \
       tt in common.ftypes:
        # When reinterpreting from int to float we may get NaN or infinities
        # and no one knows what this will give when going back to ints
        # especially when float16 are emulated. Again as tests are done by
        # going back and forth both directions get tested in the end.
        return False
    if operator.name in ['notb', 'andb', 'andnotb', 'xorb', 'orb'] and \
       t == 'f16':
        # Bit operations on float16 are hard to check because they are
        # emulated in most cases. Therefore going back and forth with
        # reinterprets for doing bitwise operations makes the bit in the
        # last place wrong. This is normal but makes testing really hard.
        # So for now we do not test them on float16.
        return False
    if operator.name in ['len', 'set1', 'set1l', 'mask_for_loop_tail',
                         'loadu', 'loada', 'storeu', 'storea', 'loadla',
                         'loadlu', 'storela', 'storelu', 'if_else1']:
        # These functions are used in almost every test so we consider
        # that they are extensively tested.
        return False
    if operator.name in ['store2a', 'store2u', 'store3a', 'store3u',
                         'store4a', 'store4u', 'scatter', 'scatter_linear',
                         'downcvt', 'to_logical']:
        # These functions are tested along with their load counterparts.
        # downcvt is tested along with upcvt and to_logical is tested with
        # to_mask
        return False
    return True

Conclusion

At first sight, the implementation of foo seems complicated because intrinsics for all types and all architectures are not provided by vendors. But nsimd provides a lot of helper functions and tries to hide details so that wrapping intrinsics is done quickly and easily; the goal is that the programmer can concentrate on the implementation itself. Be aware that more complicated tricks can be implemented. Browse through a platform_*.py file to see what kinds of tricks are used and how they are implemented.

How do I add a new category?

Adding a category is much simpler than adding an operator. It suffices to add a class with only one member, named title, as follows:

class DocMyCategoryName(DocCategory):
    title = 'My category name functions'

The class must inherit from the DocCategory class and its name must begin with Doc. The system will then take it into account, generate the entry in the documentation and so on.

How do I add a new module?

A module is a set of functionalities that makes sense to provide alongside NSIMD but that cannot be part of NSIMD's core. Therefore it is not mandatory to provide all C and C++ API versions or to support all operators. For what follows, let's call the module we want to implement mymod.

Include files (written by hand or generated by Python) must be placed into the nsimd/include/nsimd/modules/mymod directory and a master header file must be placed at nsimd/include/nsimd/modules/mymod.h. You are free to organize the nsimd/include/nsimd/modules/mymod folder as you see fit.

Your module has to be found by the NSIMD generation system. For this you must create the nsimd/egg/modules/mymod directory and the nsimd/egg/modules/mymod/hatch.py file. The latter must expose the following functions:

Tests for the module have to be put into the nsimd/tests/mymod directory.

How do I add a new platform?

The list of supported platforms is determined by looking in the egg directory and listing all platform_*.py files. Each file must contain all SIMD extensions for a given platform. For example the default (no SIMD) is given by platform_cpu.py. All the Intel SIMD extensions are given by platform_x86.py.

Each Python file that implements a platform must be named platform_[name for platform].py and must export at least the following functions:

Then you are free to implement the SIMD extensions for the platform. See above on how to add the implementations of operators.