
Support inlining C and C++ (or even LLVM IR) code into nopython-jitted function/class

Open polkovnikov opened this issue 4 years ago • 10 comments

Feature request

It would be great if it were possible to inline regular C or C++ code into a nopython-jitted function, something like the following:

@numba.njit
def f(a):
    c_funcs = numba.c_func("""
        inline int add(int a, int b) { return a + b; }
        inline int mul(int a, int b) { return a * b; }
    """)
    b = 3
    for i in range(5):
        a = c_funcs.mul(c_funcs.add(a, a), b)
    return a

The main idea here is that the C functions (add, mul) should be inlined into f() and the whole thing optimized by LLVM.

Of course there is CFFI support, which allows compiling arbitrary C functions into a .pyd module and then using them inside an njitted function. The drawback is that these C functions are called by address rather than inlined into the njitted code, so they are not optimized by LLVM as a whole.

I think there should be some way to mix Numba's Python code and C/C++ code directly, because not everything can be done in pure Python.

For example, if I want to multiply u64 x u64 -> u128, there is no single-instruction operation for that in Python or Numba, while in C/C++ it can be done with `unsigned __int128 c = (unsigned __int128)a * b;` (for uint64_t a, b) in Clang, or with `uint64_t hi; uint64_t lo = _umul128(a, b, &hi);` in MSVC. Either compiles down to a single assembler mul instruction taking a few CPU cycles. In Python you can't do this as one CPU instruction.
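For comparison, the semantics of that full-width multiply can be sketched in plain Python using arbitrary-precision ints (at interpreter speed, i.e. many operations where machine code needs one mul instruction); the `umul128` name here is just illustrative:

```python
MASK64 = (1 << 64) - 1  # lower 64 bits

def umul128(a, b):
    """Full u64 x u64 -> u128 product, returned as (hi, lo) 64-bit halves."""
    product = (a & MASK64) * (b & MASK64)  # Python big-int multiply
    return (product >> 64) & MASK64, product & MASK64

hi, lo = umul128(0xFFFFFFFFFFFFFFFF, 0xFFFFFFFFFFFFFFFF)
# hi == 0xFFFFFFFFFFFFFFFE, lo == 1
```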

Of course one can write an array u128-multiplication C function using CFFI, and then the non-inlined function call overhead is small. But it is not always possible to act on a whole array - for example, I may want to implement a jitclass that emulates u128 and use this u128 class everywhere for single-value variables in some njitted mathematical code that does not work on arrays at all.

Another use case is implementing a jitclass that emulates BigInteger, so that BigInteger (similar to Python's int) becomes available in nopython functions. Of course, an efficient single-value (non-array) BigInteger is not possible to implement without inlineable C/C++ functions.

Why is C/C++ inlining crucial? Because it often happens that Numba's Python is lacking some operation that in C/C++ (or even assembler) takes only 1-3 CPU instructions. A non-inlined function call wrapping 1-3 instructions has far too much overhead.

Also, since Numba is LLVM-based, it would be great to be able to inline LLVM IR (LLVM Intermediate Representation), or some other assembler-like language. When Python code is jitted it is of course converted to LLVM IR at some point anyway, so inlining one piece of LLVM IR into another looks like a natural thing.

Inlining LLVM IR would let anybody inline code from any language. For example, you don't support Rust, but a Rust developer can compile Rust to LLVM IR (the official Rust compiler uses LLVM as its backend) and then inline that LLVM IR into your nopython-jitted code. Hence LLVM IR inlining would allow supporting any language that is based on LLVM.

polkovnikov avatar Sep 05 '21 09:09 polkovnikov

Thanks for the request. Permitting the use of string-based C/C++ code in Numba is possible but probably practically quite difficult. It would require Numba to "know" about the clang compiler or interface with the relevant parts of it, cf. llvmlite/LLVM, and all the challenges that brings (text/bitcode representation, which LLVM versions are in use, finding the toolchain, etc.). I'll raise this at the weekly public meeting tomorrow https://numba.discourse.group/t/weekly-public-meeting-every-tuesday-for-2021/658/2.

Numba can already consume LLVM IR from text and bitcode via llvmlite, example https://llvmlite.readthedocs.io/en/stable/user-guide/binding/examples.html (API: https://llvmlite.readthedocs.io/en/stable/user-guide/binding/modules.html#factory-functions)
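As a minimal sketch of that llvmlite route (the IR text here is illustrative, not from any Numba API):

```python
import llvmlite.binding as llvm

# one-time LLVM initialisation
llvm.initialize()
llvm.initialize_native_target()
llvm.initialize_native_asmprinter()

ir_text = """
define double @Add(double %a, double %b) {
entry:
  %sum = fadd double %a, %b
  ret double %sum
}
"""

mod = llvm.parse_assembly(ir_text)  # ModuleRef built from textual IR
mod.verify()                        # raises RuntimeError on malformed IR
print([f.name for f in mod.functions])
```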

Inline assembly is also supported: https://llvmlite.readthedocs.io/en/stable/user-guide/ir/ir-builder.html#llvmlite.ir.IRBuilder.asm

The above use the builder class, which can be accessed via Numba's @intrinsic decorator https://numba.readthedocs.io/en/stable/extending/high-level.html#implementing-intrinsics

stuartarchibald avatar Sep 06 '21 13:09 stuartarchibald

@stuartarchibald Looking at your example here. If I understand correctly, this example compiles the IR and wraps it into a ctypes function cfunc(a, b).

If I use such a cfunc() from Numba's njitted function, then I think this function is NOT INLINED into the njitted function, right?

By inlining I mean not just placing a call instruction into the njitted function, but actually compiling the whole njitted function together with the LLVM IR of cfunc() as a whole - the same as what the inline modifier does in C/C++. The whole njitted function's IR should be mixed with the inlined cfunc()'s IR and then optimized together by LLVM; that is what I mean by inlining.

Your linked example just does a + b. If you use such a function in some heavy computational loop, then the extra call instruction on each a + b is a huge overhead. Moreover, the call instruction itself is not the only cost: C++ compilers do a lot of work when inlining an inline function - register propagation, bit tricks and so on. So inlined C++ code is sometimes even 5-10 times faster than non-inlined.

Same with my C++ suggestion above - of course I can already build non-inlined C/C++ code and call it through ctypes/cffi (maybe even Cython). The only reason I made my C++ proposal above is that I wanted not merely the convenience of compiling C++ from a Python string, but the ability to use LLVM's great optimizer to inline tiny functions like a + b and avoid the call instruction overhead.

Same for LLVM IR - I want not just the ability to somehow compile/call IR bitcode, but to get the full inlining optimization, the same way inline functions are optimized in C++.

Basically my proposal above is only about speed optimization. If run speed didn't matter, I could find plenty of ways to compile C/C++/Asm/LLVM IR into a .pyd and call that module's functions from an njitted function. But I wanted my code to be fast.

polkovnikov avatar Sep 06 '21 18:09 polkovnikov

> @stuartarchibald Looking at your example here. If I understand correctly, this example compiles the IR and wraps it into a ctypes function cfunc(a, b).
>
> If I use such a cfunc() from Numba's njitted function, then I think this function is NOT INLINED into the njitted function, right?

Correct, this example does do that. But in your case you'd not access the function via ctypes; you'd just generate a call to it using an @intrinsic https://numba.readthedocs.io/en/stable/extending/high-level.html#implementing-intrinsics.

> By inlining I mean not just placing a call instruction into the njitted function, but actually compiling the whole njitted function together with the LLVM IR of cfunc() as a whole - the same as what the inline modifier does in C/C++. The whole njitted function's IR should be mixed with the inlined cfunc()'s IR and then optimized together by LLVM; that is what I mean by inlining.

This is understood, and I think possible.

> Your linked example just does a + b. If you use such a function in some heavy computational loop, then the extra call instruction on each a + b is a huge overhead. Moreover, the call instruction itself is not the only cost: C++ compilers do a lot of work when inlining an inline function - register propagation, bit tricks and so on. So inlined C++ code is sometimes even 5-10 times faster than non-inlined.

Yes, this is why you need to compile the external source to bitcode/LLVM IR and add that module to the library that Numba is generating code into, so that it can all be linked together and inlining (and many other related optimisations) can take place.

> Same with my C++ suggestion above - of course I can already build non-inlined C/C++ code and call it through ctypes/cffi (maybe even Cython). The only reason I made my C++ proposal above is that I wanted not merely the convenience of compiling C++ from a Python string, but the ability to use LLVM's great optimizer to inline tiny functions like a + b and avoid the call instruction overhead.
>
> Same for LLVM IR - I want not just the ability to somehow compile/call IR bitcode, but to get the full inlining optimization, the same way inline functions are optimized in C++.
>
> Basically my proposal above is only about speed optimization. If run speed didn't matter, I could find plenty of ways to compile C/C++/Asm/LLVM IR into a .pyd and call that module's functions from an njitted function. But I wanted my code to be fast.

I've got an example of how to do all this but have one more thing to work out prior to sharing it.

The conclusion from the Numba meeting was that it is probably not something Numba can support directly due to the complexity of ensuring valid compilers/LLVM IR versions/type system behaviours etc. However, some of the parts needed to actually implement this could well be abstracted as something that Numba could support, for example, linking in an external bitcode source.

stuartarchibald avatar Sep 14 '21 09:09 stuartarchibald

@stuartarchibald Can you show in a few lines of code how to write an @intrinsic from LLVM IR assembly?

I've read https://numba.readthedocs.io/en/stable/extending/high-level.html#implementing-intrinsics, but there the IR is generated by IRBuilder, whereas in my case I have ready-made LLVM IR assembly text like the following:

define dso_local double @Add(double %a, double %b) local_unnamed_addr #0 {
entry:
  %add = fadd double %a, %b
  ret double %add
}

For example, given the assembly code above, how can I express inside the @intrinsic function something like:

@intrinsic
def f(a, b):
    return IRBuilder.to_bitcode(assembly_llvm_ir_text_from_above).get_func('Add')(a, b)

i.e. inside the @intrinsic, provide the LLVM IR assembly and return the generated bitcode with the arguments (a, b) applied. This intrinsic f() should inline that bitcode and optimize it together with the code of the @njit function that uses it.

Also, I think being able to embed/inline LLVM IR is quite enough, instead of C/C++ code. Anyone can compile C++ to LLVM IR, so that is not a big problem.

polkovnikov avatar Sep 18 '21 13:09 polkovnikov

@polkovnikov here's an example of how to do what's in the OP; the other cases you have mentioned are simplifications of this. I hope at some point to extract some useful parts into Numba's public extension API (the part about linking in some bitcode). The thing I've not sorted out yet in this example is the forcible inlining of the functions defined in the C source.

from numba import njit, types, literally
from numba.extending import overload, intrinsic
from numba.core import cgutils
import numpy as np
import subprocess
import tempfile
import llvmlite.binding as llvm
from llvmlite import ir
from collections import namedtuple, OrderedDict

def compile_cfunc(string, sigs):
    pass

@overload(compile_cfunc)
def ol_compile_cfunc(string, sigs):
    # invoke clang
    if not isinstance(string, types.Literal):
        def impl(string, sigs):
            literally(string)
        return impl
    c_src = string.literal_value
    sig_map = sigs.initial_value

    c_module = None
    # compile the C source
    with tempfile.TemporaryDirectory() as tmpdir:
        with tempfile.NamedTemporaryFile(mode='wt',
                                         encoding='ascii',
                                         dir=tmpdir,
                                         suffix='.c') as c_src_file:
            c_src_file.write(c_src)
            c_src_file.flush()
            cmd = 'clang -emit-llvm -c'.split(' ')
            bc_file = c_src_file.name.replace('.c','.bc')
            subprocess.run(cmd + [c_src_file.name, '-o', bc_file])
            with open(bc_file, 'rb') as bc:
                bc_bytes = bc.read()
            c_module = llvm.parse_bitcode(bc_bytes)

    assert c_module is not None, "Failed to compile C code"
    c_module.verify()

    # create an ordered map of C function name to signature based on the sig map
    # this is important as the struct member name generation order needs to
    # match up with what's been generated
    funcs = [f for f in c_module.functions]
    sigs = OrderedDict()
    for func in funcs:
        assert func.name in sig_map
        sigs[func.name] = sig_map[func.name]

    # add the C module to the code library
    @intrinsic
    def add_to_ee(tyctx,):
        sig = types.none()
        def codegen(cgctx, builder, sig, llargs):
            cgctx.active_code_library.add_llvm_module(c_module)
        return sig, codegen

    # this is a dynamically created namedtuple pretending to be a struct
    c_struct = namedtuple('c_struct', [*sigs.keys()])

    # generate dispatcher stubs
    dispatchers = []
    for fname in sigs.keys():
        def gen(fname=fname):
            tysig = sigs[fname]
            sigty = eval(tysig, {}, types.__dict__)
            @intrinsic
            def gen_call(tyctx, arg):
                sig = sigty.return_type(arg)
                # make sure the incoming args match the declared
                declared_args = sigty.args
                presented_arg = arg
                if isinstance(presented_arg, types.containers._StarArgTupleMixin):
                    assert presented_arg.types == declared_args
                else:
                    assert 0, 'unreachable'

                def codegen(cgctx, builder, sig, llargs):
                    stararg = llargs[0]
                    tupl = cgutils.unpack_tuple(builder, stararg)
                    mod = builder.module
                    ll_arg_tys = [cgctx.get_value_type(x) for x in sigty.args]
                    ll_retty = cgctx.get_value_type(sigty.return_type)
                    ll_sig_ty = ir.FunctionType(ll_retty, ll_arg_tys)
                    fn = cgutils.get_or_insert_function(mod, ll_sig_ty, fname)
                    return builder.call(fn, tupl)
                return sig, codegen

            @njit(inline='always')
            def fncall(*args):
                return gen_call(args)

            return fncall

        dispatchers.append(gen(fname=fname))

    # create the struct instance
    c_struct_inst = c_struct(*dispatchers)

    # return this trivial function, it forces the C code module into the EE
    # and returns the c_struct containing the dispatchers from globals
    def impl(string, sigs):
        add_to_ee()
        return c_struct_inst
    return impl


@njit
def f(a):
    c_funcs = compile_cfunc("""
    extern int add(int a, int b) { return a + b; }
    extern int mul(int a, int b) { return a * b; }
    extern double fmadd(double a, double b, double c) { return a + (b * c); }
    extern double mixed_fmadd(int a, int b, double c) { return a + (b * c); }
    """,
    {'add': 'intp(intp, intp)',
     'mul': 'intp(intp, intp)',
     'fmadd': 'double(double, double, double)',
     'mixed_fmadd': 'double(intp, intp, double)'})
    b = 7
    for i in range(5):
        a = c_funcs.mul(c_funcs.add(a, a), b)
    x = c_funcs.fmadd(np.float64(a), np.float64(b), 11.)
    y = c_funcs.mixed_fmadd(a, b, 11.)
    return a, x, y

got = f(3)


def g(a):
    b = 7
    def mul(x, y):
        return x * y
    def add(x, y):
        return x + y
    def fmadd(p, q, r):
        return p + (q * r)
    for i in range(5):
        a = mul(add(a, a), b)
    return a, fmadd(a, b, 11.), fmadd(a, b, 11.)

expected = g(3)

print(f"got: {got}, expected: {expected}. OK={got==expected}")
assert got == expected

stuartarchibald avatar Sep 28 '21 09:09 stuartarchibald

@stuartarchibald Your code above, without modifications, doesn't compile on my latest (from pip) Numba - I have Win-64 Python 3.9.1, Numba 0.54.0, llvmlite 0.37.0. It throws an error dump (in short, Cannot request literal type) at line 112 (the line c_funcs = compile_cfunc(""").

Maybe it needs another version of Numba?

polkovnikov avatar Sep 29 '21 05:09 polkovnikov

> @stuartarchibald Your code above, without modifications, doesn't compile on my latest (from pip) Numba - I have Win-64 Python 3.9.1, Numba 0.54.0, llvmlite 0.37.0. It throws an error dump (in short, Cannot request literal type) at line 112 (the line c_funcs = compile_cfunc(""").
>
> Maybe it needs another version of Numba?

You most probably need to put clang on the PATH; the above would likely happen if the call to clang (or the compilation) failed.

stuartarchibald avatar Sep 29 '21 08:09 stuartarchibald

@stuartarchibald Thanks, indeed the problem was that Clang was not present in PATH. After fixing that I got the correct output:

got: (1613472, 1613549.0, 1613549.0), expected: (1613472, 1613549.0, 1613549.0). OK=True

Can you tell me why the assert inside the call to compile_cfunc() didn't show up? Even if I put assert False at the beginning of the function, I get Numba's error Cannot request literal type instead of any exception.

Is there any way for compile_cfunc() to inform the user about exceptions that happened inside its body? Of course I can add a global try/except plus print('Exception:', exception), but maybe there is an official Numba way to propagate the error reason to the user?

polkovnikov avatar Sep 29 '21 09:09 polkovnikov

@stuartarchibald I have another question regarding your cool code above (BTW, your code should probably become a utility function in the Numba package).

Your code uses just a single module of LLVM IR bitcode. But how should one deal with external precompiled libraries and/or multi-module LLVM IR?

For multi-module LLVM IR, or libraries available in source form, I guess I can compile all .cpp/.c files to .bc and then join the multiple .bc files into a single .bc with the llvm-link program.
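[Editorial note: in-process, llvmlite can do the llvm-link step itself for modules that exist as IR or bitcode - though this does not help with machine-code .lib archives. A sketch with illustrative function names:]

```python
import llvmlite.binding as llvm

llvm.initialize()
llvm.initialize_native_target()
llvm.initialize_native_asmprinter()

mod_a = llvm.parse_assembly("""
define i64 @double_it(i64 %x) {
  %r = add i64 %x, %x
  ret i64 %r
}
""")

mod_b = llvm.parse_assembly("""
declare i64 @double_it(i64)

define i64 @quad(i64 %x) {
  %d = call i64 @double_it(i64 %x)
  %r = call i64 @double_it(i64 %d)
  ret i64 %r
}
""")

# equivalent of llvm-link: mod_b is merged into (and consumed by) mod_a
mod_a.link_in(mod_b)
mod_a.verify()
print(sorted(f.name for f in mod_a.functions))
```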

But what about precompiled libraries that are available as .lib files only? The C++ standard library is one such example: all of its functions that are not inlined are available only through external linkage of .lib files.

Is there any place in your code above where I can feed in and link .lib files? I guess I can't convert .lib files to .bc, right (as .bc files are CPU-model-independent while .lib files are built for a concrete CPU)?

As I understand it, Numba compiles all code to final LLVM IR and then LLVM converts it to CPU-dependent machine code. So probably, after converting to the CPU-dependent form, LLVM might also allow linking in external .lib files built for the given CPU.

I think Numba already links in some .lib files. For example, if (as I understand) Numba uses C++ or C as a backend (converts Python code to C++/C), then some standard library functions must definitely be used, hence to compile the final JIT code Numba should somehow link in those .lib files (at least the standard library's).

polkovnikov avatar Sep 30 '21 06:09 polkovnikov

@stuartarchibald, Very nice example of including C functions! Unfortunately, the example no longer works with llvmlite==0.42.0 and Clang 14 and requires some adjustments. Apparently, the generated bitcode cannot be parsed directly by llvmlite. However, it is possible to perform parse_assembly after some adjustments to the generated LLVM IR metadata:

  1. replace uwtable(sync) by "uwtable(sync)", i.e. in quotes
  2. reset the behavior flag in the llvm.module.flags metadata from 8 to 1 (8 is not recognised by llvmlite, while the value 1 was used in the older code)

Moreover, Clang adds an internal function llvm.fmuladd.f64 to the module, which is not listed in sig_map, failing the assert statement. Fortunately, after fixing these, the adjusted example works. Thanks for sharing!
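[Editorial note: the two textual adjustments described above could be scripted roughly as follows; `patch_ir_for_llvmlite` is a hypothetical helper, and the blunt flag rewrite assumes the module flags appear in the IR text as `!{i32 8, ...}` entries:]

```python
import re

def patch_ir_for_llvmlite(ir_text):
    # 1. quote the bare uwtable(sync) attribute, which this llvmlite
    #    version's IR parser does not accept unquoted
    patched = ir_text.replace('uwtable(sync)', '"uwtable(sync)"')
    # 2. downgrade module-flag behavior 8 to 1 in !llvm.module.flags
    #    entries, since behavior 8 is not recognised by llvmlite
    patched = re.sub(r'!\{\s*i32 8,', '!{i32 1,', patched)
    return patched

sample = ('attributes #0 = { nounwind uwtable(sync) }\n'
          '!0 = !{i32 8, !"PIC Level", i32 2}')
print(patch_ir_for_llvmlite(sample))
```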

dima-quant avatar May 18 '24 19:05 dima-quant