CRoaring icon indicating copy to clipboard operation
CRoaring copied to clipboard

Use function pointers for runtime dispatching

Open Oppen opened this issue 4 years ago • 6 comments

Rather than branching each time, consider doing something like this:

int (*run_container_cardinality)(const run_container_t *run) = run_container_cardinality_dispatch;

int run_container_cardinality_dispatch(const run_container_t *run) {
    if (croaring_avx2()) {
        run_container_cardinality = _avx2_run_container_cardinality;
    } else {
        run_container_cardinality = _scalar_run_container_cardinality;
    }
    return run_container_cardinality(run);
}

This way after the first call all calls will be direct. It may give a tiny performance gain.

Oppen avatar Jun 13 '21 03:06 Oppen

Is it your expectation that the code you are proposing is thread safe?

I think you need atomic pointers.

Otherwise: pull requests are invited.

lemire avatar Jun 13 '21 19:06 lemire

Good point. Are indirect calls atomic in general? In ARM? x86?

Oppen avatar Jun 13 '21 19:06 Oppen

Good point. Are indirect calls atomic in general? In ARM? x86?

Even if the underlying hardware does it for free, you will still get flagged by the sanitizers.

The croaring_avx2() does rely on atomics for that reason. But it is not nearly as pretty as I would have hoped.

lemire avatar Jun 13 '21 20:06 lemire

See https://github.com/RoaringBitmap/CRoaring/blob/a5656ae1a0fae327487568a7490689844560fd22/include/roaring/isadetection.h#L170-L198

lemire avatar Jun 13 '21 20:06 lemire

Note that we don't do runtime dispatching under ARM. With ARM, NEON is usually available by default so you would not need it now. Even so, we may hope that, for that reason, compilers do a good job using NEON via autovectorization if the proper optimization flags are provided under ARM.

We only do runtime dispatching for AVX2 support. It is where the gains are most important since compilers won't emit AVX2 by default. But most people have AVX2 support at this point in time. So there is huge performance gap there.

lemire avatar Jun 13 '21 20:06 lemire

Note that I had a terrible performance bug in my initial implementation.

Fixed by https://github.com/RoaringBitmap/CRoaring/commit/6403d44cbea549b31c16266d57cf9a42aac320b6

I would call dynamic_croaring_detect_supported_architectures() on dispatching. We now call croaring_detect_supported_architectures().

The roaring_detect_supported_architectures() is much more efficient because it only does the CPU detection the first time it is called.

static inline uint32_t croaring_detect_supported_architectures() {
    static std::atomic<int> buffer{CROARING_UNINITIALIZED};
    if(buffer == CROARING_UNINITIALIZED) {
      buffer = dynamic_croaring_detect_supported_architectures();
    }
    return buffer;
}

It should be quite cheap.

Note that, in CRoaring, we use runtime dispatch strategically. It is not used for processing small blocks of data, it is always to process, e.g., a whole container or aggregate two containers. Of course, if you a ton of small containers, there is some overhead, but then you have other problems then.

lemire avatar Jul 14 '21 16:07 lemire