Arbitrary size bitvectors

Open flip111 opened this issue 6 years ago • 9 comments

Could this library support arbitrarily sized bitvectors? Suppose I have two 32-bit vectors; I could store each in a 32-bit unsigned integer. But if I have bitvectors of 1200 bits, it would be ideal to break them up into two 512-bit pieces and one 256-bit piece (of which only the remaining 176 bits are needed). I would like to perform simple operations such as AND, OR, and NOT, and arithmetic like + and -.

flip111 avatar Nov 29 '17 22:11 flip111

There's also a bit of a question of compile-time arbitrary sizes versus run-time (dynamic) arbitrary sizes. If my understanding of the work being done on Rust right now is correct, I believe compile-time may be possible soon. I can't find the RFC or PR at the moment, though.

nixpulvis avatar Nov 30 '17 02:11 nixpulvis

At the moment, bitvectors in Rust are typically backed by primitive types - as long as you can get the underlying type or slice, you can load it into a SIMD vector and treat it as if it were any other &[u8] et al.

Currently, faster would load your theoretical 1200-bit vector into two 512-bit registers (on an AVX-512 machine), and then operate on the remaining 176 bits in a scalar fashion. I'm currently cooking up a change that would cause it to be loaded into three 512-bit registers. I'm not as familiar with other architectures' SIMD implementations, but the benefit of using a 128- or 256-bit register on x86 over a 256- or 512-bit register is negligible (and that kind of femto-optimization should be done in assembly anyway).
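A minimal sketch of that chunk-plus-scalar-tail pattern in plain Rust, with eight u64 words standing in for one 512-bit register (the function and layout are illustrative, not faster's API):

```rust
// Illustrative "wide chunks + scalar tail" AND over word-backed bitvectors.
// Eight u64 words model one 512-bit register; a 1200-bit vector occupies
// 19 words, so two full chunks are followed by a 3-word scalar tail.
fn bitand_words(a: &[u64], b: &[u64], out: &mut [u64]) {
    const WORDS_PER_REG: usize = 8; // 512 bits / 64 bits per word
    let vector_end = a.len() - a.len() % WORDS_PER_REG;

    // "Vector" portion: each 8-word chunk maps onto one wide register.
    for i in (0..vector_end).step_by(WORDS_PER_REG) {
        for j in 0..WORDS_PER_REG {
            out[i + j] = a[i + j] & b[i + j];
        }
    }

    // Scalar tail: the leftover words (the last 176 bits of 1200).
    for i in vector_end..a.len() {
        out[i] = a[i] & b[i];
    }
}
```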

Bitwise operations on bitvectors make sense, but what kind of arithmetic would you want to do on these types?

AdamNiederer avatar Nov 30 '17 02:11 AdamNiederer

@nixpulvis

There's also a bit of a question of compile-time arbitrary sizes versus run-time (dynamic) arbitrary sizes. If my understanding of the work being done on Rust right now is correct, I believe compile-time may be possible soon. I can't find the RFC or PR at the moment, though.

Compile-time would be sufficient for me. I guess run-time would need a JIT. I think you are referring to RFC https://github.com/rust-lang/rfcs/blob/master/text/2000-const-generics.md

@AdamNiederer

as long as you can get the underlying type or slice, you can load it into a SIMD vector and treat it as if it were any other &[u8] et al

Can't I just initialize an empty (0-filled) SIMD bitvector, and then fiddle with the bits afterwards?

Bitwise operations on bitvectors makes sense, but what kind of arithmetic would you want to do on these types?

+, -, abs, *, /, rem, mod, <, <=, >, >=, =, /=, <<, >>, both signed and unsigned. We can worry about the implementation after there is a basic type to impl a trait on...

flip111 avatar Nov 30 '17 14:11 flip111

Compile-time would be sufficient for me. I guess run-time would need a JIT. I think you are referring to RFC https://github.com/rust-lang/rfcs/blob/master/text/2000-const-generics.md

I'm pretty sure we can do it with O(1) branches per SIMD operation. AFAIK there's no reason we can't just dump illegal instructions into the binary as long as the PC never actually hits one. That means we could just have a tiny jump table for SSE4.2, AVX2, and AVX512, followed by code for each.
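A sketch of that once-per-program dispatch, using present-day std APIs (std::sync::OnceLock and the is_x86_feature_detected! macro, both of which postdate this thread); the backend functions are hypothetical stubs:

```rust
// Detect CPU features once, cache a function pointer, and pay only an
// indirect call per operation afterwards. x86/x86_64 only, since
// is_x86_feature_detected! does not exist on other architectures.
use std::sync::OnceLock;

type AndFn = fn(&[u64], &[u64], &mut [u64]);

fn and_avx512(_a: &[u64], _b: &[u64], _out: &mut [u64]) { /* stub */ }
fn and_avx2(_a: &[u64], _b: &[u64], _out: &mut [u64]) { /* stub */ }
fn and_scalar(_a: &[u64], _b: &[u64], _out: &mut [u64]) { /* stub */ }

fn and_impl() -> AndFn {
    static IMPL: OnceLock<AndFn> = OnceLock::new();
    *IMPL.get_or_init(|| {
        if is_x86_feature_detected!("avx512f") {
            and_avx512
        } else if is_x86_feature_detected!("avx2") {
            and_avx2
        } else {
            and_scalar
        }
    })
}
```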

Can't I just initialize an empty (0-filled) SIMD bitvector, and then fiddle with the bits afterwards?

Sure, the library doesn't care how you fill the vector - it just gives you a few convenient ways to express common SIMD load-store patterns.
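For instance, a zero-filled, word-backed buffer plus ordinary bit fiddling (a sketch; faster does not ship such a bitvector type):

```rust
// Zero-filled 1200-bit buffer backed by u64 words (rounds up to 19 words).
fn main() {
    let n_bits = 1200;
    let mut words = vec![0u64; (n_bits + 63) / 64];

    // Set bit 130, then test it.
    words[130 / 64] |= 1u64 << (130 % 64);
    assert_ne!(words[130 / 64] & (1u64 << (130 % 64)), 0);
}
```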

+, -, abs, *, /, rem, mod, <, <=, >, >=, =, /=, <<, >>, both signed and unsigned. We can worry about the implementation after there is a basic type to impl a trait on...

Those ops require a fixed underlying element size for SIMD to operate on. Do you want to add/sub/mul/div an n-bit number to another n-bit number? That's not really what SIMD is for.

AdamNiederer avatar Nov 30 '17 17:11 AdamNiederer

That means we could just have a tiny jump table for SSE4.2, AVX2, and AVX512, followed by code for each.

Looks good, if that is an OPTION of the library. I know I always run the code on the same machine it is compiled on, and it needs to be as fast as possible (without extra branching). Is compile-time detection still possible?

Those ops require an underlying type size for SIMD operation.

Yes, that's what this issue is for (arbitrary sizes). But from the user's point of view it would be nice to have an n-bit type where SIMD is the underlying implementation.

Do you want to add/sub/mul/div an n-bit number to another n-bit number?

Yes, exactly. For example, for a 130-bit number + a 10-bit number, the maximum output size is 131 bits. But I think those kinds of implementations can be done with a trait if the base type can support different sizes. I think the arithmetic rules can be placed in a different library. SIMD is useful here because you can fill leading unused bits with zeros (padding on the MSB side) and then have more bits available than just the standard 64 unsigned bits, so the CPU takes fewer cycles. Obviously this would only need to kick in if the number is more than 64 bits. It also means that if we have generic constant types, the output type depends on the input types... but the possible combinations are finite, and actually not so many, I think. (Binary size is not so important.)
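To make the arithmetic concrete, here is a sketch of schoolbook addition over little-endian 64-bit limbs, independent of faster; the shorter operand is implicitly zero-padded, matching the MSB-padding idea above:

```rust
// Add two unsigned numbers stored as little-endian u64 limbs.
// A 130-bit number needs 3 limbs and a 10-bit number 1 limb; the sum
// needs at most one bit more than the wider input (131 bits here).
fn add_limbs(a: &[u64], b: &[u64]) -> Vec<u64> {
    let n = a.len().max(b.len());
    let mut out = Vec::with_capacity(n + 1);
    let mut carry = 0u64;
    for i in 0..n {
        let x = *a.get(i).unwrap_or(&0); // implicit zero padding
        let y = *b.get(i).unwrap_or(&0);
        let (s, c1) = x.overflowing_add(y);
        let (s, c2) = s.overflowing_add(carry);
        out.push(s);
        carry = (c1 as u64) + (c2 as u64); // at most 1
    }
    out.push(carry); // the possible 131st bit
    out
}
```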

flip111 avatar Nov 30 '17 17:11 flip111

Looks good, if that is an OPTION of the library. I know I always run the code on the same machine it is compiled on, and it needs to be as fast as possible (without extra branching). Is compile-time detection still possible?

Probably, but the jump table will waste maybe 20 cycles in a cold part of the program. People concerned about that would probably benefit from dropping down to raw intrinsics or writing assembly anyway.

Yes, exactly. For example, for a 130-bit number + a 10-bit number, the maximum output size is 131 bits. But I think those kinds of implementations can be done with a trait if the base type can support different sizes. I think the arithmetic rules can be placed in a different library

SIMD won't be useful in that scenario at all - you want an arbitrary precision/size math library.

EDIT: Actually, it looks like you could do this with some AVX-512 bit hacking. I don't think it's very generalizable though.

AdamNiederer avatar Nov 30 '17 17:11 AdamNiederer

Probably, but the jump table will waste maybe 20 cycles in a cold part of the program

I don't understand this. When you can detect at compile time which instructions the CPU supports, why is a jump table still needed?

People concerned about that would probably benefit from dropping down to raw intrinsics or writing assembly anyway.

It seems strange that to detect at compile time which instructions your CPU supports, you would have to drop down to assembly. In the code in the readme it seems possible from Rust: if cfg!(all(not(target_feature = "avx"), target_feature = "sse")) { ... But I guess that is the code for runtime detection?

you want an arbitrary precision/size math library

The problem with an arbitrary precision/size math library is that it's all runtime. It will add overhead and possibly even use inefficient datatypes/constructs. If I know at compile time that I want to add a 130-bit vector to a 10-bit vector, it can be really optimized.

flip111 avatar Nov 30 '17 18:11 flip111

I don't understand this. When you can detect at compile time which instructions the CPU supports, why is a jump table still needed?

When you're compiling a binary to distribute, you don't know what features your users' processors support. If I compile an AVX2 binary and give it to somebody who owns a Core 2, they're not going to get a correct result. If I compile a SSE42 binary and give it to somebody with an Ice Lake chip, there's probably some performance being left on the table.

It seems strange that to detect at compile time which instructions your CPU supports, you would have to drop down to assembly

I'm talking about run-time detection here - compile-time detection is totally free, and run-time detection can be done from Rust at nearly no cost. If a user of the library is concerned about the tiny overhead of runtime detection, however, they're probably better off not using this library (so they can manually unroll load/store loops, swizzles, etc.).
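For the compile-time path, a hedged sketch: #[cfg(target_feature = ...)] selects a branch when the compiler is told what the target supports, so no detection code is left in the binary at all:

```rust
// Only one of these survives compilation; the choice is made by the
// target features passed to rustc, e.g.:
//   RUSTFLAGS="-C target-cpu=native" cargo build --release
#[cfg(target_feature = "avx2")]
fn and_words(a: u64, b: u64) -> u64 {
    a & b // stands in for an AVX2-specialized path
}

#[cfg(not(target_feature = "avx2"))]
fn and_words(a: u64, b: u64) -> u64 {
    a & b // portable fallback
}
```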

AdamNiederer avatar Nov 30 '17 18:11 AdamNiederer

When you're compiling a binary to distribute, you don't know what features your users' processors support.

Yeah, this would be a great feature to have as an option for the library. But is it also possible to detect it at compile time, for the CPU the program is compiled on? Or possibly give it as a compilation flag so it can be distributed for specific CPUs?

If a user of the library is concerned about the tiny overhead of runtime detection, however, they're probably better off not using this library

OK, it seems that the idea of this library revolves around runtime detection. Perhaps consider compile-time detection or a flag as a feature.

I guess it depends on the implementation of the runtime detection. If it happens once at the start of your program and then runs a specific version of the entire program, it can be as fast as compile time. But if it needs to branch at every operation, it's not the best solution for arbitrary-sized bitvectors, I think... because those add overhead themselves on top of the instruction detection.

flip111 avatar Nov 30 '17 19:11 flip111