Performance optimisation for `cobra_apply`?
I was just playing around with this library and was having a look at the cobra_apply function. It recalculates a.reverse_bits() >> ((block_width - 1).leading_zeros()); an awful lot due to the number of loops, and it showed up as a big chunk in the flame chart.
Block width is fixed at 128, and a (and b and c where relevant) are also always just 0..BLOCK_WIDTH. I tried just extracting it into a lookup table:
static FORWARD: [usize; 128] = [
0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25,
26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49,
50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73,
74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97,
98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116,
117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127,
];
static REVERSED: [usize; 128] = [
0, 64, 32, 96, 16, 80, 48, 112, 8, 72, 40, 104, 24, 88, 56, 120, 4, 68, 36, 100, 20, 84, 52,
116, 12, 76, 44, 108, 28, 92, 60, 124, 2, 66, 34, 98, 18, 82, 50, 114, 10, 74, 42, 106, 26, 90,
58, 122, 6, 70, 38, 102, 22, 86, 54, 118, 14, 78, 46, 110, 30, 94, 62, 126, 1, 65, 33, 97, 17,
81, 49, 113, 9, 73, 41, 105, 25, 89, 57, 121, 5, 69, 37, 101, 21, 85, 53, 117, 13, 77, 45, 109,
29, 93, 61, 125, 3, 67, 35, 99, 19, 83, 51, 115, 11, 75, 43, 107, 27, 91, 59, 123, 7, 71, 39,
103, 23, 87, 55, 119, 15, 79, 47, 111, 31, 95, 63, 127,
];
and then replacing all the similar
for a in 0..BLOCK_WIDTH {
let a_rev = a.reverse_bits() >> ((BLOCK_WIDTH - 1).leading_zeros());
with
for (a, a_rev) in FORWARD.into_iter().zip(REVERSED) { ...
and I got 25 - 30% gains across log_n = 15..20.
The cobra test still seems to pass fine. Is this a valid optimisation?
EDIT: actually just doing REVERSED.into_iter().enumerate() is both more elegant and worth another couple of percent on my laptop.
I also note that declaring REVERSED as static, rather than const harms performance by a few percent.
But it would be good if these benches could be double checked by someone else to make sure this isn't all just artefactual on on my particular set up or something...
That seems legit to me. I'd appreciate a PR I could directly benchmark and inspect the generated assembly.
We might even be able to create a const FORWARD: [usize; BLOCK_WIDTH] = ... without hardcoding it thanks to Rust's const evaluation, but I'm fine with either approach.
If we're just sequentially reading the numbers from a table, we can probably vectorize the code if the compiler isn't doing that automatically already.
I think that PR should get you started in terms of seeing what I've done. Note my edits in the OP re getting rid of FORWARD.
Part of the reason I came back to look at this was to think about how it could be vectorised from the point of view of fearless_simd, but I get a bit stuck in all the nested loops and the conditional byte swapping.
Simply processing the data in 128-bit chunks with something like chunks_exact or as_chunks might already be enough for the compiler to vectorize the code. Wider chunks don't really make a lot of sense since even AVX2 only allows shuffling within 128-bit chunks and not aross them.