
Performance optimisation for `cobra_apply`?

Open mfreeborn opened this issue 3 months ago • 3 comments

I was just playing around with this library and had a look at the `cobra_apply` function. It recalculates `a.reverse_bits() >> ((block_width - 1).leading_zeros());` an awful lot due to the number of loops, and it showed up as a big chunk in the flame graph.

Block width is fixed at 128, and `a` (and `b` and `c` where relevant) always ranges over `0..BLOCK_WIDTH`. I tried just extracting it into a lookup table:

static FORWARD: [usize; 128] = [
    0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25,
    26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49,
    50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73,
    74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97,
    98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116,
    117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127,
];

static REVERSED: [usize; 128] = [
    0, 64, 32, 96, 16, 80, 48, 112, 8, 72, 40, 104, 24, 88, 56, 120, 4, 68, 36, 100, 20, 84, 52,
    116, 12, 76, 44, 108, 28, 92, 60, 124, 2, 66, 34, 98, 18, 82, 50, 114, 10, 74, 42, 106, 26, 90,
    58, 122, 6, 70, 38, 102, 22, 86, 54, 118, 14, 78, 46, 110, 30, 94, 62, 126, 1, 65, 33, 97, 17,
    81, 49, 113, 9, 73, 41, 105, 25, 89, 57, 121, 5, 69, 37, 101, 21, 85, 53, 117, 13, 77, 45, 109,
    29, 93, 61, 125, 3, 67, 35, 99, 19, 83, 51, 115, 11, 75, 43, 107, 27, 91, 59, 123, 7, 71, 39,
    103, 23, 87, 55, 119, 15, 79, 47, 111, 31, 95, 63, 127,
];

and then replacing all the loops of the form

    for a in 0..BLOCK_WIDTH {
        let a_rev = a.reverse_bits() >> ((BLOCK_WIDTH - 1).leading_zeros());

with

    for (a, a_rev) in FORWARD.into_iter().zip(REVERSED) { ...

and I got 25-30% gains across `log_n = 15..20`.

The cobra test still seems to pass fine. Is this a valid optimisation?

EDIT: actually, just doing `REVERSED.into_iter().enumerate()` is both more elegant and worth another couple of percent on my laptop.

I also note that declaring `REVERSED` as `static`, rather than `const`, harms performance by a few percent.

But it would be good if someone else could double-check these benches, to make sure this isn't all just artefactual on my particular setup or something...

mfreeborn avatar Nov 15 '25 15:11 mfreeborn

That seems legit to me. I'd appreciate a PR so I could directly benchmark it and inspect the generated assembly.

We might even be able to create a `const FORWARD: [usize; BLOCK_WIDTH] = ...` without hardcoding it, thanks to Rust's const evaluation, but I'm fine with either approach.
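A rough sketch of what that could look like, generating the reversed table at compile time rather than hardcoding 128 literals (this is an illustration of const evaluation, not PhastFT's actual code; the names mirror the ones in the OP):

```rust
const BLOCK_WIDTH: usize = 128;

// Built entirely at compile time: a const block can use loops and the
// const-stable `reverse_bits` / `leading_zeros` intrinsics.
const REVERSED: [usize; BLOCK_WIDTH] = {
    let shift = (BLOCK_WIDTH - 1).leading_zeros();
    let mut table = [0usize; BLOCK_WIDTH];
    let mut a = 0;
    while a < BLOCK_WIDTH {
        // Same expression as the original per-iteration computation.
        table[a] = a.reverse_bits() >> shift;
        a += 1;
    }
    table
};

fn main() {
    // Spot-check against the hardcoded table in the OP.
    assert_eq!(REVERSED[1], 64);
    assert_eq!(REVERSED[2], 32);
    // Full check against the expression from the original loop body.
    for (a, &a_rev) in REVERSED.iter().enumerate() {
        assert_eq!(a_rev, a.reverse_bits() >> (BLOCK_WIDTH - 1).leading_zeros());
    }
    println!("table verified");
}
```

This keeps the `const` declaration the OP found faster than `static`, while removing the risk of a typo in a hand-written table.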

If we're just sequentially reading the numbers from a table, we can probably vectorize the code if the compiler isn't doing that automatically already.

Shnatsel avatar Nov 15 '25 15:11 Shnatsel

I think that PR should get you started in seeing what I've done. Note my edits in the OP about getting rid of `FORWARD`.

Part of the reason I came back to look at this was to think about how it could be vectorised from the point of view of `fearless_simd`, but I get a bit stuck in all the nested loops and the conditional byte swapping.

mfreeborn avatar Nov 15 '25 15:11 mfreeborn

Simply processing the data in 128-bit chunks with something like `chunks_exact` or `as_chunks` might already be enough for the compiler to vectorize the code. Wider chunks don't really make a lot of sense, since even AVX2 only allows shuffling within 128-bit chunks and not across them.
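To illustrate the chunking pattern being suggested (a generic sketch, not PhastFT's code; the function name is hypothetical): iterating `f64` data in 2-element (128-bit) chunks gives the autovectorizer fixed-size bodies that map onto one SSE2/NEON register.

```rust
// Hypothetical example: apply a scale factor over 128-bit chunks of f64s.
// `chunks_exact_mut` yields fixed-size slices the compiler can vectorize;
// `into_remainder` handles any leftover tail elements.
fn scale_in_chunks(data: &mut [f64], factor: f64) {
    let mut chunks = data.chunks_exact_mut(2);
    for chunk in &mut chunks {
        chunk[0] *= factor;
        chunk[1] *= factor;
    }
    // Empty when the length is even (always true for power-of-two FFT sizes).
    for x in chunks.into_remainder() {
        *x *= factor;
    }
}

fn main() {
    let mut v = vec![1.0, 2.0, 3.0, 4.0, 5.0];
    scale_in_chunks(&mut v, 2.0);
    assert_eq!(v, vec![2.0, 4.0, 6.0, 8.0, 10.0]);
}
```

`as_chunks` (stabilized more recently) would let the chunk size be a const generic instead of a runtime argument, which can help the optimizer further.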

Shnatsel avatar Nov 15 '25 15:11 Shnatsel