
Store the mutable compression data in SOA layout

Open nfrechette opened this issue 6 years ago • 2 comments

We copy the reference tracks into mutable working tracks that we can freely mutate: we need to normalize them, quantize them, etc.

Copying also removes any stride, letting us iterate faster with a smaller cache footprint, and it guarantees alignment for SIMD access to every element.

We should also swizzle the data into a Structure of Arrays (SOA) layout. This would allow operations to process 4 elements at a time instead of 1 (or 8 with AVX). Dual pumping would be easy as well, and a transition to ISPC would be trivial.
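A minimal sketch of the idea (the type and function names here are hypothetical, not ACL's actual API): in SOA form, each quaternion component lives in its own contiguous array, so the per-element loop body is identical for every lane and turning it into 4-wide SSE, 8-wide AVX, or ISPC code is a mechanical transformation. The kernel below is a scalar stand-in for that SIMD loop.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical AOS layout: one quaternion per element, components interleaved.
struct QuatAOS { float x, y, z, w; };

// Hypothetical SOA layout: each component stored contiguously so a SIMD
// register can load 4 (SSE) or 8 (AVX) values of the same component at once.
struct QuatSOA
{
	std::vector<float> x, y, z, w;

	explicit QuatSOA(const std::vector<QuatAOS>& src)
		: x(src.size()), y(src.size()), z(src.size()), w(src.size())
	{
		for (size_t i = 0; i < src.size(); ++i)
		{
			x[i] = src[i].x; y[i] = src[i].y;
			z[i] = src[i].z; w[i] = src[i].w;
		}
	}
};

// Scalar stand-in for the SIMD kernel: with SOA, every iteration performs
// the same component-wise work, which is exactly the shape SSE/AVX/ISPC want.
void normalize_soa(QuatSOA& q)
{
	for (size_t i = 0; i < q.x.size(); ++i)
	{
		const float len = std::sqrt(q.x[i] * q.x[i] + q.y[i] * q.y[i]
			+ q.z[i] * q.z[i] + q.w[i] * q.w[i]);
		const float inv_len = 1.0f / len;
		q.x[i] *= inv_len; q.y[i] *= inv_len;
		q.z[i] *= inv_len; q.w[i] *= inv_len;
	}
}
```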

Most of the pipeline maps naturally onto SOA:

- Converting the rotations is a trivial fit.
- Extracting the ranges is a trivial fit.
- Compacting constant tracks can instead be achieved by duplicating the constant value so every track retains the same width (or some other mechanism).
- Normalization is a trivial fit.
- With the quantization cache, decaying is a trivial fit, and so is the actual quantization, which can copy from the cache directly.
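The constant-track point can be sketched as follows (a hypothetical helper, not ACL code): rather than compacting constant tracks out of the buffers, which would break the uniform width the SIMD loops rely on, the constant value is splatted across every sample so all tracks can be processed by the same branch-free code path.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch: splat sample 0 of each constant track across all of
// its samples. Every track keeps the same sample count, so downstream SOA
// loops (normalization, quantization, ...) need no special case for them.
void splat_constant_tracks(std::vector<std::vector<float>>& tracks,
                           const std::vector<bool>& is_constant)
{
	for (size_t track = 0; track < tracks.size(); ++track)
	{
		if (!is_constant[track])
			continue;

		for (size_t sample = 1; sample < tracks[track].size(); ++sample)
			tracks[track][sample] = tracks[track][0];
	}
}
```

The trade-off is a bit of redundant work on the constant values in exchange for uniform-width buffers and no per-track branching inside the hot loops.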

nfrechette avatar Aug 07 '19 01:08 nfrechette

This is taking a bit longer than I thought; the changes are quite extensive, but most of them are done now and I am debugging the remaining issues. A lot of the code is now much more streamlined: fewer memory allocations are made, and everything is set up to be easily executed in parallel once multithreading support is introduced. The code is also set up for easy ISPC execution, with a comment to that effect in every location that could easily benefit.

Some parts already process 8 elements at a time, interleaved, and could process 16 with AVX.

Testing, profiling, and cleanup remains. I expect this to be done by the end of October at the latest.

nfrechette avatar Sep 13 '19 14:09 nfrechette

After a lot of work and time, the results turned out to be very underwhelming. VS2019 is a good 5% faster than VS2017 at compressing with the same code, and VS2017 is very sensitive to changes: minor edits can dramatically change inlining decisions in distant code, harming performance for no apparent reason.

The SOA branch ended up about 5% faster than the develop branch, but it includes a number of unrelated yet necessary changes. Those changes could be applied to the AOS code as well and are required for parallel compression, but both SOA and parallel compression will be put on ice for now; compression is fast enough as it is.

SOA decay turned out to be slower overall. More often than not, we early out after testing 1-2 samples, and as such we throw away a lot of work. The computation-heavy code is the error measurement and the bit rate optimization, which SOA doesn't appear to help at all; in fact, it might hurt even without SOA decay. Without SOA decay, we are forced to perform scalar loads for each vector3/4 component to reconstruct it. I tried an AVX2 gather, but on my Ryzen CPU that is massively slower (it seems Intel would do a lot better here).
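The scalar reconstruction cost mentioned above can be illustrated with a minimal sketch (names are hypothetical, not ACL's actual API): rebuilding a single vector3 from SOA storage requires one scalar load from each of three separate arrays, each potentially touching a different cache line, where an AOS layout would satisfy the same read with a single contiguous load. An AVX2 gather (e.g. `_mm256_i32gather_ps`) can fetch 8 such lanes at once, but as noted it was much slower on the author's Ryzen.

```cpp
#include <cassert>
#include <cstddef>

struct Vec3 { float x, y, z; };

// Hypothetical helper: reconstruct one AOS vector3 from SOA component
// arrays. Three scattered scalar loads, versus a single contiguous load
// if the data were stored AOS — this is the overhead paid when downstream
// code (error measurement, bit rate optimization) wants whole vectors.
Vec3 load_vec3_soa(const float* xs, const float* ys, const float* zs, size_t i)
{
	return Vec3{ xs[i], ys[i], zs[i] };
}
```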

nfrechette avatar Oct 27 '19 22:10 nfrechette