FSSTCompressor
Ran the compress_taxi benchmark and got ~80% slower. I'm a bit surprised that the biggest culprit seems to be creating new counters in the FSST training loop. That doesn't even scale with the size of the input array; it's just a flat 2MB allocation. The zeroing of the vector seems to be the biggest problem. I think we can avoid that with a second bitmap; let me try that out.
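For context, a minimal sketch of the bitmap idea (hypothetical names, not the actual fsst change linked below): rather than zeroing the whole counter vector before each training round, keep a bitmap of which slots were touched and treat untouched slots as zero, so a reset only has to clear the much smaller bitmap.

```rust
// Hypothetical sketch: counters backed by a "touched" bitmap so that
// resetting between training rounds never zeroes the full counter vector.
struct SparseCounters {
    counts: Vec<u32>,
    touched: Vec<u64>, // one bit per counter slot
}

impl SparseCounters {
    fn new(len: usize) -> Self {
        Self {
            counts: vec![0; len],
            touched: vec![0; (len + 63) / 64],
        }
    }

    #[inline]
    fn increment(&mut self, idx: usize) {
        let (word, bit) = (idx / 64, 1u64 << (idx % 64));
        if (self.touched[word] & bit) == 0 {
            // First touch this round: overwrite whatever stale value was
            // left from a previous round instead of zeroing it up front.
            self.touched[word] |= bit;
            self.counts[idx] = 1;
        } else {
            self.counts[idx] += 1;
        }
    }

    #[inline]
    fn get(&self, idx: usize) -> u32 {
        // Slots whose bit is unset are treated as zero.
        if (self.touched[idx / 64] & (1u64 << (idx % 64))) != 0 {
            self.counts[idx]
        } else {
            0
        }
    }

    /// Reset by clearing only the bitmap, which is 64x smaller than the
    /// counter vector, rather than memset-ing the whole 2MB of counters.
    fn clear(&mut self) {
        self.touched.fill(0);
    }
}
```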
Alright, using the change in https://github.com/spiraldb/fsst/pull/21 helped a lot.
New benchmark result:
end to end - taxi/compress
time: [100.73 ms 101.72 ms 102.96 ms]
change: [-45.073% -43.470% -42.057%] (p = 0.00 < 0.05)
Performance has improved.
Which is about 10ms or ~11% slower than running without FSST.
And I think we can go even lower; ideally we'd just reuse the compressor trained on the samples to compress the full array.
Just bear in mind that the samples can be very small compared to the data, e.g. 1024 elements. I would say just retrain it.
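For illustration, a sketch of the two options being weighed here; the `Compressor::train`/`compress` signatures below are assumptions about the fsst crate's API, not verified code:

```rust
// Hypothetical sketch of the trade-off discussed above.
use fsst::Compressor;

/// Option A: reuse the symbol table trained on the sample. No second
/// training pass, but the sample may be tiny (e.g. 1024 elements) and
/// therefore unrepresentative of the full array.
fn compress_with_sampled_table(strings: &[&[u8]], sample: &[&[u8]]) -> Vec<Vec<u8>> {
    let compressor = Compressor::train(&sample.to_vec());
    strings.iter().map(|s| compressor.compress(s)).collect()
}

/// Option B: retrain on the full input before compressing. One extra pass
/// over the data, but the symbol table reflects the whole array.
fn compress_with_retrained_table(strings: &[&[u8]]) -> Vec<Vec<u8>> {
    let compressor = Compressor::train(&strings.to_vec());
    strings.iter().map(|s| compressor.compress(s)).collect()
}
```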
Ok, I've done a few things today:
- Introduced a way to reuse compressors in our SamplingCompressor code
- Kept tweaking things on the FSST side, including matching how the paper's authors sample the full input, and tried to reduce memory accesses/extraneous instructions as much as possible (in draft at https://github.com/spiraldb/fsst/pull/24)
- I'm trying to run some comparisons against the C++ code. Here's a screenshot comparing a microbenchmark for compressing the comments column from the TPC-H orders table (~1.5M rows) using Rust vs. the C++ implementation. We seem roughly on par with the C++ implementation here, and the timings were consistent across several runs; the rough shape of the Rust side is sketched below.
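(Sketch only: a Criterion-style harness for the Rust side. `load_comments` is a hypothetical stand-in for loading the TPC-H data, and the fsst `train`/`compress` calls are assumptions about the crate's API.)

```rust
// Sketch of a Criterion microbenchmark over the TPC-H comments column.
use criterion::{black_box, criterion_group, criterion_main, Criterion};

fn load_comments() -> Vec<Vec<u8>> {
    // Stand-in for loading the ~1.5M comment strings from the orders table.
    (0..1_500_000)
        .map(|i| format!("clerk requested expedited delivery for order {i}").into_bytes())
        .collect()
}

fn bench_fsst_comments(c: &mut Criterion) {
    let comments = load_comments();
    let refs: Vec<&[u8]> = comments.iter().map(|v| v.as_slice()).collect();

    c.bench_function("fsst_compress_comments", |b| {
        b.iter(|| {
            // Train + compress end to end so both implementations do the same work.
            let compressor = fsst::Compressor::train(&refs);
            for s in &refs {
                black_box(compressor.compress(s));
            }
        })
    });
}

criterion_group!(benches, bench_fsst_comments);
criterion_main!(benches);
```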
Ok, I added a new benchmark which just compresses the comments column in-memory via Vortex, and I'm seeing it take ~500ms, which is roughly 2-3x longer than doing the compression without Vortex.
I think the root of the performance difference is the chunking. Here's a comparison between running FSST over the comments column chunked as per our TPC-H loading infra (nchunks=192) and over the canonicalized (unchunked) version of the comments array:
So I'd guess there's some fixed per-invocation overhead in FSST training (probably a combination of allocations and the double tight loops over 0..511) that starts to add up when you run FSST training hundreds of times; even ~1ms of fixed setup per chunk would add ~200ms across 192 chunks, which can skew your results.
I'm curious how DuckDB and other folks deal with FSST + chunking; it seems like we might want to treat it as a special case that can do its own sampling and share a symbol table across chunks.
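A rough sketch of that idea (hypothetical, not how Vortex or DuckDB currently do it): draw the training sample across all chunks, train a single symbol table, and compress every chunk with it. Again, the `train`/`compress` signatures are assumptions.

```rust
// Hypothetical sketch of a shared symbol table across chunks: sample across
// all chunks, train once, compress each chunk with the same compressor.
use fsst::Compressor;

fn compress_chunked(chunks: &[Vec<Vec<u8>>]) -> Vec<Vec<Vec<u8>>> {
    // Take a small, evenly spread sample from every chunk instead of
    // training nchunks separate symbol tables.
    let sample: Vec<&[u8]> = chunks
        .iter()
        .flat_map(|chunk| chunk.iter().step_by(64).map(|s| s.as_slice()))
        .collect();

    let compressor = Compressor::train(&sample);

    // One training pass, then per-chunk compression with the shared table.
    chunks
        .iter()
        .map(|chunk| chunk.iter().map(|s| compressor.compress(s)).collect())
        .collect()
}
```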
I'm currently blocking this on some work in https://github.com/spiraldb/fsst/pull/24
Currently 59% of fsst_compress time is spent actually compressing; we break out of the fast loop to do push_null and data copying. Something to improve on in a follow-up.