
FSSTCompressor

Open a10y opened this issue 1 year ago • 4 comments

a10y avatar Aug 20 '24 23:08 a10y

[screenshot: compress_taxi benchmark profile]

Ran the compress_taxi benchmark, got ~80% slower. I am a bit surprised that the biggest culprit seems to be creating new counters in the FSST training loop. That doesn't even scale w.r.t. the size of the input array; it's just a flat 2MB allocation. The zeroing of the vector seems to be the biggest problem. I think we can avoid that with a second bitmap; let me try that out.
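The idea, as a sketch (not the actual change, just an illustration): keep a bitmap recording which counters were written in the current training generation, so "clearing" the table means zeroing the small bitmap instead of the whole 2MB vector.

```rust
/// Sketch only: a flat counter table plus a "touched" bitmap, so each training
/// generation resets a small bitmap instead of zeroing the full counter vector.
struct PairCounters {
    counts: Vec<u16>,  // allocated once, never wholesale-zeroed again
    touched: Vec<u64>, // one bit per counter: written this generation?
}

impl PairCounters {
    fn new(n: usize) -> Self {
        Self {
            counts: vec![0; n],
            touched: vec![0; (n + 63) / 64],
        }
    }

    fn increment(&mut self, idx: usize) {
        let (word, bit) = (idx / 64, idx % 64);
        if self.touched[word] & (1u64 << bit) == 0 {
            self.touched[word] |= 1u64 << bit;
            self.counts[idx] = 0; // lazy reset on first touch
        }
        self.counts[idx] = self.counts[idx].saturating_add(1);
    }

    fn get(&self, idx: usize) -> u16 {
        let (word, bit) = (idx / 64, idx % 64);
        if self.touched[word] & (1u64 << bit) != 0 {
            self.counts[idx]
        } else {
            0 // untouched this generation reads as zero
        }
    }

    /// Invalidates every counter while touching far less memory than a full zero.
    fn clear(&mut self) {
        self.touched.fill(0);
    }
}
```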

a10y avatar Aug 21 '24 14:08 a10y

Alright, using the change in https://github.com/spiraldb/fsst/pull/21 helped a lot.

New benchmark result:

end to end - taxi/compress
                        time:   [100.73 ms 101.72 ms 102.96 ms]
                        change: [-45.073% -43.470% -42.057%] (p = 0.00 < 0.05)
                        Performance has improved.

That's about 10ms, or ~11%, slower than running without FSST.

a10y avatar Aug 21 '24 15:08 a10y

And I think we can go even lower; ideally we'd reuse the compressor trained on the samples to compress the full array.
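Roughly this shape, as a sketch; the `train`/`compress` parameters here are stand-ins for whatever the real fsst API looks like, not actual signatures:

```rust
/// Sketch of the reuse idea: train a compressor on the sample only, then
/// compress the full array with it instead of retraining on everything.
fn sample_then_compress<C>(
    full: &[&[u8]],
    sample: &[&[u8]],
    train: impl Fn(&[&[u8]]) -> C,
    compress: impl Fn(&C, &[u8]) -> Vec<u8>,
) -> Vec<Vec<u8>> {
    let compressor = train(sample); // one training pass, over the sample only
    full.iter().map(|s| compress(&compressor, s)).collect() // reuse everywhere
}
```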

a10y avatar Aug 21 '24 15:08 a10y

Just bear in mind that the samples can be very small compared to the data, e.g. 1024 elements. I would say just retrain it.

robert3005 avatar Aug 21 '24 16:08 robert3005

Ok, I've done a few things today:

  1. Introduced a way to reuse compressors in our sampling compressor code.
  2. Kept tweaking things on the FSST side, including matching how the paper authors sample the full input (see the sketch after the screenshot below), and tried to reduce memory accesses/extraneous instructions as much as possible (in draft at https://github.com/spiraldb/fsst/pull/24).
  3. Ran some comparisons against the C++ code. Here's a screenshot comparing a microbenchmark for compressing the `comments` column from the TPC-H orders table (1.5M rows) using Rust vs. the C++ implementation. We seem roughly on par with the C++ implementation here; the timings were consistent across several runs.

[screenshot: Rust vs. C++ FSST microbenchmark timings]
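On point 2, the paper's sampling (as I read it) builds a fixed-size sample out of small chunks pulled from across the input, rather than training over the entire array. A rough sketch; the 16KB/512B sizes and the LCG are illustrative assumptions, not the paper's exact constants:

```rust
/// Rough sketch of paper-style sampling: gather a ~16KB sample made of
/// 512-byte chunks taken at pseudo-random offsets spread across the input.
const SAMPLE_TARGET: usize = 16 * 1024; // assumed total sample size
const SAMPLE_CHUNK: usize = 512;        // assumed per-chunk size

fn make_sample(input: &[u8]) -> Vec<u8> {
    if input.len() <= SAMPLE_TARGET {
        return input.to_vec();
    }
    let n_chunks = SAMPLE_TARGET / SAMPLE_CHUNK;
    let stride = input.len() / n_chunks; // >= SAMPLE_CHUNK given the guard above
    let mut sample = Vec::with_capacity(SAMPLE_TARGET);
    let mut state: u64 = 0x9E37_79B9_7F4A_7C15; // fixed seed: deterministic sampling
    for i in 0..n_chunks {
        // simple LCG so the sketch needs no external crates
        state = state
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        let jitter = (state as usize) % stride.saturating_sub(SAMPLE_CHUNK).max(1);
        let start = (i * stride + jitter).min(input.len() - SAMPLE_CHUNK);
        sample.extend_from_slice(&input[start..start + SAMPLE_CHUNK]);
    }
    sample
}
```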

a10y avatar Aug 23 '24 03:08 a10y

Ok, I added a new benchmark that just compresses the comments column in-memory via Vortex, and I'm seeing it take ~500ms, which is roughly 2-3x longer than doing the compression without Vortex.

I think the root of the performance difference is the chunking. Here's a comparison between running FSST over the comments column chunked as per our TPC loading infra (nchunks=192) and the canonicalized version of the comments array, which is not chunked:

[screenshot: FSST timings, chunked (nchunks=192) vs. canonicalized comments column]

So there's evidently some fixed-size overhead in FSST training (probably a combination of allocations and the tight double loops over codes 0–511) that starts to add up when you run FSST hundreds of times, and it can skew your results.

I'm curious how DuckDB and other folks deal with FSST + chunking; it seems like we might want to treat FSST as a special case that does its own sampling and shares a symbol table across chunks.
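As a sketch of what that could look like (the `train`/`compress` parameters are stand-ins, not the real API): sample a little from each chunk, train one symbol table, and reuse it for every chunk, so the fixed training cost is paid once instead of nchunks times.

```rust
/// Sketch: one shared symbol table across chunks. Train on a sample drawn
/// from every chunk, then compress each chunk with the same compressor.
fn compress_with_shared_table<C>(
    chunks: &[&[u8]],
    train: impl Fn(&[u8]) -> C,
    compress: impl Fn(&C, &[u8]) -> Vec<u8>,
) -> Vec<Vec<u8>> {
    // Pull a small slice from each chunk so the table sees the whole column.
    let mut sample = Vec::new();
    for chunk in chunks {
        let take = chunk.len().min(1024); // assumption: 1KB per chunk is enough
        sample.extend_from_slice(&chunk[..take]);
    }
    let compressor = train(&sample); // one training pass, not one per chunk
    chunks.iter().map(|c| compress(&compressor, c)).collect()
}
```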

a10y avatar Aug 23 '24 04:08 a10y

I'm currently blocking this on some work in https://github.com/spiraldb/fsst/pull/24

a10y avatar Aug 23 '24 17:08 a10y

[screenshot: fsst_compress profile]

Currently only 59% of fsst_compress time is spent actually compressing; we break out of the fast loop to do push_null and data copying. Something to improve in a follow-up.
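For illustration, the loop shape is roughly this (a sketch, not the actual fsst_compress code): the per-value validity bookkeeping and copying around the compression call is where the remaining ~41% goes, and compressing into one shared, reused output buffer is one way to shrink the copying side.

```rust
/// Sketch of the per-value overhead: besides the compression call itself,
/// every value pays for validity bookkeeping and for landing its bytes in
/// the output builder. `compress_into` is a stand-in that appends compressed
/// bytes to a shared buffer, amortizing allocation across the whole column.
fn compress_column<C>(
    compressor: &C,
    values: &[Option<&[u8]>],
    compress_into: impl Fn(&C, &[u8], &mut Vec<u8>),
) -> (Vec<u8>, Vec<u32>, Vec<bool>) {
    let mut bytes = Vec::new(); // one shared buffer, grown across all values
    let mut offsets = vec![0u32];
    let mut validity = Vec::with_capacity(values.len());
    for value in values {
        match value {
            Some(v) => {
                compress_into(compressor, v, &mut bytes); // the "actually compressing" 59%
                validity.push(true);
            }
            None => validity.push(false), // the push_null path the profile shows
        }
        offsets.push(bytes.len() as u32);
    }
    (bytes, offsets, validity)
}
```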

a10y avatar Sep 03 '24 19:09 a10y