Improve decompression startup time
Hi there, thanks for open sourcing this innovative compression algorithm. I'm evaluating this for use at pganalyze. While the compression ratio is great, when many small files need to be decompressed, we're seeing that just spinning up the decompressor takes ~60% of the runtime.
This profile was captured on a dataset of 100 thousand samples of 10 discrete types of data, which required 1 million decompress calls. Note: to get a more detailed profile I set #[inline(never)] on LatentPageDecompressor::new, and defined a separate function to generate latent_decompressors for PageDecompressorInner so it showed up in the trace. Also this is just one of the call paths in the profile (presumably one per Rust data type); the total runtime is ~4.5 seconds.
Since there are a small set of data types we're decompressing, would it be theoretically possible to cache the decompressor and reuse it for subsequent decompressions of that data type? Or is the only option to aggregate the data into fewer compressed files ahead of time?
How many numbers do you have per page? We recommend at least 1k numbers per page, at which point I wouldn't expect this to be a problem. Is there a good reason to have 100k x 10 pages here?
Pco writes metadata per chunk that is required to decode each page within the chunk. If you really need this many pages for some reason, hopefully you at least have a manageable number of chunks, and we can see if it's possible to initialize each PageDecompressor faster, given a ChunkDecompressor.
In this dataset there's an average of 214 numbers per page (assuming page = file, the Vec<u8> being decompressed). These relatively small files are from timeseries data we collect. I'll try aggregating them into a smaller number of files to optimize decompression times.
In the readme we have recommendations for the count of numbers per page and chunk - 214 is way too small for good+fast compression. Larger chunks should fix this, but let us know if they somehow don't.
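For concreteness, here's a rough sketch of that aggregation (the offset bookkeeping, function names, and i64 type are just placeholders, and this assumes the standalone simpler_compress/simple_decompress helpers): concatenate many small series of the same type into one buffer, compress that once, and slice series back out after decompressing.

```rust
use pco::standalone::{simple_decompress, simpler_compress};
use pco::errors::PcoResult;
use pco::DEFAULT_COMPRESSION_LEVEL;

// Concatenate many small series of the same type into one buffer and record
// where each series starts, so one compress/decompress call covers thousands
// of numbers instead of ~214.
fn compress_aggregated(serieses: &[Vec<i64>]) -> PcoResult<(Vec<u8>, Vec<usize>)> {
  let mut all = Vec::new();
  let mut starts = Vec::with_capacity(serieses.len());
  for series in serieses {
    starts.push(all.len());
    all.extend_from_slice(series);
  }
  Ok((simpler_compress(&all, DEFAULT_COMPRESSION_LEVEL)?, starts))
}

// Decompress the whole buffer, then slice out the i-th series.
fn read_series(compressed: &[u8], starts: &[usize], i: usize) -> PcoResult<Vec<i64>> {
  let all = simple_decompress::<i64>(compressed)?;
  let end = starts.get(i + 1).copied().unwrap_or(all.len());
  Ok(all[starts[i]..end].to_vec())
}
```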
I'm seeing around a 3.5x compression ratio versus standard compression methods. Compression may be similarly slow, but only decompression speed matters for this use case.
FWIW I think that part of the readme could be reworded for clarity. Users only care about files, but files are mentioned last in the sentence, and not at all in the table below it. It would help to explicitly state what the minimum file size should be.
We could reverse the sentence
> Pco has a hierarchy of multiple batches per page; multiple pages per chunk; and multiple chunks per file.
to put file first. However, I personally think it's fine as it is. On the other hand I agree it might make sense to add a file-row to the table below that sentence.
It would be nice to improve startup time. I assume it hasn't been the primary focus of optimization so far, although I'm sure Martin has thought about it. Not sure if you've already identified possible improvements to the code @seanlinsley. If so, you're more than welcome to present them to us on Discord (or here if you prefer) :-)
This shouldn't be closed yet. The readme should at least be updated so you don't get more issues like this from confused users. But I'd argue this is a core limitation of the library that needs to be improved before it can be adopted by larger projects, like databases.
One optimization I have found is to not convert from PerLatentVarBuilder to PerLatentVar, instead using PerLatentVarBuilder directly. The conversion between the types is surprisingly slow. Time permitting, I was planning to write a benchmark for this problem case, and then contribute that change.
Ok, I'm reopening and will wait to see what comes of your investigation. There might indeed be a way to avoid copying when converting a PerLatentVarBuilder into PerLatentVar.
I checked the wording, and I'm not convinced anything needs to be changed. I also think users should be encouraged to think in chunks instead of files, since that's the unit of compression (and a similar principle holds for other codecs; like what even would be a "zstd file"? zstd is usually only wrapped into other formats, so it mostly only makes sense to talk about a zstd frame).
@seanlinsley any update on this?
Sorry for the delay, I was busy getting ready to open source our approach to using pco with Postgres: https://github.com/pganalyze/pco_store
The bucket_size.rs benchmark shows the improved compression size and read/write time when grouping data into a smaller number of files. I still intend to put together synthetic benchmarks for the upstream repo and work on code changes to improve performance for small batch sizes, though it's not strictly necessary for our use case.
Cool to see the benchmarks - I'm not sure what some of the numbers mean, but looks like Pco did pretty well.
@mwlon what's the purpose of the PerLatentVarBuilder, can't we just use PerLatentVar everywhere?
Otherwise, how about this to avoid copying:
```rust
impl<T> From<PerLatentVarBuilder<T>> for PerLatentVar<T> {
  fn from(mut value: PerLatentVarBuilder<T>) -> Self {
    PerLatentVar {
      delta: value.delta.take(),
      primary: value.primary.take().unwrap(),
      secondary: value.secondary.take(),
    }
  }
}
```
Ok, after looking further I'm struggling to understand why the conversion would copy anything in the first place...
Edit:
Please correct me if I'm wrong. After some more reading, my understanding is that each latent is copied from PerLatentVarBuilder to PerLatentVar. Since the latents are just unsigned integers, moving and copying cost the same, so my suggestion above won't change much.
I could imagine it copying if either T contains no heap allocations/references (so the memory layouts of T and Option<T> differ) or LLVM just isn't clever enough to optimize away the copy. It may be reasonable to make PerLatentVar.primary an Option and kill the builder.
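To illustrate the layout point with a standalone snippet (not pco code; sizes assume a 64-bit target):

```rust
use std::mem::size_of;

fn main() {
  // No niche in a plain integer array, so Option needs room for a
  // discriminant and the layouts of T and Option<T> differ.
  assert_eq!(size_of::<[u64; 4]>(), 32);
  assert_eq!(size_of::<Option<[u64; 4]>>(), 40);

  // A Box is a non-null pointer, so Option<Box<T>> reuses that niche and
  // wrapping in Option doesn't change the layout at all.
  assert_eq!(size_of::<Box<[u64; 4]>>(), 8);
  assert_eq!(size_of::<Option<Box<[u64; 4]>>>(), 8);
}
```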
That said, probably worth confirming this is actually a performance issue. Maybe benchmark on a file with 100 ints and see what stands out?
Though if we made PerLatentVar.primary optional, it would be an API breaking change and a bit of a lie, which would be unfortunate.
Another idea, if this really is a performance issue, is to kill builder and use unsafe, uninitialized data for primary in the few cases we need it.
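Roughly something like this (hypothetical types, just to show the shape of the unsafe option):

```rust
use std::mem::MaybeUninit;

// Hypothetical stand-in for the real per-latent state.
struct LatentState {
  lowers: [u64; 256],
}

struct PerLatentVar {
  delta: Option<LatentState>,
  // Left uninitialized until it's actually written; every read then has to
  // go through an unsafe assume_init-style accessor.
  primary: MaybeUninit<LatentState>,
  secondary: Option<LatentState>,
}
```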
Here's a VTune breakdown on my Windows PC on a 100 int32 dataset.
```
bench -i data/binary/ --input-format binary --dtypes i32 --iters 100000000 --codecs pco --datasets interl0_small_100 --no-compress
```
Nice - you can also use --limit 100 instead of creating a new dataset
So I still think we should consider dropping PerLatentVarBuilder so we avoid unnecessary copying, though it's not clear to me how much we'd gain from that alone. However, it turns out using u16 instead of u32 for the Bitlen gives a 10-25% time reduction on my system by itself. Whether this is the way to go is unclear, but it shows there is a lot to be gained by reducing struct sizes, heap allocations, and copying.
Edit: Just putting a Box around the State is even more effective.
Performance on large pages is definitely the priority, so I don't think we should compromise it by switching bitlens to u16. This does demonstrate that copying+large stack allocations cost us for small pages, though.
The question now is what the best way to avoid copying in PerLatentVarBuilder->PerLatentVar is.
Agreed. The u16 Bitlen route does not seem favorable right now.
It seems the only effective way we can avoid copying the large structs is by moving state to the heap. Otherwise, it looks to me like we'd need a larger rewrite of decompression structs and pipeline. The crux of the problem is the copying of the State struct, which is by far the largest struct. This struct is embedded in a chain of parent structs like LatentPageDecompressor, PerLatentVarBuilder, PerLatentVar, PageDecompressorInner, PageDecompressor, etc. For each of these structs, there is a copy of the child struct into the parent struct inside each new method. I'm not sure, but the Option might introduce even more copies. Therefore, I don't think we'll gain much by just avoiding the PerLatentVarBuilder -> PerLatentVar copy.
Just boxing State effectively reduces the data being copied each time from up to ~15 KB to <200 B and decreases decompression time by almost 50% for a 100-element i32 file. However, changing to Box<State> does seem to hurt performance a bit for the full test files, which we really don't want. I'm not sure if there's a way we can move State (or perhaps just the arrays) to the heap while avoiding a performance hit for larger files.
There might be a clever way to reorder the initialization chain so that the inner structs are initialized inside the outer structs. What do you think?
To clarify the heap allocation thing I've been attempting to describe here and in Discord (henceforth "approach 1"): we would heap allocate the large arrays (the ones with the memory arena TODOs). We would still stack allocate all the actually small types, so we wouldn't need to chase pointers to access primitive attributes of all these types.
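In other words, something shaped like this (field names and sizes are made up; the point is only which parts move behind a pointer):

```rust
// Made-up field names and sizes, just to show the shape of approach 1:
// the large arrays move behind one heap allocation, while small scalars
// stay inline and cheap to access.
struct Scratch {
  lowers: [u64; 4096],
  state_idxs: [u32; 4096],
}

struct LatentState {
  offset_bits: u32, // small fields stay on the stack
  needs_gcd: bool,
  scratch: Box<Scratch>, // copying LatentState now copies a pointer, not ~48 KB
}
```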
I think Rust+LLVM are smart enough to avoid copying if we incrementally initialize structs like let foo = Foo{...}; let bar = Bar{foo, ...}, so I wouldn't worry about that.
If this is too confusing, want to video chat?
@seanlinsley, we will be merging #286 soon, which should improve decompression time a lot for very small datasets. We achieved the improved performance by moving some structures to the heap.
When benching with --limit 100, the bottom-up views are now dominated by heap allocation and free calls.
Decompression:
Compression:
To further improve performance in scenarios with many small datasets, we could explore arenas as a way to avoid repeated heap allocations and frees.
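Very roughly, the arena idea looks like this (a toy sketch with made-up names, not the real decompressor API):

```rust
// Toy sketch: the caller owns a reusable scratch buffer and lends it to each
// decompression call, so the per-page allocations and frees disappear after
// the first call.
#[derive(Default)]
struct Scratch {
  latents: Vec<u64>,
}

impl Scratch {
  fn latents(&mut self, len: usize) -> &mut [u64] {
    if self.latents.len() < len {
      self.latents.resize(len, 0);
    }
    &mut self.latents[..len]
  }
}

fn decompress_page(scratch: &mut Scratch, n: usize) {
  let latents = scratch.latents(n);
  // ... decode into `latents` instead of allocating a fresh buffer ...
  for x in latents.iter_mut() {
    *x = 0;
  }
}
```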
Following up to confirm that your work has helped, thanks! Upgrading the pco_store benchmarks from pco 0.4.2 to 0.4.6 sped up decompression by ~2x for the "10 minute bucket" case which stores just 214 numbers per page: https://github.com/pganalyze/pco_store/pull/5