Lazy scanning of compressed ndjson files
Problem description
The decompress feature in the Rust crate seems to apply only to CSV, not JSON. However, it is relatively common practice to store data in .jsonl.gz batch files, as JSON benefits greatly from compression.
It is far from ergonomic, but with JsonLineReader it is at least in principle possible to pass flate2::read::GzDecoder to JsonLineReader::new if it is wrapped in a type that implements MmapBytesReader, whereas LazyJsonLineReader can only be instantiated with a path.
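For reference, a minimal sketch of that workaround. Since GzDecoder implements Read but not Seek, it cannot satisfy MmapBytesReader directly, so the file is decompressed into memory and wrapped in a Cursor, which I'm assuming satisfies MmapBytesReader (as it does in the eager JsonReader examples). The read_gzipped_ndjson name is just illustrative:

```rust
use std::fs::File;
use std::io::{Cursor, Read};

use flate2::read::GzDecoder;
use polars::prelude::*;

// Illustrative helper: eager read of a gzipped ndjson file.
fn read_gzipped_ndjson(path: &str) -> PolarsResult<DataFrame> {
    // GzDecoder implements Read but not Seek, so decompress
    // the whole file into memory first.
    let mut buf = Vec::new();
    GzDecoder::new(File::open(path)?).read_to_end(&mut buf)?;

    // Cursor<Vec<u8>> can back the reader in place of a file.
    JsonLineReader::new(Cursor::new(buf)).finish()
}
```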
It might be nice, for instance, to have a more general API for instantiating LazyJsonLineReader from an Iterator<Item=&str> of JSON records. Or, even better, instantiating LazyFrame from an Iterator<Item=DataFrame> or Iterator<Item=Row>, as DataFrame allows with the rows feature. I suppose one could implement AnonymousScan, but, again, it is not very ergonomic and is not well documented.
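To illustrate, the Iterator<Item=DataFrame> case can at least be approximated today by lifting each batch with DataFrame::lazy and concatenating. Note this drains the iterator up front, so it is not a true lazy scan, and the concat signature below is the (rechunk, parallel) form; newer releases may take an args struct instead:

```rust
use polars::prelude::*;

// Sketch: build a LazyFrame from an iterator of eager DataFrames
// by concatenating their lazy counterparts. The iterator is
// consumed immediately, so nothing here is actually lazy.
fn lazy_from_batches(batches: impl Iterator<Item = DataFrame>) -> PolarsResult<LazyFrame> {
    let frames: Vec<LazyFrame> = batches.map(|df| df.lazy()).collect();
    concat(frames, true, true)
}
```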
Also, regarding IO consistency, I believe that wildcard paths for reading multiple files don't work with JSON as they do with CSV.
Similar to the CSV reader, we can't scan compressed files. But we can add compression support to the JsonLineReader, and wildcard support to LazyJsonLineReader.
As for a more generic API, we could have some default implementations of AnonymousScan for Iterator<Item=DataFrame> and Iterator<Item=Row>.
I think the Iterator<Item=&str> case would be much trickier, as the JSON implementation operates on the entire dataset rather than single records, so a new implementation would need to be added.
Thanks, that sounds good.
Why can't compressed files be scanned? Is it because they can't be chunked to read them in parallel?
A default implementation of AnonymousScan for Iterator<Item=DataFrame> seems fairly simple a priori (see the sketch below), and there's nothing stopping a user from implementing an Iterator<Item=DataFrame> that reads a compressed file in batches.
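Roughly what I have in mind. The trait surface here follows my reading of the current source and docs, so the signatures (AnonymousScanOptions in particular) are assumptions on my part and may be off:

```rust
use std::sync::Mutex;

use polars::prelude::*;

// Illustrative adapter from an iterator of batches to AnonymousScan.
struct BatchScan {
    schema: Schema,
    // `scan` takes &self, so the iterator sits behind a Mutex and is
    // consumed on its single invocation.
    batches: Mutex<Option<Box<dyn Iterator<Item = DataFrame> + Send>>>,
}

impl AnonymousScan for BatchScan {
    fn scan(&self, _scan_opts: AnonymousScanOptions) -> PolarsResult<DataFrame> {
        let mut batches = self
            .batches
            .lock()
            .unwrap()
            .take()
            .expect("scan invoked more than once");
        let mut df = batches
            .next()
            .ok_or_else(|| PolarsError::ComputeError("no batches to scan".into()))?;
        // Stack the remaining batches vertically onto the first one.
        for frame in batches {
            df.vstack_mut(&frame)?;
        }
        Ok(df)
    }

    fn schema(&self, _infer_schema_length: Option<usize>) -> PolarsResult<Schema> {
        Ok(self.schema.clone())
    }
}
```

Wiring this up, presumably through LazyFrame::anonymous_scan with the schema supplied in the scan args, is the part where I'm least sure of the intended usage.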
In terms of performance, the decompression and deserialisation overhead could be mitigated by reading from multiple batch files in parallel (as is common) and using a bounded mpsc::sync_channel (std or crossbeam) to gather them for Iterator::next, as sketched below. It is true that parallelisation doesn't seem practical in the single-file case.
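A sketch of that pattern, reusing the in-memory decompression from the earlier snippet; the channel bound and thread-per-file layout are illustrative choices, not a recommendation:

```rust
use std::fs::File;
use std::io::{Cursor, Read};
use std::sync::mpsc::sync_channel;
use std::thread;

use flate2::read::GzDecoder;
use polars::prelude::*;

// Sketch: decompress and deserialise several .jsonl.gz files on worker
// threads, gathering the DataFrames through a bounded channel so the
// consumer sees a plain Iterator<Item = PolarsResult<DataFrame>>.
fn read_batches_parallel(paths: Vec<String>) -> impl Iterator<Item = PolarsResult<DataFrame>> {
    // A bound of 4 keeps at most four decompressed batches in flight,
    // applying backpressure to the workers.
    let (tx, rx) = sync_channel(4);
    for path in paths {
        let tx = tx.clone();
        thread::spawn(move || {
            let read = || -> PolarsResult<DataFrame> {
                // Same in-memory decompression as the earlier sketch.
                let mut buf = Vec::new();
                GzDecoder::new(File::open(&path)?).read_to_end(&mut buf)?;
                JsonLineReader::new(Cursor::new(buf)).finish()
            };
            let _ = tx.send(read());
        });
    }
    drop(tx); // the iterator ends once every worker has sent its result
    rx.into_iter()
}
```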
I'd like to know generally how AnonymousScan is supposed to be used; it's not really documented. Is parallelisation handled by the consumer of AnonymousScan (i.e., inside LazyFrame::anonymous_scan), or is it supposed to be managed within AnonymousScan implementations? Looking at CoreJsonReader it seems like the latter, but I generally feel like I'm missing some implicit assumptions needed to implement the trait correctly.