partd
Add zstd as a compression option?
Would there be interest in adding zstd to partd? At lowish compression levels I've found it to have better compression than snappy while being around twice as fast.
No objection from me.
ref: https://github.com/intake/filesystem_spec/issues/69
@mariusvniekerk , seems like you'd have to make the PR if you want this to progress :)
Does this mean Zstandard is now supported in Dask? Also, is there any documentation on how I would use it? I have some zstandard-compressed files filled with chunks of msgpack-serialized data, and I would like to use Dask (multiprocessing) to speed up the read, or to operate on the data without reading everything into memory.
This repo is not really about data access, but about temporary storage for dask. However, I can still answer your question. Caveat: as far as I know, msgpack is not a file format, but you can treat the contents of a file as a msgpack binary stream.
You should be able to load a single file of your data like
import fsspec
import msgpack

def readafile(fn):
    # fsspec decompresses transparently when compression="zstd" is given
    with fsspec.open(fn, mode="rb", compression="zstd") as f:
        return msgpack.load(f)
Then, you could make a set of delayed functions for dask to work on, one chunk per file, in parallel:

import dask

# One lazy task per file; nothing is read until compute is called
output = [dask.delayed(readafile)(fn) for fn in filenames]
results = dask.compute(*output)
Now the question becomes: what do you want to do with this data?
By the way: Zstd supports internal streams and blocks which can, in theory, provide near random access. Dask/fsspec does not support this, so you cannot read a single file chunk-wise using the method above. However, msgpack does support streaming object-by-object, so you could change the function to work that way (with much lower memory usage) if you intend to output just aggregated values from each file.
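A sketch of that streaming variant; the helper name `count_objects` and the simple object count as the per-file aggregate are both hypothetical placeholders:

```python
import fsspec
import msgpack

def count_objects(f):
    # msgpack.Unpacker yields one deserialized object at a time, so only
    # the current object and the running total are held in memory
    return sum(1 for _ in msgpack.Unpacker(f, raw=False))

def aggregate_file(fn):
    # Open with transparent zstd decompression, then stream the contents;
    # swap count_objects for whatever per-file reduction you actually need
    with fsspec.open(fn, mode="rb", compression="zstd") as f:
        return count_objects(f)
```

Each `dask.delayed(aggregate_file)(fn)` task would then return just a small aggregate rather than a file's worth of objects.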