
Add zstd as a compression option?

Open mariusvniekerk opened this issue 6 years ago • 5 comments

Would there be interest in adding zstd to partd? At low-ish compression levels I've found it to achieve better compression than snappy while running around twice as fast.

mariusvniekerk avatar Feb 14 '19 12:02 mariusvniekerk

No objection from me.


mrocklin avatar Feb 14 '19 14:02 mrocklin

ref: https://github.com/intake/filesystem_spec/issues/69

martindurant avatar Jul 18 '19 13:07 martindurant

@mariusvniekerk , seems like you'd have to make the PR if you want this to progress :)

martindurant avatar Apr 08 '20 13:04 martindurant

Does this mean Zstandard is now supported in Dask? Also, is there some documentation on how I would use it? I have some Zstandard-compressed files filled with chunks of msgpack-serialized data, and I would like to use Dask (multiprocessing) to speed up reads or to operate on the data without reading everything into memory.

wanx4910 avatar Aug 02 '21 16:08 wanx4910

This repo is not really about data access, but about temporary storage for dask. However, I can still answer your question. Caveat: msgpack is not a file format as far as I know, but you can treat the contents of a file as a msgpack binary stream.
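To illustrate that caveat: independently packed msgpack objects concatenate into a valid binary stream, which is presumably what such a file contains. A quick self-contained check:

```python
import io
import msgpack

# two independently packed objects, concatenated as they would be in a file
data = msgpack.packb([1, 2, 3]) + msgpack.packb({"a": 1})

# Unpacker reads the stream back one object at a time
objs = list(msgpack.Unpacker(io.BytesIO(data), raw=False))
assert objs == [[1, 2, 3], {"a": 1}]
```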

You should be able to load a single file of your data like:

import fsspec
import msgpack

def readafile(fn):
    # decompress on the fly while reading
    with fsspec.open(fn, mode="rb", compression="zstd") as f:
        return msgpack.load(f)

Then, you could make a set of delayed tasks for dask to work on, one chunk per file, in parallel:

import dask
output = [dask.delayed(readafile)(fn) for fn in filenames]
results = dask.compute(*output)  # runs the reads in parallel

Now the question becomes: what do you want to do with this data?

By the way: Zstd supports internal streams and blocks which can, in theory, provide near random access. Dask/fsspec does not support this, so you cannot read a single file chunk-wise using the method above. However, msgpack does support streaming object-by-object, so you could change the function to work that way (with much lower memory usage) if you intend to output just aggregated values from each file.
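A minimal sketch of that streaming variant (the `aggregate_stream` name and the count-objects aggregation are placeholders, not partd or dask API):

```python
import msgpack

def aggregate_stream(f):
    # f is any binary file-like object, e.g. the handle returned by
    # fsspec.open(fn, mode="rb", compression="zstd")
    count = 0
    for _ in msgpack.Unpacker(f, raw=False):  # one object at a time, low memory
        count += 1
    return count
```

Each file is then reduced to a small aggregate without ever holding all of its objects in memory, and the same `dask.delayed` pattern as above applies.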

martindurant avatar Aug 02 '21 16:08 martindurant