Allow ID placeholders in dictionary filenames to automatically select the correct dictionary

Open TomasRiker opened this issue 2 years ago • 0 comments

Imagine you have a large set of cached small items (e.g. results of user queries) that are compressed using zstd and its dictionary feature. As the cached data evolves over time, so may the dictionaries. Some files may still use an old dictionary, others may use a newer one that has been built with more recent data for better compression. When we want to decompress an item from this cache, we have to look at each file, figure out the dictionary it uses and then pass that dictionary to zstd.

To make such things easier, I'm proposing to support dictionary ID placeholders when specifying a dictionary filename (during training and decompression). This would allow zstd to automatically use the correct dictionary when multiple dictionaries are in use.

It would work like this:

At training (not so useful, but read on): zstd --train -o dict_%x --dictID=100000 FILE ...

The placeholder %x in the dictionary filename would be replaced with the dictionary ID in hexadecimal. Another possible placeholder could be %d for decimal.
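To make the substitution concrete, here is a minimal Python sketch of how the placeholders could be expanded. This is not existing zstd behavior. The function name `expand_dict_name` is hypothetical, and the choice of lowercase hexadecimal without zero padding is an assumption, since the proposal does not fix those details.

```python
import re

def expand_dict_name(template: str, dict_id: int) -> str:
    """Replace %x (hex) and %d (decimal) in a filename template.

    Hypothetical helper: lowercase, unpadded hex is an assumption.
    """
    # format(n, "x") gives lowercase hex, format(n, "d") gives decimal.
    return re.sub(r"%([xd])", lambda m: format(dict_id, m.group(1)), template)

print(expand_dict_name("dict_%x", 100000))  # dict_186a0
print(expand_dict_name("dict_%d", 100000))  # dict_100000
```

With the training command above, the resulting dictionary file would then be named dict_186a0.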

You could then have multiple dictionary files sitting next to each other, each with its ID encoded in its filename.

The real advantage is at decompression: zstd -d -D dict_%x FILE ...

Now we don't have to worry about files using different dictionaries; we can even decompress all of them in a single invocation. For each file to be decompressed, zstd would automatically use the correct dictionary file based on the dictionary ID stored in that file's frame header.
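Until something like this exists in the CLI, the lookup can be done by hand. The sketch below reads the dictionary ID from the start of a compressed file, following the zstd frame format (magic number, frame header descriptor, optional window descriptor, then a 0-, 1-, 2- or 4-byte little-endian dictionary ID), and derives the dictionary filename from a `%x` template. The function names are hypothetical, the unpadded lowercase hex is an assumption, and skippable frames are not handled.

```python
import struct

ZSTD_MAGIC = 0xFD2FB528  # magic number of a regular zstd frame

def frame_dict_id(header: bytes) -> int:
    """Return the dictionary ID from the first bytes of a zstd frame.

    Returns 0 when the frame does not declare a dictionary ID.
    """
    if len(header) < 5 or struct.unpack_from("<I", header, 0)[0] != ZSTD_MAGIC:
        raise ValueError("not a zstd frame")
    descriptor = header[4]
    single_segment = (descriptor >> 5) & 1       # bit 5: Single_Segment_flag
    did_size = (0, 1, 2, 4)[descriptor & 3]      # bits 1-0: Dictionary_ID_flag
    # The window descriptor byte is present only without Single_Segment_flag.
    offset = 5 + (0 if single_segment else 1)
    if len(header) < offset + did_size:
        raise ValueError("truncated frame header")
    return int.from_bytes(header[offset:offset + did_size], "little")

def dictionary_for(header: bytes, template: str) -> str:
    """Hypothetical helper: fill %x in the template with the frame's ID."""
    return template.replace("%x", format(frame_dict_id(header), "x"))
```

Passing the first 10 bytes of a file is enough, e.g. `dictionary_for(open(path, "rb").read(10), "dict_%x")`, and the result can then be handed to `zstd -d -D`.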

Jun 22 '23 05:06 TomasRiker