
[Help Wanted, Questions] Improving Dictionary Training Process

Open neiljohari opened this issue 1 year ago • 2 comments

Hi team,

We are integrating zstd + shared compression dictionaries at Roblox for serving feature-flag payloads! We think this is a good use case because the payload looks similar over time (people add new flags, flip them, or delete them, but week over week the data is very similar) and because we control the client, so we can ship the dictionary along with it.

I've been playing around with various training parameters and set up a few harnesses to try out different parameter combinations (and found previous insight here), and a few approaches seem to work well. However, it feels a bit like blind guessing: we aren't sure this is the best we can do, and we were wondering whether the maintainers/community have insight into how we can improve our process.

Our payloads are currently ~435 KB of JSON and differ by client type, though the majority of flags are shared across clients. Examples (a snapshot sketch follows the list):

  • https://clientsettingscdn.roblox.com/v2/settings/application/GoogleAndroidApp
  • https://clientsettingscdn.roblox.com/v2/settings/application/iOSApp
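For context, a dated snapshot of those payloads can be grabbed with something like the following; the corpus/ layout and file names are just illustrative:

    # Snapshot today's payloads into a dated corpus directory (layout is illustrative)
    day=$(date +%F)
    mkdir -p "corpus/$day"
    curl -s https://clientsettingscdn.roblox.com/v2/settings/application/GoogleAndroidApp \
        -o "corpus/$day/AndroidApp.json"
    curl -s https://clientsettingscdn.roblox.com/v2/settings/application/iOSApp \
        -o "corpus/$day/iOSApp.json"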

We currently artificially expand our payload into a bunch of training files (a sketch of this step follows the list):

  • Split the payload files at fixed byte boundaries (e.g. every 1 KB, 2 KB, or 4 KB) and write the chunks out as new files.
  • Make multiple copies of those chunk files.
  • Training command: zstd --ultra -T0 -22 -f -r --train {training_dir} -o {dict_path} --maxdict {max_dict_size} --train-fastcover
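A rough sketch of the expansion and training steps above, assuming GNU split and using illustrative paths and sizes:

    # Chunk one payload at fixed byte boundaries and duplicate the chunks (sizes illustrative)
    for sz in 2048 4096; do
        outdir=training/2024-08-13/AndroidApp.json_split_${sz}
        mkdir -p "$outdir"
        split -b "$sz" -d -a 6 --additional-suffix=.json \
            AndroidApp.json "$outdir/AndroidApp.json_chunk_${sz}_"
        cp -r "$outdir" "${outdir}_copy_1"
    done

    # Train a dictionary over everything under the training directory (512 KB cap here)
    zstd --ultra -T0 -22 -f -r --train training/ -o flags.dict --maxdict=524288 --train-fastcover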

We then validate the dictionary's effectiveness over time by measuring the compression ratio it achieves on historical payloads.
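That check can be as simple as compressing a historical payload with and without the dictionary and comparing sizes (paths are illustrative):

    # Compressed size with the trained dictionary vs. without it
    zstd --ultra -22 -f -D flags.dict historical/AndroidApp.json -o /tmp/with_dict.zst
    zstd --ultra -22 -f historical/AndroidApp.json -o /tmp/no_dict.zst
    wc -c historical/AndroidApp.json /tmp/with_dict.zst /tmp/no_dict.zst  # ratio = original / compressed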

Example of our training dir structure:

├── 2024-08-13
│   ├── AndroidApp.json_split_2048
│   │   ├── AndroidApp.json_chunk_2048_000000.json
│   │   ├── AndroidApp.json_chunk_2048_000001.json
│   │   ├── AndroidApp.json_chunk_2048_000002.json
│   │   ├── AndroidApp.json_chunk_2048_000003.json
│   │   ├── AndroidApp.json_chunk_2048_000004.json
│   ├── AndroidApp.json_split_2048_copy_1
│   │   ├── AndroidApp.json_chunk_2048_000000.json
│   │   ├── AndroidApp.json_chunk_2048_000001.json
│   │   ├── AndroidApp.json_chunk_2048_000002.json
│   │   ├── AndroidApp.json_chunk_2048_000003.json
│   │   ├── AndroidApp.json_chunk_2048_000004.json
│   ├── AndroidApp.json_split_2048_copy_2
│   │   ├── AndroidApp.json_chunk_2048_000000.json
│   │   ├── AndroidApp.json_chunk_2048_000001.json
│   │   ├── AndroidApp.json_chunk_2048_000002.json
│   │   ├── AndroidApp.json_chunk_2048_000003.json
│   │   ├── AndroidApp.json_chunk_2048_000004.json
│   ├── AndroidApp.json_split_4096
│   │   ├── AndroidApp.json_chunk_4096_000000.json
│   │   ├── AndroidApp.json_chunk_4096_000001.json
│   │   ├── AndroidApp.json_chunk_4096_000002.json
│   │   ├── AndroidApp.json_chunk_4096_000003.json
│   │   ├── AndroidApp.json_chunk_4096_000004.json
│   ├── AndroidApp.json_split_4096_copy_1
│       ├── AndroidApp.json_chunk_4096_000000.json
│       ├── AndroidApp.json_chunk_4096_000001.json
│       ├── AndroidApp.json_chunk_4096_000002.json
│       ├── AndroidApp.json_chunk_4096_000003.json
│       ├── AndroidApp.json_chunk_4096_000004.json

Some things we've noticed that we don't quite understand:

  • Having more training data doesn't always help, even though we're intentionally trying to overfit. With chunk sizes of 128, 256, 512, 1024, 2048, and 4096 bytes and 10 copies of each, we got a 60x compression ratio on the payload; with only the 2048- and 4096-byte chunks and 1 copy of each, we got 90x.
  • The max dict size is really sensitive and doesn't just act as an upper bound. With the 2048- and 4096-byte chunks and 1 copy of each, a max dict size of 512 KB gets a 90x ratio, but 550 KB drops to 14x (a sweep like the sketch after this list makes this easy to reproduce).
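A sweep along these lines makes that sensitivity easy to reproduce; all paths and sizes here are placeholders:

    # Train at several --maxdict settings and record dictionary size + compressed payload size
    for maxdict in 131072 262144 524288 563200; do
        zstd -f -r --train training/ -o "dict_${maxdict}" --maxdict=${maxdict} --train-fastcover
        zstd --ultra -22 -f -D "dict_${maxdict}" payload.json -o "payload_${maxdict}.zst"
        printf '%8d %10d %10d\n' "$maxdict" "$(wc -c < "dict_${maxdict}")" "$(wc -c < "payload_${maxdict}.zst")"
    done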

Thanks in advance for your time! We're really excited to roll this out soon and would love your insights on how we can do even better.

neiljohari · Aug 15 '24 01:08