
Return the estimated effectiveness of dictionaries in the trainer

braibant opened this issue Jan 23 '19 · 9 comments

Hi,

I am updating some code using zstd from 1.3.4 to 1.3.8, and encountered one issue when updating client code which seems related to the changes in dictionary training.

The use case is training a dictionary on some payload type in a server, sending the dictionary to a client, and then using the dictionary to encode messages of this type from server to client. In one degenerate test case, it looks like some data that could be used to "successfully" train a dictionary in 1.3.4 now yields an "Error (generic)" in 1.3.8. I suspect that the training data is too small (most messages are a few bytes). I tried looking for guidance on how to train dictionaries or what restrictions apply, and found an upper bound in https://github.com/facebook/zstd/issues/1288#issuecomment-424547876 but not much around lower bounds.

(A reproduction example I am looking at is training a dictionary on the consecutive integers from 0 to 10_000_000 represented as strings, which returns successfully in 1.3.4 but not in 1.3.8.)
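Roughly, the reproduction looks like this (a sketch, not my exact code; allocation checks omitted):

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <zdict.h>

int main(void) {
    /* Samples: the strings "0", "1", ..., "9999999" concatenated into one buffer. */
    unsigned const nbSamples = 10000000;
    size_t* sampleSizes = malloc(nbSamples * sizeof(size_t));
    char*   samples     = malloc((size_t)nbSamples * 8);   /* <= 7 digits per sample */
    size_t  off = 0;
    for (unsigned i = 0; i < nbSamples; i++) {
        char tmp[16];
        int const len = snprintf(tmp, sizeof(tmp), "%u", i);
        memcpy(samples + off, tmp, (size_t)len);
        sampleSizes[i] = (size_t)len;
        off += (size_t)len;
    }

    char*  dict     = malloc(102400);                       /* 100 KB target dictionary */
    size_t dictSize = ZDICT_trainFromBuffer(dict, 102400, samples, sampleSizes, nbSamples);
    if (ZDICT_isError(dictSize))
        printf("training failed: %s\n", ZDICT_getErrorName(dictSize));  /* what I see on 1.3.8 */
    else
        printf("trained a %zu-byte dictionary\n", dictSize);            /* what I see on 1.3.4 */
    free(dict); free(samples); free(sampleSizes);
    return 0;
}
```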

I am wondering whether it could make sense for the library to expose some form of dummy dictionary, to cater for cases where a dictionary cannot be trained programmatically; it would revert the behavior of e.g. ZSTD_compress_usingCDict to that of ZSTD_compressCCtx. In the example described above, I would like to gracefully degrade the server/client communication to not use a dictionary if the server does not manage to build one, as sketched below.
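Concretely, the fallback I have in mind looks roughly like this on the sending side (a hypothetical helper, not code from my actual stack):

```c
#include <stddef.h>
#include <zstd.h>

/* Compress one message with the trained dictionary when one is available,
 * otherwise fall back to plain zstd compression. */
static size_t compress_message(ZSTD_CCtx* cctx,
                               void* dst, size_t dstCapacity,
                               const void* src, size_t srcSize,
                               const ZSTD_CDict* cdict /* NULL when training failed */)
{
    if (cdict != NULL)
        return ZSTD_compress_usingCDict(cctx, dst, dstCapacity, src, srcSize, cdict);
    return ZSTD_compressCCtx(cctx, dst, dstCapacity, src, srcSize, 3 /* compression level */);
}
```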

braibant avatar Jan 23 '19 11:01 braibant

That's a good point, we need to look into it. We used to have issues with specific synthetic samples triggering weird corner cases, especially in the entropy stage, but they were fixed some time ago. This one seems a bit different, apparently related to the new sampling strategy.

cc @terrelln , as it seems a genuine corner case to fix.

In the meantime, as a quick workaround, note that the previous dictionary builder is still present in the library. It has to be invoked explicitly; see ZDICT_trainFromBuffer_cover().
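A minimal sketch of invoking it directly (the k and d values here are illustrative starting points, not recommended defaults; they have to be tuned to the data):

```c
#define ZDICT_STATIC_LINKING_ONLY   /* the cover trainer lives in zdict.h's experimental section */
#include <string.h>
#include <zdict.h>

/* Train with the explicit cover algorithm, with manually chosen parameters. */
size_t train_with_cover(void* dictBuffer, size_t dictCapacity,
                        const void* samples, const size_t* sampleSizes,
                        unsigned nbSamples)
{
    ZDICT_cover_params_t params;
    memset(&params, 0, sizeof(params));
    params.k = 200;   /* segment size (illustrative) */
    params.d = 8;     /* dmer size, constraint: 0 < d <= k */
    return ZDICT_trainFromBuffer_cover(dictBuffer, dictCapacity,
                                       samples, sampleSizes, nbSamples, params);
}
```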

Cyan4973 avatar Jan 23 '19 17:01 Cyan4973

Thanks for bringing up the edge case, I'll fix it shortly!

terrelln avatar Jan 23 '19 18:01 terrelln

Thanks for looking into a fix.

I will try invoking ZDICT_trainFromBuffer_cover explicitly in this case. In order to reproduce the previous behavior, I would need to find out what the default value for k was (my understanding is that nbThreads and splitPoint are only used for the _optimize functions), but for some reason it's not documented in zdict.h. Could you point me in the right direction for this? Thanks in advance.

braibant avatar Jan 28 '19 12:01 braibant

The value for k can vary quite a bit based on the data. I'd recommend using ZDICT_optimizeTrainFromBuffer_cover(), if it works on your data, so that k is automatically selected for you. It will be slower, but you can lower steps if it is taking too long.
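Something along these lines (the steps value is only an example; on success the selected parameters are written back into the struct):

```c
#define ZDICT_STATIC_LINKING_ONLY   /* experimental API, same as the plain cover trainer */
#include <string.h>
#include <zdict.h>

/* Let the library search for good k/d values itself. */
size_t train_with_cover_optimize(void* dictBuffer, size_t dictCapacity,
                                 const void* samples, const size_t* sampleSizes,
                                 unsigned nbSamples)
{
    ZDICT_cover_params_t params;
    memset(&params, 0, sizeof(params));   /* k = d = 0 : optimize over them */
    params.steps = 40;                    /* illustrative; lower it if training is too slow */
    return ZDICT_optimizeTrainFromBuffer_cover(dictBuffer, dictCapacity,
                                               samples, sampleSizes, nbSamples, &params);
}
```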

By the way, I've successfully reproduced the error, so we will have it fixed by the next release.

terrelln avatar Jan 30 '19 00:01 terrelln

The new dictionary builder only considers samples that are >= 8 bytes in its analysis, so on this data it will produce a nearly-empty dictionary. The final step fails if the dictionary is < 128 bytes.

I can update the dictionary builder to work in this case by simplifying some of the logic. But I don't expect a dictionary to be super helpful, except maybe for the dictionary's literals table. A message of <= 8 bytes is almost guaranteed to get larger after compression, especially because zstd's magic number is 4 bytes. So you're better off just not using a dictionary, since it won't save you anything, and better still not compressing at all.
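For what it's worth, a caller could pre-check the samples before attempting to train, along these lines (a sketch based on the 8-byte observation above, not an official API; the threshold is arbitrary):

```c
#include <stddef.h>

/* Heuristic pre-check before training: if (almost) none of the samples reach
 * 8 bytes, the trainer has nothing to work with, so skip it up front. */
static int worth_training(const size_t* sampleSizes, unsigned nbSamples)
{
    unsigned usable = 0;
    for (unsigned i = 0; i < nbSamples; i++)
        if (sampleSizes[i] >= 8) usable++;
    return usable >= 100;   /* arbitrary threshold, tune for your workload */
}
```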

terrelln avatar Jan 30 '19 22:01 terrelln

I'm not going to update fastcover to handle samples < 8 bytes. It will only make the performance of the algorithm (slightly) worse on non-degenerate data. Zstd doesn't work terribly well with very small dictionaries, which is one of the reasons why we have a lower bound on dictionary content size. In the future we may improve that, or allow header-only dictionaries without content.

You should expect the dictionary builder to fail on degenerate data like this. That is normal behavior. If the dictionary builder fails, then you should simply not use a dictionary.

However, if the dictionary builder fails, and there are actually a good number of samples that aren't tiny, please let us know and we'll fix the bug!

terrelln avatar Feb 01 '19 02:02 terrelln

Thanks. I understand that it's reasonable to fail on degenerate data, and I appreciate that it should not be fixed in this case.

However, as an API user, I am concerned about the case where the dictionary builder fails (for any reason), because it forces me to switch between different flavors of the zstd API. I can work around that in the particular stack I care about by adding more metadata to the data exchanged between clients and servers, but having an empty dictionary (which I assume is what you describe as header-only dictionaries) would be really helpful to streamline this use case.

braibant avatar Feb 01 '19 10:02 braibant

The zstd format provides the dictionary ID in the frame. When you don't use a dictionary, the ID is 0. So to detect if compression used a dictionary or not, you can check if ZSTD_getDictID_fromFrame() == 0.
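For instance, on the decoding side (a sketch; error handling omitted):

```c
#include <zstd.h>

/* Returns non-zero if the frame was produced with a dictionary.
 * Note that ZSTD_getDictID_fromFrame() also returns 0 when the frame header
 * is too small to read, or when the dictID was deliberately not written. */
static int frame_uses_dictionary(const void* frame, size_t frameSize)
{
    return ZSTD_getDictID_fromFrame(frame, frameSize) != 0;
}
```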

@Cyan4973 and I discussed this issue, and for now, we are going to continue to allow the dictionary builder to fail. It will fail if you provide too few samples, or if all your samples are too small. We could return a "bogus" dictionary that won't help compression at all, but I don't feel that it is a great solution.

There are two actions we can take on this topic:

  1. We'll stress in the documentation that the dictionary builder can fail. If it fails, it means the dictionary it would have built wouldn't be effective, so you should just use regular zstd compression.
  2. In the longer term, we could change the API to return a score for the "effectiveness" of the dictionary, such as the total compressed size of the samples with and without the dictionary (a user-level approximation is sketched after this list). That way we could return "bogus" dictionaries but flag them as totally ineffective. This would also help users who might not want to use a dictionary if it only provides a small benefit.
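In the meantime, that kind of score can be approximated in user code, roughly like this (a sketch; it simply compares total compressed sizes over the sample set, with error handling omitted):

```c
#include <stdlib.h>
#include <zstd.h>

/* Rough effectiveness score: ratio of the total compressed size of the
 * samples with the dictionary vs. without it. A result below 1.0 means
 * the dictionary helps. */
static double dict_effectiveness(const void* dict, size_t dictSize,
                                 const void* const* samples, const size_t* sampleSizes,
                                 unsigned nbSamples, int level)
{
    ZSTD_CCtx* cctx = ZSTD_createCCtx();
    size_t withDict = 0, withoutDict = 0;
    for (unsigned i = 0; i < nbSamples; i++) {
        size_t const bound = ZSTD_compressBound(sampleSizes[i]);
        void* dst = malloc(bound);
        withoutDict += ZSTD_compressCCtx(cctx, dst, bound,
                                         samples[i], sampleSizes[i], level);
        withDict += ZSTD_compress_usingDict(cctx, dst, bound,
                                            samples[i], sampleSizes[i],
                                            dict, dictSize, level);
        free(dst);
    }
    ZSTD_freeCCtx(cctx);
    return (double)withDict / (double)withoutDict;
}
```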

terrelln avatar Feb 01 '19 20:02 terrelln

I've improved the documentation in PR #1516.

terrelln avatar Feb 01 '19 23:02 terrelln