python-zstandard icon indicating copy to clipboard operation
python-zstandard copied to clipboard

Docs: compression level is not actually negative?

Open thatch opened this issue 3 years ago • 1 comments

https://python-zstandard.readthedocs.io/en/latest/compressor.html says

level – Integer compression level. Valid values are all negative integers through 22.

But the usage elsewhere in the file, both param default and the "Typically usage is as follows" section don't use negative integers. The compressed size when using positive <n> matches the zstd commandline tool when using -<n>.

Actually passing a negative integer results in much worse compression. Is this just a docs bug, or a param handling bug as well? If you give me some of the intent here I'm happy to attempt a fix.

zstandard 0.18.0 from binary wheel on python 3.10

thatch avatar Jul 01 '22 18:07 thatch

The zstandard library supports negative compression levels. These sacrifice a lot of compression ratio in return for better (de)compression speed.

This feature was added after these bindings were initially written. So documentation may not properly reflect the existence of negative compression levels.

As for the zstd command line tool, I'm unsure how it handles negative levels, if at all. Parsing negative integers from command line arguments can definitely be brittle due to ambiguity of how to parse -!

indygreg avatar Jul 01 '22 19:07 indygreg

When you use -<n> with the zstd command line tool you get a positive compression level of n. To get a negative level of n you need to use --fast=<n>.

markopy avatar Sep 24 '22 16:09 markopy

I think the request here is that we

  1. Make the documentation match the behavior
  2. Make it harder to accidentally use the fast levels (a separate param for fast levels, or constant you have to or in to get them? This would be a breaking change, but I don't know how prevalent fast levels are)

thatch avatar Sep 25 '22 14:09 thatch

Why do you think the documentation doesn't match the behavior though? Having the level parameter go from negative levels up to positive 22 is pretty straight forward, no?

The compression ratio increases through all levels and the speed drops which is intuitive.

markopy avatar Sep 27 '22 00:09 markopy

Why do you think the documentation doesnt match the behavior

It currently says, at the url in the OP, default is 3 and the param is

Integer compression level. Valid values are all negative integers through 22.

The negative part is incorrect, and relying on that documentation will produce poor compression. Additionally it is not clear what to pass for most compression or best performance. Here's what the cpython gzip module says, which describes something similar unambiguously.

The compresslevel argument is an integer from 0 to 9 controlling the level of compression; 1 is fastest and produces the least compression, and 9 is slowest and produces the most compression. 0 is no compression. The default is 9.

thatch avatar Sep 27 '22 01:09 thatch

The zstd C API uses int for the compression level and negative values are allowed. I believe negative values with an absolute value greater than some threshold are rounded to some limit. But I believe that actual limit is not defined in the API anywhere.

Anyway, I suspect that the reason the CLI requires --fast is because there is ambiguity when parsing negative integers as process arguments. Especially since -<N> (e.g. -4) was historically accepted - before the existence of negative compression levels in the C API - as the mechanism to specify the compression level. It's just not worth the complexity to try to shoehorn negative integer parsing into the argument parser.

But at the C API level (and Python level), int was always used and there is no ambiguity about whether an int is positive or negative. So there's no need to opt into these faster levels.

I don't see a problem with the current Python API design. It mimics the C API's. If there's any room to improve it is around documentation. But as @markopy stated, the behavior of [-N, 22] being a sliding scale that trades speed for compression ratio seems reasonable to me. Maybe this should be stated explicitly in more places?

indygreg avatar Sep 27 '22 04:09 indygreg