User-defined compression level
Currently libzim uses compression level 19, which is very slow.
Some tests.
File: wikinews_en_all_maxi_2022-02.zim
Original file size: 241261224
| Level | Recompressed size | Real time | User time | Sys time |
|---|---|---|---|---|
| -21 | 278293573 | 0m19.586s | 0m35.266s | 0m1.184s |
| -15 | 271330653 | 0m19.353s | 0m35.027s | 0m1.281s |
| -10 | 264416066 | 0m19.661s | 0m35.495s | 0m1.150s |
| -5 | 253267981 | 0m19.547s | 0m35.245s | 0m1.119s |
| 0 | 222953585 | 0m19.312s | 0m35.419s | 0m1.144s |
| 5 | 221996722 | 0m19.445s | 0m37.480s | 0m1.155s |
| 10 | 218533321 | 0m20.196s | 0m44.233s | 0m1.258s |
| 15 | 217358451 | 0m23.549s | 1m12.978s | 0m2.781s |
| 19 | 215803250 | 0m41.705s | 2m48.551s | 0m5.143s |
Notes:
- tests are done in ramdisk with this script:
#!/bin/bash
for c in {-21,-15,-10,-5,0,5,10,15,19}
do
echo "$c"
time zimrecreate wikinews_en_all_maxi_2022-02.zim l$c.zim -l $c 1> /dev/null
done
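To make the trade-off in the table easier to read, here is a small sketch that expresses each row relative to level 19. The `Row` struct and the helper names are made up for illustration; the numbers are copied from the table, with the "Real time" column converted to seconds.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <vector>

// One row of the benchmark table (wall-clock "Real time" converted to seconds).
struct Row {
    int level;
    std::int64_t sizeBytes;
    double realSeconds;
};

// Values copied from the table for wikinews_en_all_maxi_2022-02.zim.
inline std::vector<Row> benchmarkRows() {
    return {
        {-21, 278293573, 19.586},
        {-15, 271330653, 19.353},
        {-10, 264416066, 19.661},
        {-5, 253267981, 19.547},
        {0, 222953585, 19.312},
        {5, 221996722, 19.445},
        {10, 218533321, 20.196},
        {15, 217358451, 23.549},
        {19, 215803250, 41.705},
    };
}

// Size of `row` relative to `baseline`, in percent (positive means larger).
inline double sizeOverheadPercent(const Row& row, const Row& baseline) {
    const double base = static_cast<double>(baseline.sizeBytes);
    return 100.0 * (static_cast<double>(row.sizeBytes) - base) / base;
}

// How many times faster `row` is than `baseline` in wall-clock time.
inline double speedup(const Row& row, const Row& baseline) {
    return baseline.realSeconds / row.realSeconds;
}

// Prints every level compared with level 19 (the last row).
inline void printTradeoffs() {
    const std::vector<Row> rows = benchmarkRows();
    assert(!rows.empty());
    const Row& baseline = rows.back();
    for (const Row& r : rows) {
        std::printf("level %3d: %+6.2f%% size, %.2fx faster\n",
                    r.level, sizeOverheadPercent(r, baseline), speedup(r, baseline));
    }
}
```

Calling `printTradeoffs()` from a `main` function prints one line per level. For this file, level 0 comes out roughly 3.3% larger than level 19 while finishing about 2.2 times faster in wall-clock time.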
Proposals:
- add `configCompressionLevel` parameter to `zim::Creator` (can be negative)
- add `setCompressionLevel` parameter to `zim::Creator` (can be negative) - for non-fluent style
- add `-l <number>` (`--compression_level`) parameter to `zimrecreate` tool
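A minimal sketch of what the two proposed entry points could look like. The `sketch::Creator` class below is a hypothetical stand-in, not the real `zim::Creator`; it only models the compression level, and the initial value of 19 is taken from the description above.

```cpp
#include <cassert>

namespace sketch {

// Hypothetical stand-in for zim::Creator; only the compression level is modelled.
class Creator {
public:
    // Fluent style: returns *this so that config calls can be chained.
    Creator& configCompressionLevel(int level) {
        m_compressionLevel = level;
        return *this;
    }

    // Non-fluent style: a plain setter for code that does not chain calls.
    void setCompressionLevel(int level) { m_compressionLevel = level; }

    int compressionLevel() const { return m_compressionLevel; }

private:
    int m_compressionLevel = 19;  // level currently used, per this issue
};

// Shows both call styles; negative levels are accepted as-is.
inline void usageExample() {
    Creator fluent;
    fluent.configCompressionLevel(-5).configCompressionLevel(3);
    assert(fluent.compressionLevel() == 3);

    Creator plain;
    plain.setCompressionLevel(-21);
    assert(plain.compressionLevel() == -21);
}

}  // namespace sketch
```

The level is stored as a signed `int` because the proposal explicitly allows negative values.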
@data-man This is done on purpose because we compress only once (on a good server) and all the rest is decompression (most of the time on low-end systems). What exactly is the problem with the current libzim? You can not wait for the creation of the ZIM file?
@kelson42
> You can not wait for the creation of the ZIM file?
Yes.
And I plan to use libzim in other applications.
I think the key point is how to do that in an elephant manner, considering that we can have multiple compression algorithms?
I'm not really in favour of such a feature.... But if there is an elegant patch.... Why not!
@kelson42
> elephant manner
No, I hope it's shrew manner. :smile:
> we can have multiple compression algorithms?
WARNING: LZMA compression method is deprecated. Support for it will be dropped from libzim soon.
Which compression methods are planned?
> But if there is an elegant patch
Yeah, it's a trivial patch.
"elegant" was meant ;)
As @kelson42 said, the "classical" use case is to generate a ZIM file once and then download and use it many times, so we set the compression level to reduce the size of the ZIM file at the cost of compression time. But I agree we should not enforce this use case, especially since, as the small benchmark shows, a lower compression level only slightly increases the compressed size but greatly improves the compression time.
The algorithm used for now takes an integer to set the compression level, and other compression algorithms probably do the same (and if not, we can define our own).
The simplest would be to add an int parameter to `configCompression` and just pass it to the compression algorithm. If it is not provided, we use a default compression level (maximum compression, adapted to each compression algorithm).
It would be up to the user to pass a coherent value if they wish to force a compression level.
> The simplest would be to add an int parameter to `configCompression`
What if this parameter defaulted to some special value (INT_MIN or INT_MAX)?
kelson42 assigned data-man
Wow! Thanks! :)
> What if this parameter defaulted to some special value (INT_MIN or INT_MAX)?
It would be better to have two methods: one with a compression level and one without (calling the first one with the correct default value for the compression algorithm).
@mgautierfr so basically one with the default compression level and another one with the max level?
One without compression level (using the default(max)), one with compression level (no default, it is up to the user to specify it)
And one more question: what about lzma?
Same. The default will change depending on the compression algorithm. (This is why we cannot have a default argument and we need an intermediate method without a compression level.)
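The two-method idea can be sketched as follows. Everything in the `sketch` namespace is hypothetical: the `Compression` enum, the `defaultCompressionLevel` helper and the `Creator` class are illustrative stand-ins rather than libzim's actual API, and the default values (19 for zstd as mentioned in this issue, 9 as the highest xz preset for lzma) are assumptions.

```cpp
#include <cassert>

namespace sketch {

// Hypothetical algorithm selector, for illustration only.
enum class Compression { Lzma, Zstd };

// Assumed per-algorithm defaults (maximum compression for each algorithm).
inline int defaultCompressionLevel(Compression algo) {
    switch (algo) {
        case Compression::Lzma: return 9;   // highest xz preset (assumption)
        case Compression::Zstd: return 19;  // level mentioned in this issue
    }
    return 0;  // not reached for valid enumerators
}

// Hypothetical stand-in for zim::Creator.
class Creator {
public:
    // Without a level: forwards to the overload below with the algorithm's default.
    Creator& configCompression(Compression algo) {
        return configCompression(algo, defaultCompressionLevel(algo));
    }

    // With a level: no default, the caller is responsible for a coherent value.
    Creator& configCompression(Compression algo, int level) {
        m_algo = algo;
        m_level = level;
        return *this;
    }

    Compression algorithm() const { return m_algo; }
    int compressionLevel() const { return m_level; }

private:
    Compression m_algo = Compression::Zstd;
    int m_level = defaultCompressionLevel(Compression::Zstd);
};

// Shows that the default follows the algorithm unless a level is given.
inline void usageExample() {
    Creator c;
    c.configCompression(Compression::Lzma);
    assert(c.compressionLevel() == 9);
    c.configCompression(Compression::Zstd, -5);
    assert(c.compressionLevel() == -5);
}

}  // namespace sketch
```

A C++ default argument cannot refer to another parameter of the same function, so `int level = defaultCompressionLevel(algo)` would not compile. The forwarding overload is the usual way around that.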
I'm working on this. Soon™.
@mgautierfr What about this:
enum class CompressionLevel: int {
MINIMUM,
DEFAULT,
MAXIMUM
};
enum class LZMACompressionLevel: int {
MINIMUM = 0,
DEFAULT = 5,
MAXIMUM = 9
};
enum class ZSTDCompressionLevel: int {
MINIMUM = -21,
DEFAULT = 3,
MAXIMUM = 21
};
I've already implemented it.
I don't think we need the first enum. The values should be specific to a compression algorithm.
The other enums are useful to help users know which value to use, but they must be purely informational.
`configCompression` must take an int as its parameter.
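One way to read this, as a rough sketch: the per-algorithm enums stay available as named hints, while the configuration method itself only sees a plain `int`. The `toLevel` helper and the simplified single-argument `configCompression` below are hypothetical; the enum values are the ones proposed above.

```cpp
#include <cassert>

namespace sketch {

// Named levels as proposed in this thread; informational hints only.
enum class LZMACompressionLevel : int { MINIMUM = 0, DEFAULT = 5, MAXIMUM = 9 };
enum class ZSTDCompressionLevel : int { MINIMUM = -21, DEFAULT = 3, MAXIMUM = 21 };

// Hypothetical helper: converts either enum to the plain int the method takes.
template <typename LevelEnum>
constexpr int toLevel(LevelEnum level) {
    return static_cast<int>(level);
}

// Hypothetical stand-in for zim::Creator, reduced to the level parameter.
class Creator {
public:
    Creator& configCompression(int level) {
        m_level = level;
        return *this;
    }

    int compressionLevel() const { return m_level; }

private:
    int m_level = 19;  // level currently used, per this issue
};

// The enums are a convenience; any int can still be passed directly.
inline void usageExample() {
    Creator c;
    c.configCompression(toLevel(ZSTDCompressionLevel::DEFAULT));
    assert(c.compressionLevel() == 3);
    c.configCompression(-7);  // not one of the named values, still accepted
    assert(c.compressionLevel() == -7);
}

}  // namespace sketch
```

Because the parameter is an `int`, callers are not restricted to the three named values, which matches the requirement that the enums remain informational.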
The results are updated with the latest versions of libzim and zstd.