libzim

User-defined compression level

Open data-man opened this issue 4 years ago • 16 comments

Currently libzim uses compression level 19, which is very slow. Some tests:

File: wikinews_en_all_maxi_2022-02.zim
Original file size: 241261224 bytes

Level  Recompressed size (bytes)  Real time  User time  Sys time
-21    278293573                  0m19.586s  0m35.266s  0m1.184s
-15    271330653                  0m19.353s  0m35.027s  0m1.281s
-10    264416066                  0m19.661s  0m35.495s  0m1.150s
-5     253267981                  0m19.547s  0m35.245s  0m1.119s
0      222953585                  0m19.312s  0m35.419s  0m1.144s
5      221996722                  0m19.445s  0m37.480s  0m1.155s
10     218533321                  0m20.196s  0m44.233s  0m1.258s
15     217358451                  0m23.549s  1m12.978s  0m2.781s
19     215803250                  0m41.705s  2m48.551s  0m5.143s

Notes:

  • tests were run on a ramdisk with this script:
#!/bin/bash
# Recompress the same ZIM file at each level and time every run.
for c in {-21,-15,-10,-5,0,5,10,15,19}
do
  echo "$c"
  time zimrecreate wikinews_en_all_maxi_2022-02.zim "l$c.zim" -l "$c" 1> /dev/null
done

Proposals:

  • add a configCompressionLevel method to zim::Creator (the level can be negative)
  • add a setCompressionLevel method to zim::Creator (the level can be negative), for non-fluent style
  • add a -l <number> (--compression_level) parameter to the zimrecreate tool
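
The difference between the fluent and non-fluent styles in the first two proposals can be sketched as follows. This is only an illustration: the Creator class below is a hypothetical stand-in, not the real zim::Creator, and the initial value of 19 is simply the level quoted at the top of this issue.

```cpp
// Hypothetical stand-in for zim::Creator; not the real libzim class.
class Creator {
public:
    // Fluent style: returns *this so that config calls can be chained.
    Creator& configCompressionLevel(int level) {
        m_level = level;  // negative levels are accepted as-is
        return *this;
    }

    // Second fluent setter, present only to demonstrate chaining.
    Creator& configVerbose(bool verbose) {
        m_verbose = verbose;
        return *this;
    }

    // Non-fluent style: a plain setter that returns nothing.
    void setCompressionLevel(int level) { m_level = level; }

    int compressionLevel() const { return m_level; }
    bool verbose() const { return m_verbose; }

private:
    int m_level = 19;  // assumption: the level this issue says is used today
    bool m_verbose = false;
};
```

With this sketch, `creator.configCompressionLevel(-5).configVerbose(true);` chains two settings in one statement, while `creator.setCompressionLevel(-5);` has to stand alone.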

Apr 23 '21 08:04 data-man

@data-man This is done on purpose, because we compress only once (on a good server) and everything else is decompression (most of the time on low-end systems). What exactly is the problem with the current libzim? You cannot wait for the creation of the ZIM file?

Apr 23 '21 11:04 kelson42

@kelson42

You cannot wait for the creation of the ZIM file?

Yes.

And I plan to use libzim in other applications.

Apr 23 '21 11:04 data-man

I think the key point is how to do that in an elephant manner, considering that we can have multiple compression algorithms?

I'm not really in favour of such a feature.... But if there is an elegant patch.... Why not!

Apr 23 '21 17:04 kelson42

@kelson42

elephant manner

No, I hope it's shrew manner. :smile:

we can have multiple compression algorithms?

WARNING: LZMA compression method is deprecated. Support for it will be dropped from libzim soon.

Which compression methods are planned?

But if there is an elegant patch

Yeah, it's a trivial patch.

Apr 23 '21 17:04 data-man

"elegant" was meant ;)

Apr 26 '21 08:04 kelson42

As @kelson42 said, the "classical" use case is to generate a zim file once and then download and use it many times. So we set the compression level to reduce the size of the zim file at the cost of compression time. But I agree we should not enforce this use case, especially since, as the small benchmark shows, a lower compression level increases the compressed size a bit but improves the compression time a lot.
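
To put numbers on that trade-off, here is a small sketch that compares two rows of the benchmark table from the top of this issue. The struct and function names are made up for illustration; only the sizes and real times come from the table. For level 0 versus level 19 it works out to an output roughly 3.3% larger, produced in a bit under half the wall-clock time.

```cpp
// One row of the benchmark table (illustrative type, not part of libzim).
struct Run {
    int level;
    long long bytes;      // recompressed size
    double realSeconds;   // "Real time" column converted to seconds
};

// Percentage by which the output of `a` is larger than the output of `b`.
inline double sizeIncreasePercent(const Run& a, const Run& b) {
    return 100.0 * static_cast<double>(a.bytes - b.bytes)
                 / static_cast<double>(b.bytes);
}

// How many times faster `fast` finished compared to `slow` (wall-clock).
inline double speedup(const Run& fast, const Run& slow) {
    return slow.realSeconds / fast.realSeconds;
}

// Rows copied from the benchmark table in this issue.
const Run kLevel0{0, 222953585LL, 19.312};
const Run kLevel19{19, 215803250LL, 41.705};
```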

The algorithm currently in use takes an integer to set the compression level, and other compression algorithms probably do the same (if not, we can define our own levels).

The simplest would be to add an int parameter to configCompression and just pass it to the compression algorithm. If not provided, we will use a default compression level (maximum compression, adapted to each compression algorithm). It would be up to the user to pass a coherent value if they wish to force a compression level.

Apr 27 '21 09:04 mgautierfr

The simplest would be to add an int parameter to configCompression

What if this parameter is equal to some special value (INT_MIN or INT_MAX) by default?

kelson42 assigned data-man

Wow! Thanks! :)

Apr 30 '21 10:04 data-man

What if this parameter is equal to some special value (INT_MIN or INT_MAX) by default?

It would be better to have two methods: one with a compression level and one without (calling the first one with the correct default value for the compression algorithm).

May 03 '21 15:05 mgautierfr

@mgautierfr So basically one with the default compression level and another one with the max level?

May 03 '21 19:05 kelson42

One without a compression level (using the default, i.e. the maximum), and one with a compression level (no default; it is up to the user to specify it).

May 04 '21 08:05 mgautierfr

And one more question: what about lzma?

May 04 '21 09:05 data-man

Same. The default will change depending on the compression algorithm. (This is why we cannot have a default argument and need an intermediate method without a compression level.)
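
A minimal sketch of that two-method design, under stated assumptions: the Compression enum, the defaultLevelFor helper and this Creator class are hypothetical names for illustration, not the real libzim API. The default values are also assumptions: 19 is the zstd level quoted at the top of this issue, and 9 is the highest LZMA preset.

```cpp
#include <stdexcept>

// Hypothetical names throughout; this is not the real libzim API.
enum class Compression { Lzma, Zstd };

// The default depends on the algorithm, so it cannot be expressed as a
// single default argument value. The numbers here are assumptions.
inline int defaultLevelFor(Compression comp) {
    switch (comp) {
        case Compression::Lzma: return 9;
        case Compression::Zstd: return 19;
    }
    throw std::invalid_argument("unknown compression algorithm");
}

class Creator {
public:
    // Method with an explicit level: passed through unchanged, so it is
    // up to the caller to supply a value the algorithm accepts.
    Creator& configCompression(Compression comp, int level) {
        m_comp = comp;
        m_level = level;
        return *this;
    }

    // Intermediate method without a level: looks up the per-algorithm
    // default and forwards to the method above.
    Creator& configCompression(Compression comp) {
        return configCompression(comp, defaultLevelFor(comp));
    }

    Compression compression() const { return m_comp; }
    int level() const { return m_level; }

private:
    Compression m_comp = Compression::Zstd;
    int m_level = 19;
};
```

The lookup happens at call time, which is what a default argument such as `int level = 19` cannot do: that value would be fixed regardless of the algorithm passed alongside it.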

May 04 '21 09:05 mgautierfr

I'm working on this. Soon™.

Oct 23 '21 23:10 data-man

@mgautierfr What about this:

enum class CompressionLevel: int {
  MINIMUM,
  DEFAULT,
  MAXIMUM
};

enum class LZMACompressionLevel: int {
  MINIMUM = 0,
  DEFAULT = 5,
  MAXIMUM = 9
};

enum class ZSTDCompressionLevel: int {
  MINIMUM = -21,
  DEFAULT = 3,
  MAXIMUM = 21
};

I've already implemented it.

Oct 25 '21 16:10 data-man

I don't think we need the first enum. The values should be specific to a compression algorithm.

The other enums are useful to help users know which value to use, but they must be purely informational. configCompression must take an int as its parameter.
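
One way to read "purely informational" is sketched below: the enum only names useful values, and the configuration call still accepts any int. The enum values are the ones proposed above; toInt, CompressionConfig and this free-standing configCompression are hypothetical helpers for illustration, not libzim functions.

```cpp
// Named values from the proposal above; they document the range but do
// not restrict what the API accepts.
enum class ZSTDCompressionLevel : int {
    MINIMUM = -21,
    DEFAULT = 3,
    MAXIMUM = 21
};

// Hypothetical helper: converts a named level to the plain int the API takes.
constexpr int toInt(ZSTDCompressionLevel level) {
    return static_cast<int>(level);
}

// Hypothetical stand-in for the int-taking configuration call.
struct CompressionConfig {
    int level;
};

inline CompressionConfig configCompression(int level) {
    return CompressionConfig{level};
}
```

A caller can then write either `configCompression(toInt(ZSTDCompressionLevel::MINIMUM))` or `configCompression(7)`; both go through the same int parameter.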

Oct 26 '21 08:10 mgautierfr

The results have been updated with the latest versions of libzim and zstd.

Feb 27 '22 15:02 data-man