
Any plans for updating libzpaq?

Open twekkel opened this issue 4 years ago • 11 comments

Methods 4 and 5(11) use new algorithms and much stronger (and slower) compression than what is currently available in lrzip.

twekkel avatar Nov 08 '19 19:11 twekkel

I updated libzpaq to 7.15. See #146

It does have 5 compression levels, however; the old levels 1-3 correspond roughly to levels 3, 4, and 5. Nonetheless, there are some speed improvements. However, @ckolivas, changes to the Compressor::compress method have disabled updating progress on compress. It uses Reader::read and not Reader::get. Decompress still works fine. I created a new function show_progress, but I can't get it to be called from Compressor::compress.

Please test and benchmark. Thank you.

// Compress n bytes, or to EOF if n < 0
bool Compressor::compress(int n) {
  if (state==SEG1)
    postProcess();
  assert(state==SEG2);

  const int BUFSIZE=1<<14;
  char buf[BUFSIZE];  // input buffer
  while (n) {
    int nbuf=BUFSIZE;  // bytes read into buf
    if (n>=0 && n<nbuf) nbuf=n;
    int nr=in->read(buf, nbuf);
    if (nr<0 || nr>BUFSIZE || nr>nbuf) error("invalid read size");
    if (nr<=0) return false;
    if (n>=0) n-=nr;
    for (int i=0; i<nr; ++i) {
      int ch=U8(buf[i]);
      /* TODO: need to show progress.
       * Version 7.15 uses Reader::read instead of Reader::get,
       * so progress must be reported in this way:
       * if (!(i % 128))
       *     show_progress(i);
       */
      enc.compress(ch);
      if (verify) {
        if (pz.hend) pz.run(ch);
        else sha1.put(ch);
      }
    }
  }
  return true;
}
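One possible workaround, without patching Compressor::compress itself, is to report progress from the Reader side, since read() is still called for every input buffer. A minimal sketch, assuming only libzpaq's Reader interface; the ProgressReader wrapper, MemReader, and on_progress callback are hypothetical illustrations, not part of libzpaq:

```cpp
#include <cassert>
#include <cstring>
#include <functional>

// Minimal stand-in for libzpaq::Reader; in real code, include libzpaq.h
// and derive from libzpaq::Reader instead.
struct Reader {
    virtual int read(char *buf, int n) = 0;  // returns bytes read, 0 at EOF
    virtual ~Reader() {}
};

// Hypothetical wrapper: forwards reads to an underlying Reader and reports
// the running byte count, so progress can be shown even though
// Compressor::compress no longer calls Reader::get per byte.
struct ProgressReader : Reader {
    Reader &in;
    long long total = 0;
    std::function<void(long long)> on_progress;

    ProgressReader(Reader &r, std::function<void(long long)> cb)
        : in(r), on_progress(cb) {}

    int read(char *buf, int n) override {
        int nr = in.read(buf, n);
        if (nr > 0) {
            total += nr;
            on_progress(total);  // e.g. call lrzip's show_progress here
        }
        return nr;
    }
};

// Simple in-memory Reader used only to exercise the wrapper.
struct MemReader : Reader {
    const char *data;
    int len, pos = 0;
    MemReader(const char *d, int l) : data(d), len(l) {}
    int read(char *buf, int n) override {
        int k = (len - pos < n) ? len - pos : n;
        std::memcpy(buf, data + pos, k);
        pos += k;
        return k;
    }
};
```

The compressor would then be handed the ProgressReader instead of the raw input Reader, leaving Compressor::compress untouched.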

pete4abw avatar Apr 05 '20 20:04 pete4abw

I only did a quick test on the highly redundant "fp.log" Apache log file and I do not see a big improvement... the resulting file is much larger compared to zpaq or mzpq.

$ ./lrzip -zf -L9 fp.log
Output filename is: fp.log.lrz
fp.log - Compression Ratio: 25.120. Average Compression Speed: 0.826MB/s.
Total time: 00:00:23.25

$ time mzpq -9 fp.log

real    1m2,051s
user    1m1,959s
sys     0m0,080s

$ zpaq a fp.log.zpaq fp.log -m511
zpaq v7.15 journaling archiver, compiled Aug 24 2018
Creating fp.log.zpaq at offset 0 + 0
Adding 20.617071 MB in 1 files -method 511 -threads 8 at 2020-04-06 20:51:21.
100.00% 0:00:00 + fp.log 20617071
100.00% 0:00:00 [1..196] 20617863 -method 511,209,1
1 +added, 0 -removed.

0.000000 + (20.617071 -> 20.617071 -> 0.393798) = 0.393798 MB
70.406 seconds (all OK)

$ ls -l fp.log*
20617071 apr  6 22:40 fp.log
  820728 apr  6 22:40 fp.log.lrz
  393798 apr  6 22:52 fp.log.zpaq
  399851 apr  6 22:43 fp.log.zpq

twekkel avatar Apr 06 '20 20:04 twekkel

@twekkel, thank you for testing! I see you used -m511, which specifies a much larger block size, 2^11 MB - 4096 bytes (2GB), versus the default value, 2^4 MB - 4096 bytes (16MB).

What this implies is that a heuristic approach based on window size and available RAM may be needed to determine the best block size to use. At least the program did not blow up! Right now, the defaults are Level = 1..5, 4, 128, 0:

Higher compression levels are slower but compress better. "1" is good for most purposes. "0" does not compress. "2" compresses slower but decompression is just as fast as 1. "3", "4", and "5" also decompress slower. The numeric arguments are as follows:

N1: 0..11 = block size of at most 2^N1 MiB - 4096 bytes (default 4).
N2: 0..255 = estimated ease of compression (default 128).
N3: 0..3 = data type. 1 = text, 2 = exe, 3 = both (default 0).

May I ask you to try your test again with -m54 so we have a 1:1 comparison? Thank you again!
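To make the comparison concrete, a method string like -m54 or -m511 decomposes into a level digit and a block-size exponent. A minimal sketch; decode_method is a hypothetical helper for illustration, not part of zpaq:

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper: decode a zpaq "-mLN1" method string such as "54"
// or "511" into its compression level and maximum block size in bytes.
struct Method {
    int level;          // first digit: compression level 0..5
    int n1;             // remaining digits: block size exponent 0..11
    int64_t max_block;  // 2^N1 MiB - 4096 bytes
};

Method decode_method(const std::string &m) {
    Method r;
    r.level = m[0] - '0';
    r.n1 = std::stoi(m.substr(1));
    r.max_block = (int64_t(1) << (20 + r.n1)) - 4096;
    return r;
}
```

So -m54 is level 5 with blocks of at most 16 MiB - 4096 bytes, while -m511 is level 5 with blocks of at most 2 GiB - 4096 bytes, which is why the two runs are not directly comparable.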

pete4abw avatar Apr 06 '20 21:04 pete4abw

Thanks for the explanation... (I would prefer the levels to be more configurable.)

zpaq -m54 (for this specific logfile) is about 3 times slower than lrzip -z -L9, but the resulting compressed file is also only half the size.

$ zpaq a fp.log.zpaq fp.log -m54
zpaq v7.15 journaling archiver, compiled Aug 24 2018
Creating fp.log.zpaq at offset 0 + 0
Adding 20.617071 MB in 1 files -method 54 -threads 8 at 2020-04-07 19:02:11.
81.39% 0:00:00 [1..161] 16760563 -method 54,209,1
100.00% 0:00:00 + fp.log 20617071
100.00% 0:00:00 [162..196] 3857308 -method 54,207,1
1 +added, 0 -removed.

0.000000 + (20.617071 -> 20.617071 -> 0.404632) = 0.404632 MB
57.358 seconds (all OK)

Test file: http://www.maximumcompression.com/data/files/log-test.rar. My use case is mostly archiving large amounts of system/application logfiles.

twekkel avatar Apr 07 '20 19:04 twekkel

@twekkel, this is instructive. What I am noticing is that zpaq has code which heuristically determines N2 and N3. N2 is ease of compression (higher = easier), and N3 is set to text only. This really does impact how well zpaq will perform. The libzpaq.cpp library does not have this code. Remember, lrzip preprocesses data prior to compression (rzip) and serves a broad range of data types and compressors. Also, a 20MB file is very small. I normally test with files around 1GB.

No guarantees on how to improve this. One possibility is during the lzo test: if it shows the data is highly compressible, maybe we can tweak the settings. Another option is to permit the method as a parameter to -z/--zpaq. But we have SO MANY OPTIONS already.

In your case, try this: lrzip -n -f -v -L9 fp.log, then zpaq a fp.log.lrz.zpaq fp.log.lrz -m511.

This will show whether the rzip precompression aids zpaq for your log files or not. Share your output, please.

PS. I've made small changes so that block sizes will always be maximized based on input size. I'll push them when I get to it. Thank you.

pete4abw avatar Apr 07 '20 19:04 pete4abw

log  -> lrz      -> lrz.zpaq
20MB -> 3.2 MB   -> 828772 bytes (slightly worse than the -z -L9 option)
     -> 15.6 sec -> 12.6 sec = 28.2 seconds total (so slower than -z -L9)

twekkel avatar Apr 07 '20 19:04 twekkel

I pushed some changes that will allow the max possible block size (N1). It may help. But to be honest, lrzip can't be customized for every situation; that would create code and maintenance chaos. Version 7.15 with lrzip is better than before and faster also. lrzip still beats zpaq for speed. And frankly, the difference of 20-30 MB is not serious. In fact, were your log files 2GB or so, you would see the benefits of rzip and better compression. YMMV. Thank you for testing.

PS. use -v or -vv when testing please.

pete4abw avatar Apr 07 '20 20:04 pete4abw

Pulled the changes but do not see a noticeable difference; the compressed size is identical to the initial version of this branch. The two-step compression approach also performs identically. I'm not planning on using (much) larger files than this... so I'm not fully utilizing the potential of the long-range matching. I think I will stick to mzpq. Thanks for your support!

$ ./lrzip -z -vv -L9 fp.log
The following options are in effect for this COMPRESSION.
Threading is ENABLED. Number of CPUs detected: 8
Detected 8098709504 bytes ram
Compression level 9
Nice Value: 19
Show Progress
Max Verbose
Temporary Directory set as: ./
Compression mode is: ZPAQ. LZO Compressibility testing enabled
Heuristically Computed Compression Window: 51 = 5100MB
Storage time in seconds 1366952498
Output filename is: fp.log.lrz
File size: 20617071
Succeeded in testing 20617071 sized mmap for rzip pre-processing
Will take 1 pass
Chunk size: 20617071
Byte width: 4
Succeeded in testing 1077581679 sized malloc for back end compression
Using up to 9 threads to compress up to 10485760 bytes each.
Beginning rzip pre-processing phase
hashsize = 4194304.  bits = 22. 64MB
1036583 total hashes -- 174000 in primary bucket (16.786%)
Malloced 2699567104 for checksum ckbuf
Starting thread 0 to compress 1512607 bytes from stream 0
Starting thread 1 to compress 1696021 bytes from stream 1
lzo testing OK for chunk 1696021. Compressed size = 49.69% of chunk, 1 Passes
lzo testing OK for chunk 1512607. Compressed size = 71.09% of chunk, 1 Passes
Writing initial chunk bytes value 4 at 24
Writing EOF flag as 1
Writing initial header at 30
Compthread 0 seeking to 9 to store length 4
Compthread 0 seeking to 26 to write header
Thread 0 writing 599248 compressed bytes from stream 0
Compthread 0 writing data at 39
Compthread 1 seeking to 22 to store length 4
Compthread 1 seeking to 599287 to write header
Thread 1 writing 221382 compressed bytes from stream 1
Compthread 1 writing data at 599300
MD5: 059d1d86734a3478c29e862bf2026684
matches=174660 match_bytes=18921050 literals=96661 literal_bytes=1696021 true_tag_positives=97450263 false_tag_positives=53681033 inserts=3658024 match 11.156
fp.log - Compression Ratio: 25.120. Average Compression Speed: 0.792MB/s.
Total time: 00:00:23.49

twekkel avatar Apr 07 '20 20:04 twekkel

Thank you for your interest and your feedback. There are improvements in compression levels generally, and certainly speed improvements. But as we learned, for very small files like yours, and easily compressed ones, there is little benefit.

pete4abw avatar Apr 07 '20 21:04 pete4abw

On a 1730507223 byte CSV file I tested, lrzip with the updated libzpaq does manage to do better than current lrzip, but still does worse than zpaq:

current lrzip (-z -L9 -U -p 1):              289193032 bytes
lrzip with updated libzpaq (-z -L9 -U -p 1): 274051667 bytes
zpaq (-m511):                                264586427 bytes

lilyanatia avatar Apr 08 '20 00:04 lilyanatia

This will be the final change on this version and branch. See #146. Further development will be conducted on my fork, on the LZMA SDK 19.00 branch. Thank you for your review and testing.

I added code in stream.c that will heuristically compute block size and ease of compression, and construct the method in the form of Level+BlockSize, Ease, Type: LBS,E,T. Data type will always be 0. Using lrzip -zv will expose the method. The code used is:

/* Compression level can be 1 to 5, zpaq version 7.15 */
/* Levels 1 and 2 produce worse results and are omitted */
zpaq_level = control->compression_level / 4 + 3;  /* levels 3, 4, and 5 only */
/* block size is >= buffer size expressed as 2^bs MB */
zpaq_bs = 1 + log2(c_size / (1024 * 1024));
if (zpaq_bs > 11) zpaq_bs = 11;
else if (zpaq_bs < 1) zpaq_bs = 1;
zpaq_ease = 255 - (compressibility * 2.55);  /* 0 = hard, 255 = easy. Inverse of lzo_compresses */
if (zpaq_ease < 25) zpaq_ease = 25;          /* too low a value fails */
zpaq_type = 0;                               /* default, binary data */
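The heuristic can be sketched as self-contained functions. Variable names follow the snippet above; the assumption (mine, not stated in the code) is that c_size is the chunk size in bytes and compressibility is a 0..100 percentage from the lzo test:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

// Block size exponent N1: smallest power-of-two MB size covering the
// chunk, clamped to zpaq's valid range 1..11.
int zpaq_block_size(int64_t c_size) {
    int bs = 1 + (int)std::log2((double)c_size / (1024 * 1024));
    return std::min(11, std::max(1, bs));
}

// Ease of compression N2: inverse of the lzo compressibility estimate,
// scaled to 0..255 and floored at 25 (too low a value fails).
int zpaq_ease(double compressibility) {
    int ease = (int)(255 - compressibility * 2.55);
    return std::max(25, ease);
}
```

For example, the 20617071-byte fp.log chunk from the earlier test yields a block size exponent of 5 (32 MB blocks), while the 1730507223-byte CSV file saturates at the maximum exponent of 11.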

ZPAQ block sizes are automatically reduced to the size of the block actually being compressed, so setting a maximum block size could be reduced anyway. This code sets the block size appropriately for compression.

A major change was to limit the compression levels to 3, 4, and 5, not 1-5. Compression levels 1 and 2 performed much worse across the board. lrzip levels will convert to zpaq levels as follows:

LRZIP Level:  1 2 3 4 5 6 7 8 9
ZPAQ 7 Level: 3 3 3 4 4 4 4 5 5
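The mapping in the table follows directly from the integer division in the stream.c snippet above (a minimal sketch):

```cpp
#include <cassert>

// lrzip compression level (1..9) to zpaq 7.15 level (3..5),
// per zpaq_level = control->compression_level / 4 + 3.
int lrzip_to_zpaq_level(int lrzip_level) {
    return lrzip_level / 4 + 3;
}
```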

Some results for two different tar files, comparing against the current version. There is little difference in size, but speed is improved for both compression and decompression. Of course, 7.15 has bug fixes as well. Reference values for standard lrzip LZMA compression/decompression are included.

File           Size         Comp Zpaq5   Comp Zpaq7   DeComp Zpaq5  DeComp Zpaq7  Comp LZMA   DeComp LZMA  Description
dm.tar         447,119,360  5.308/1:56   5.323/1:40   2:05          1:45          5.086/0:41  0:14         Digital Mars C++ compiler
linux-5.6.tar  957,614,080  10.382/3:46  10.328/3:23  3:52          3:47          8.060/1:43  0:18         Linux 5.6 kernel source

/tmp/lrzip.631$ ./lrzip -zP -L7 ../dm.tar
Output filename is: ../dm.tar.lrz
../dm.tar - Compression Ratio: 5.308. Average Compression Speed: 3.672MB/s.
Total time: 00:01:56.07
Average DeCompression Speed: 3.580MB/s
[OK] - 447119360 bytes
Total time: 00:02:05.78

/tmp/lrzip.zpaq$ ./lrzip -zP -L7 -S .zpaq.lrz ../dm.tar
Output filename is: ../dm.tar.zpaq.lrz
../dm.tar - Compression Ratio: 5.323. Average Compression Speed: 4.260MB/s.
Total time: 00:01:40.45
Average DeCompression Speed: 4.019MB/s
[OK] - 447119360 bytes
Total time: 00:01:45.55

/tmp/lrzip.631$ ./lrzip -zP -L7 ../linux-5.6.tar
Output filename is: ../linux-5.6.tar.lrz
../linux-5.6.tar - Compression Ratio: 10.382. Average Compression Speed: 4.022MB/s.
Total time: 00:03:46.57
Average DeCompression Speed: 3.918MB/s
[OK] - 957614080 bytes
Total time: 00:03:52.74

/tmp/lrzip.zpaq$ ./lrzip -zP -L7 -S .zpaq.lrz ../linux-5.6.tar
Output filename is: ../linux-5.6.tar.zpaq.lrz
../linux-5.6.tar - Compression Ratio: 10.328. Average Compression Speed: 4.498MB/s.
Total time: 00:03:23.29
Average DeCompression Speed: 4.040MB/s
[OK] - 957614080 bytes
Total time: 00:03:47.63

Reference Compression/Decompression

/tmp/lrzip.631$ ./lrzip -P -L7 -S .lzma.lrz ../dm.tar
Output filename is: ../dm.tar.lzma.lrz
../dm.tar - Compression Ratio: 5.086. Average Compression Speed: 10.143MB/s.
Total time: 00:00:41.24
Average DeCompression Speed: 30.429MB/s
[OK] - 447119360 bytes
Total time: 00:00:14.07

/tmp/lrzip.631$ ./lrzip -P -L7 -S .lzma.lrz ../linux-5.6.tar
Output filename is: ../linux-5.6.tar.lzma.lrz
../linux-5.6.tar - Compression Ratio: 8.060. Average Compression Speed: 8.864MB/s.
Total time: 00:01:43.11
Average DeCompression Speed: 53.706MB/s
[OK] - 957614080 bytes
Total time: 00:00:18.84

pete4abw avatar Apr 10 '20 13:04 pete4abw