
zstd_btultra corner case

Open terrelln opened this issue 4 years ago • 18 comments

zstd_btultra performs nearly 2x worse than zstd_opt on an edge case file.

> seq 100000 999999 > seqs
> zstd seqs -b16e19i0            
16#seqs              :   6300000 ->    321618 (19.59),  15.0 MB/s , 817.0 MB/s 
17#seqs              :   6300000 ->    323943 (19.45),  10.6 MB/s , 831.3 MB/s 
18#seqs              :   6300000 ->    618382 (10.19),  8.60 MB/s , 778.1 MB/s 
19#seqs              :   6300000 ->    618096 (10.19),  7.31 MB/s , 779.3 MB/s
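
For context, seq 100000 999999 emits 900,000 seven-byte lines (six digits plus a newline), which is where the 6,300,000-byte input size comes from. Adjacent lines usually differ only in their last digit, so almost every byte can be copied from 7 bytes earlier; the near-ideal parse is roughly one literal plus a 6-byte match at offset 7 per line, i.e. one repcode reused throughout. A small standalone check of that structure (not part of zstd, just an illustration):

#include <stdio.h>
#include <string.h>

int main(void)
{
    char prev[8] = "", cur[8];
    long lines = 0, repeatAt7 = 0;
    for (long n = 100000; n <= 999999; n++) {
        snprintf(cur, sizeof cur, "%ld\n", n);
        if (lines > 0) {
            /* byte i of this line sits 7 positions after byte i of the previous line */
            for (int i = 0; i < 7; i++) repeatAt7 += (cur[i] == prev[i]);
        }
        memcpy(prev, cur, 7);
        lines++;
    }
    printf("%ld lines, %ld of %ld bytes repeat at distance 7\n",
           lines, repeatAt7, lines * 7);
    return 0;
}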

terrelln avatar Jan 07 '21 20:01 terrelln

One difference I can think of is using fractional bit evaluation (btultra) instead of full bit approximation (btopt). It could trigger a different optimization gradient, which ends up being incorrect (?)
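
As a rough standalone sketch of the two pricing styles (this mirrors the idea behind the whole-bit and fractional-bit weights in zstd_opt.c, not their exact code; the 8-bit fractional accuracy matches the BITCOST_ACCURACY value visible in the diff further down): whole-bit pricing only moves in steps of a full bit, so many candidate parses tie, while 1/256-bit granularity ranks near-ties, which can send the parser down a different path.

#include <stdio.h>

#define BITCOST_ACCURACY   8
#define BITCOST_MULTIPLIER (1 << BITCOST_ACCURACY)

/* position of the highest set bit (floor(log2(v)) for v > 0) */
static unsigned highbit32(unsigned v)
{
    unsigned r = 0;
    while (v >>= 1) r++;
    return r;
}

/* whole-bit weight: cost only changes when the statistic crosses a power of 2 */
static unsigned bitWeight(unsigned stat)
{
    return highbit32(stat + 1) * BITCOST_MULTIPLIER;
}

/* fractional weight: adds a sub-bit linear correction on top of the whole-bit weight */
static unsigned fracWeight(unsigned stat)
{
    unsigned const s  = stat + 1;
    unsigned const hb = highbit32(s);
    return hb * BITCOST_MULTIPLIER + ((s << BITCOST_ACCURACY) >> hb);
}

int main(void)
{
    unsigned const stats[] = { 3, 5, 100, 1000 };
    /* note how stats 3 and 5 tie under whole-bit pricing but differ fractionally */
    for (int i = 0; i < 4; i++)
        printf("stat=%4u  whole-bit weight=%5u  fractional weight=%5u\n",
               stats[i], bitWeight(stats[i]), fracWeight(stats[i]));
    return 0;
}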

Cyan4973 avatar Jan 07 '21 21:01 Cyan4973

One difference I can think of is using fractional bit evaluation

That was my thought, since the regression is caused by btultra having 2x more literals. But I disabled fractional bits and didn't see a difference.

I'm hunting down every difference between btopt and btultra to find out what is causing it.

terrelln avatar Jan 07 '21 21:01 terrelln

These are the minimal changes that fix the issue (they select the whole-bit approximation branch and always take the "skip unpromising positions" shortcut). Making only one change or the other doesn't do anything; you need both.

diff --git a/lib/compress/zstd_opt.c b/lib/compress/zstd_opt.c
index e55c459d..995a8251 100644
--- a/lib/compress/zstd_opt.c
+++ b/lib/compress/zstd_opt.c
@@ -24,10 +24,10 @@
 *  Price functions for optimal parser
 ***************************************/
 
-#if 0    /* approximation at bit level */
+#if 1    /* approximation at bit level */
 #  define BITCOST_ACCURACY 0
 #  define BITCOST_MULTIPLIER (1 << BITCOST_ACCURACY)
-#  define WEIGHT(stat)  ((void)opt, ZSTD_bitWeight(stat))
+#  define WEIGHT(stat,opt)  ((void)opt, ZSTD_bitWeight(stat))
 #elif 0  /* fractional bit accuracy */
 #  define BITCOST_ACCURACY 8
 #  define BITCOST_MULTIPLIER (1 << BITCOST_ACCURACY)
@@ -1076,7 +1076,7 @@ ZSTD_compressBlock_opt_generic(ZSTD_matchState_t* ms,
 
             if (cur == last_pos) break;
 
-            if ( (optLevel==0) /*static_test*/
+            if ( (1||optLevel==0) /*static_test*/
               && (opt[cur+1].price <= opt[cur].price + (BITCOST_MULTIPLIER/2)) ) {
                 DEBUGLOG(7, "move to next rPos:%u : price is <=", cur+1);
                 continue;  /* skip unpromising positions; about ~+6% speed, -0.01 ratio */

With this diff it is:

16#seqs              :   6300000 ->    321644 (19.59),  15.6 MB/s , 865.0 MB/s 
17#seqs              :   6300000 ->    323967 (19.45),  11.6 MB/s , 885.4 MB/s 
18#seqs              :   6300000 ->    183556 (34.32), 10.33 MB/s , 889.6 MB/s 
19#seqs              :   6300000 ->    182603 (34.50),  8.37 MB/s , 891.8 MB/s 

So there seems to be some sort of pricing issue. My guess is literal pricing, since the alphabet size is 10.

terrelln avatar Jan 07 '21 21:01 terrelln

I think the root cause is actually a case of not being able to evaluate repcode choice.

At some point, zstd_btultra decides to take an offset instead of adding 2 literals + a repcode. I have to figure out why it made that choice.

That offset messes up the repcode history (rep0 = 2 lits + ml 5, rep1 = 1 lit + ml 6). But from that point on, "repcode 0" is so heavily favored that repcode0 + 2 lits wins over repcode1 + 1 lit.

The optimal parser doesn't know that using repcode1 puts it back into repcode0.
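
A simplified sketch of the 3-entry repcode history update may help here (it only mimics the rotation behaviour; the real ZSTD_updateRep() in zstd handles extra cases, such as repcode 0 with no preceding literals). It shows the situation described above: a full offset pushes the useful distance down to rep1, and taking repcode 1 brings it back into the rep0 slot, a benefit a single-state parser never gets to price.

#include <stdio.h>

typedef struct { unsigned rep[3]; } RepState;

/* a new full offset is pushed to the front; the oldest entry drops off */
static void useOffset(RepState* s, unsigned offset)
{
    s->rep[2] = s->rep[1];
    s->rep[1] = s->rep[0];
    s->rep[0] = offset;
}

/* repcode 1 or 2 moves the chosen entry to the front; repcode 0 changes nothing */
static void useRepcode(RepState* s, int i)
{
    unsigned const chosen = s->rep[i];
    if (i == 0) return;
    if (i >= 2) s->rep[2] = s->rep[1];
    s->rep[1] = s->rep[0];
    s->rep[0] = chosen;
}

static void show(const char* label, const RepState* s)
{
    printf("%-18s rep0=%u rep1=%u rep2=%u\n", label, s->rep[0], s->rep[1], s->rep[2]);
}

int main(void)
{
    RepState s = { { 7, 14, 21 } };  /* arbitrary starting history; 7 is the useful distance */
    show("start", &s);
    useOffset(&s, 42);               /* the "bad" full offset from the comment above */
    show("after offset 42", &s);     /* the useful 7 is now only rep1 */
    useRepcode(&s, 1);               /* taking repcode 1 ... */
    show("after repcode 1", &s);     /* ... restores 7 to rep0 for the next position */
    return 0;
}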

terrelln avatar Jan 07 '21 21:01 terrelln

These are results from running my multi-arrival branch: it stores the 8 best arrivals for each position, checks non-repcodes only against the best arrival, and checks repcodes against every arrival. It is very slow, but does much better on this corner case.

> ./zstd -b16e22 ~/datasets/seqs
16#seqs              :   6300000 ->    227770 (27.66),  3.25 MB/s , 781.7 MB/s 
17#seqs              :   6300000 ->    233354 (27.00),  3.03 MB/s , 781.2 MB/s 
18#seqs              :   6300000 ->    149662 (42.09),  1.86 MB/s , 789.2 MB/s 
19#seqs              :   6300000 ->    149389 (42.17),  1.85 MB/s , 788.8 MB/s
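
As a rough sketch of the data structure implied above (entirely hypothetical; the actual branch is not shown in this thread): keep the k best arrivals per position, each carrying its own price and repcode history, price ordinary matches only from the best arrival, and price repcode matches from every stored arrival, since a slightly dearer arrival may hold the repcode that makes the next match nearly free.

#include <stdio.h>

#define NB_ARRIVALS 8

typedef struct {
    unsigned price;   /* cost (in fractional bits) of reaching this position */
    unsigned rep[3];  /* repcode history carried along this particular parse */
} Arrival;

typedef struct {
    Arrival arr[NB_ARRIVALS];  /* kept sorted by price, best first */
    unsigned count;
} Position;

/* find an arrival whose history already holds the wanted distance as rep0 */
static int findRepArrival(const Position* p, unsigned wantedOffset)
{
    for (unsigned i = 0; i < p->count; i++)
        if (p->arr[i].rep[0] == wantedOffset) return (int)i;
    return -1;  /* no stored arrival can encode this match as repcode 0 */
}

int main(void)
{
    Position p = {
        .arr = { { 100, { 42, 7, 14 } },    /* cheapest arrival, but rep0 is the "bad" 42 */
                 { 103, {  7, 42, 14 } } }, /* slightly dearer arrival keeps rep0 = 7 */
        .count = 2
    };
    printf("arrival with rep0 == 7: index %d\n", findRepArrival(&p, 7));
    return 0;
}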

terrelln avatar Jan 07 '21 22:01 terrelln

I was testing this use case with a smaller sequence set, from 10000 to 99999, and was expecting to witness roughly the same outcome, with just a size difference.

It doesn't: the ranking of btopt and btultra is reversed (compared to sequences 100000 to 999999).

levels   Ratio   Note
1-2      x2.3    does not find match
3-7      x22     good! (greedy and lazy)
8-15     x10     worse :( (lazy2 and btlazy2)
16-17    x7      even worse... (btopt)
18-22    x19     better, but not "best" (btultra)

It shows that, depending on the gradient descent, itself dependent on initialization and, well, almost random luck, one or the other gets stuck in a "good" or "bad" state with regard to repcode selection, and then just stays there.

None of them is excellent though, and your multi-arrival experiment shows that it could be much better.

Cyan4973 avatar Jan 07 '21 23:01 Cyan4973

It shows that, depending on the gradient descent, itself dependent on initialization and, well, almost random luck, one or the other gets stuck in a "good" or "bad" state with regard to repcode selection, and then just stays there.

Yeah, it seems to be just getting lucky or not lucky.

I don't remember why I stopped working on the multi-arrival. Probably just got busy & it was still very slow. I'll try to pick it up at some point and work on speeding it up (or reserve it for its own strategy above zstd_btultra & use it at level 19+).

terrelln avatar Jan 08 '21 01:01 terrelln

@senhuang42 this file would be an interesting test case for block splitting with RLE.

It is extremely regular, so you should be able to build large blocks that are RLE for literal length, offset, and match length, and repeat for the literals, because they should all be about uniform probability.

There are ~900K sequences. With the multi-arrival parser all but 4562 are offset=0, ll=1, ml=6. You should be able to split blocks and improve the compression ratio significantly.

terrelln avatar Jan 08 '21 16:01 terrelln

@senhuang42 this file would be an interesting test case for block splitting with RLE.

How are these sequences generated? Is this something from datagen?

senhuang42 avatar Jan 08 '21 16:01 senhuang42

How are these sequences generated? Is this something from datagen?

With the command seq, as in: seq 10000 99999

With the multi-arrival parser all but 4562 are offset=0, ll=1, ml=6.

The issue is that the current parser is not yet able to generate sequences that clean.

Cyan4973 avatar Jan 08 '21 17:01 Cyan4973

With block splitter:

./zstd ~/seqs -b16e19i0
16#seqs              :   6300000 ->    174128 (36.18), 10.36 MB/s , 610.5 MB/s 
17#seqs              :   6300000 ->    170336 (36.99),  8.01 MB/s , 611.2 MB/s 
18#seqs              :   6300000 ->    391405 (16.10),  5.92 MB/s , 394.2 MB/s 
19#seqs              :   6300000 ->    394144 (15.98),  4.52 MB/s , 451.4 MB/s 

Does indeed seem to compress a lot better.

senhuang42 avatar Jan 08 '21 20:01 senhuang42

Nice! Is it emitting RLE blocks?

terrelln avatar Jan 08 '21 20:01 terrelln

I don't think it can emit RLE blocks per se, since that would require the source to be a single byte repeated many times, but it may be able to emit compressed blocks with sequences encoded in RLE mode. This however is a bit more difficult to assess (requires accessing the compressed block's sequence header).
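
For readers following the logs below, a reminder of the per-block table modes for the three sequence code streams (literal lengths, offsets, match lengths), using the names that zstd_compress_sequences.c prints; this is a paraphrase of zstd's internal enum, not a copy of it:

#include <stdio.h>

/* Each of the three sequence code streams picks one of these modes per block. */
typedef enum { set_basic, set_rle, set_compressed, set_repeat } symbolEncodingType_e;

static const char* describe(symbolEncodingType_e t)
{
    switch (t) {
    case set_basic:      return "set_basic: predefined default distribution, no table written";
    case set_rle:        return "set_rle: every sequence uses the same code, table costs one byte";
    case set_compressed: return "set_compressed: a fresh FSE table is written for this block";
    case set_repeat:     return "set_repeat: reuse the table from the previous block";
    }
    return "?";
}

int main(void)
{
    for (int t = set_basic; t <= set_repeat; t++)
        printf("%s\n", describe((symbolEncodingType_e)t));
    return 0;
}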

Cyan4973 avatar Jan 08 '21 20:01 Cyan4973

but it may be able to emit compressed blocks with sequences encoded in RLE mode. This however is a bit more difficult to assess (requires accessing the compressed block's sequence header).

Yeah that is what I meant. It should be printed in the debug logs by ZSTD_selectEncodingType().

terrelln avatar Jan 08 '21 20:01 terrelln

Nice! Is it emitting RLE blocks?

Yeah, the entropy builder is choosing RLE a decent number of times (both in the estimator and the entropy compressor, I believe)

../lib/compress/zstd_compress_sequences.c: Selected set_compressed 
../lib/compress/zstd_compress_sequences.c: Selected set_repeat 
../lib/compress/zstd_compress_sequences.c: Selected set_compressed 
../lib/compress/zstd_compress_sequences.c: Selected set_compressed 
../lib/compress/zstd_compress_sequences.c: Selected set_compressed 
../lib/compress/zstd_compress_sequences.c: Selected set_repeat 
../lib/compress/zstd_compress_sequences.c: Selected set_repeat 
../lib/compress/zstd_compress_sequences.c: Selected set_repeat 
../lib/compress/zstd_compress_sequences.c: Selected set_repeat 
../lib/compress/zstd_compress_sequences.c: Selected set_repeat 
../lib/compress/zstd_compress_sequences.c: Selected set_repeat

senhuang42 avatar Jan 08 '21 20:01 senhuang42

What is that log representing? I don't see any RLE in it.

terrelln avatar Jan 08 '21 20:01 terrelln

Oh nevermind, I think I had confused RLE and Repeat, I'll check again.

senhuang42 avatar Jan 08 '21 20:01 senhuang42

I added set_rle to the logs now; almost all of them are set_rle.

../lib/compress/zstd_compress_sequences.c: Selected set_rle 
../lib/compress/zstd_compress_sequences.c: Selected set_rle 
../lib/compress/zstd_compress_sequences.c: Selected set_rle 
../lib/compress/zstd_compress_sequences.c: Selected set_rle 

senhuang42 avatar Jan 08 '21 20:01 senhuang42