For help: I want to replace zlib with zstd in my project (like zsync); it works now, but compression speed is quite slow.

Open sisong opened this issue 2 years ago • 21 comments

During decompression, the input must be able to alternate between locally available uncompressed data blocks and compressed data blocks downloaded from the network;
so I used ZSTD_DCtx_refPrefix() for decompression, and ZSTD_CCtx_refPrefix() for compression.
Method 0,
Compress Code:

s=ZSTD_createCCtx();
ZSTD_CCtx_setParameter(s,ZSTD_c_compressionLevel,21);
ZSTD_CCtx_setParameter(s,ZSTD_c_windowLog,23); 
dictSize=(1<<23); //8M
blockSize=1024*2; //2k
for (i=0;i<blocks.size();++i){
	block=blocks[i];
	ZSTD_CCtx_reset(s,ZSTD_reset_session_only);
	ZSTD_CCtx_refPrefix(s,block.dataBuf-dictSize,dictSize);//NOTE: blocks[i+1].dataBuf=blocks[i].dataBuf+blockSize;
	ZSTD_inBuffer  s_input ={block.dataBuf,blockSize,0}; 
	ZSTD_outBuffer s_output={block.codeBuf,block.codeCapacitySize,0};
	ZSTD_compressStream2(s,&s_output,&s_input,(i+1==blocks.size())?ZSTD_e_end:ZSTD_e_flush);
	block.codeSize=s_output.pos;
}

Decompress Code:

s=ZSTD_createDCtx();
for (i=0;i<blocks.size();++i){
	block=blocks[i];
	if (block.isCanLoadFromLocalFile()){
		block.loadUncompressedFromLocalFile(block.dataBuf,blockSize);
		continue;
	}
	block.downloadCompressedFromNet(block.codeBuf,block.codeSize);
	ZSTD_DCtx_reset(s,ZSTD_reset_session_only);
	ZSTD_DCtx_refPrefix(s,block.dataBuf-dictSize,dictSize);
	ZSTD_inBuffer  s_input ={block.codeBuf,block.codeSize,0}; 
	ZSTD_outBuffer s_output={block.dataBuf,blockSize,0};
	ZSTD_decompressStream(s,&s_output,&s_input);
	assert(s_output.pos==blockSize);
}

This code works well (smaller downloads & faster decompression than zlib), but the compression speed is very slow (0.004 MB/s).

Method 1: I tried using ZSTD_compressBlock()/ZSTD_insertBlock(), but got the same compression speed (the compressed data is slightly smaller than with ZSTD_compressStream2).
Compress Code:

s=ZSTD_createCCtx();
...
for (i=0;i<blocks.size();++i){
	block=blocks[i];
	ZSTD_CCtx_reset(s,ZSTD_reset_session_only);
	ZSTD_compressBegin_usingDict(s,block.dataBuf-dictSize,dictSize,21);//NOTE: blocks[i+1].dataBuf=blocks[i].dataBuf+blockSize;
	block.codeSize=ZSTD_compressBlock(s,block.codeBuf,block.codeCapacitySize,block.dataBuf,blockSize);
}

Method 2: I tried moving ZSTD_compressBegin_usingDict() out of the for loop and removing ZSTD_CCtx_reset(...).
The compression speed is OK, but the compressed data size is much larger than zlib's.
Was the compression context lost? Did I do something wrong?

Method 3 (purely as an experiment): I removed ZSTD_CCtx_refPrefix() and ZSTD_CCtx_reset(...) from Method 0. The compression speed is OK & the compressed data size is OK, but decompression, as expected, fails quickly.
I then tried ZSTD_insertBlock() on the uncompressed block data during decompression (to update the dictionary window? also only a test); now some blocks work and some blocks fail.

How should I replace zlib with zstd in my zsync-like project?

sisong avatar Dec 12 '22 05:12 sisong

Could you use flush() at block boundaries instead ?

Cyan4973 avatar Dec 12 '22 08:12 Cyan4973

Yes, I can use that; I tested it, still calling ZSTD_compressStream2(s,...,ZSTD_e_flush): it runs OK, at the same slow speed.
Over 99% of the CPU time is spent rebuilding the match tree, even though only 2 KB of the 8 MB prefix data changes each time.

sisong avatar Dec 12 '22 11:12 sisong

I notice that you are using compression level 21. That compression level is extremely slow. Can you try level 1 or level 3?

embg avatar Dec 12 '22 16:12 embg

One other thing to be mindful of: if you do use ZSTD_CCtx_refPrefix, be aware that it is an O(dictSize) operation. If dictSize is 8MB, that is going to be very expensive.

embg avatar Dec 12 '22 16:12 embg

You seem to cut your output into multiple frames. But these frames are not independent, since they depend on previous history (8 MB in your example). So, effectively, there is very little point in creating separated frames. This could be a single frame, with arbitrarily selected flush points (every 2 KB in your example).

In which case, there is no need to employ refPrefix() between flush points. You can just continue streaming compression, using the selected 8 MB compression window. And it would not result in "the same speed", but be drastically faster. Compression ratio should remain similar, if not slightly better.
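
For illustration, a minimal sketch of that single-frame, flush-per-block approach, reusing the block/dataBuf/codeBuf fields from the code above (error handling omitted):

s=ZSTD_createCCtx();
ZSTD_CCtx_setParameter(s,ZSTD_c_compressionLevel,19);
ZSTD_CCtx_setParameter(s,ZSTD_c_windowLog,23); //the codec itself keeps the last 8 MB as history
for (i=0;i<blocks.size();++i){
	block=blocks[i];
	//no ZSTD_CCtx_reset / ZSTD_CCtx_refPrefix here: the frame's own window plays the role of the prefix
	ZSTD_inBuffer  s_input ={block.dataBuf,blockSize,0};
	ZSTD_outBuffer s_output={block.codeBuf,block.codeCapacitySize,0};
	ZSTD_compressStream2(s,&s_output,&s_input,(i+1==blocks.size())?ZSTD_e_end:ZSTD_e_flush);
	block.codeSize=s_output.pos; //one flushed chunk per 2 KB block, all inside a single frame
}

The decompression side then has to feed these chunks, in order, into a single ZSTD_decompressStream() session.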

What's unclear to me is the need to cut input into very small 2 KB blocks. Random access comes to mind, but this is invalidated by the fact that these blocks are not independent. Furthermore, there is a large imbalance between the 8 MB history window size and the small 2 KB block size. I fail to understand the benefits of this approach.

Cyan4973 avatar Dec 12 '22 19:12 Cyan4973

"it is an O(dictSize) operation" Yes, this is that's the reason for slow compressing. Thanks @embg

"Random access comes to mind" Yes, I need ”Random access“ compressed blocks when decompress (Note the Decompress Code of Method 0) @Cyan4973
Using refPrefix I have already supported it excellent(ignore slow compressing);
i.e. some compressed blocks (less, & download from Internet by http(s)), together with uncompressed blocks (most, & load from local file), restore to uncompressed full data.

sisong avatar Dec 12 '22 23:12 sisong

"Random access comes to mind" Yes, I need ”Random access“ compressed blocks when decompress (Note the Decompress Code of Method 0) @Cyan4973 Using refPrefix I have already supported it excellent(ignore slow compressing);

I don't understand that statement. It seems the dictionary required for block N contains all previous blocks produced before N up to 8 MB of depth. This, by definition, makes block N dependent on a substantial quantity of previous blocks.

The decompression code mentioned in the original request only works because all blocks are decompressed in order, in a streaming fashion. It's not clear to me how it could support random access.

Cyan4973 avatar Dec 12 '22 23:12 Cyan4973

Not really random access.

My project syncs a large file, like rsync or zsync.
server: the large file has a new version B; B is cut into blocks by block size (2 KB); blocksInfoB holds the info of every block.
client: has the old file version A; it downloads blocksInfoB and finds it needs to download some changed blocks from the server.
For example, the blocks in B are [a,b,c,d,e,f,g,h].
Blocks [a,b,c,e,f] are found in the local file A; the client needs to download d,g,h from the server.
If d,g,h are stored uncompressed, updating old A to new version B is easy.

To reduce the download size, the server can compress every block individually, once.
server: compressed blocks in B [az,bz,cz,dz,ez,fz,gz,hz]; az=compress(a), bz=compress(b),...
client: downloads dz,gz,hz from the server, & d=decompress(dz), g=decompress(gz), h=decompress(hz); updating old A to new version B is still simple.

To reduce the download size further: using a prefix when compressing each block gives a better compression ratio than compressing each block on its own.
server: compressed blocks in B [ap,bp,cp,dp,ep,fp,gp,hp];
ap=compress(a,prefix()), bp=compress(b,prefix(a)), ..., gp=compress(g,prefix(c,d,e,f)), hp=compress(h,prefix(d,e,f,g));
client: downloads dp,gp,hp from the server, & d=decompress(dp,prefix(a,b,c)), g=decompress(gp,prefix(c,d,e,f)), h=decompress(hp,prefix(d,e,f,g));
When decompressing block X, X's prefix is always available; the prefix data is loaded from the local file or has already been downloaded from the server.
Updating old A to new version B works, with a smaller download size.

zlib can use deflate()+deflatePrime() / inflateSetDictionary()+inflate() to implement this optimization, and it runs very fast.
My project has used ZSTD_DCtx_refPrefix()/ZSTD_CCtx_refPrefix() to implement the same optimization,
but ZSTD_CCtx_refPrefix() runs very slowly with an 8 MB prefix.
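
For comparison, a rough sketch of the per-block dictionary round trip in zlib (raw deflate with deflateSetDictionary()/inflateSetDictionary(); this shows the general pattern only, not necessarily the exact calls my project uses, and dictSize must be <= 32 KB here):

#include <string.h>
#include <zlib.h>

z_stream zc; memset(&zc,0,sizeof(zc));
deflateInit2(&zc,6,Z_DEFLATED,-15,8,Z_DEFAULT_STRATEGY); //raw deflate, 32 KB window
deflateSetDictionary(&zc,(const Bytef*)(block.dataBuf-dictSize),(uInt)dictSize);
zc.next_in =(Bytef*)block.dataBuf;  zc.avail_in =blockSize;
zc.next_out=(Bytef*)block.codeBuf;  zc.avail_out=block.codeCapacitySize;
deflate(&zc,Z_FINISH);
block.codeSize=zc.total_out;
deflateEnd(&zc);

z_stream zd; memset(&zd,0,sizeof(zd));
inflateInit2(&zd,-15); //raw inflate: the dictionary may be set before inflate()
inflateSetDictionary(&zd,(const Bytef*)(block.dataBuf-dictSize),(uInt)dictSize);
zd.next_in =(Bytef*)block.codeBuf;  zd.avail_in =block.codeSize;
zd.next_out=(Bytef*)block.dataBuf;  zd.avail_out=blockSize;
inflate(&zd,Z_FINISH);
inflateEnd(&zd);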

sisong avatar Dec 13 '22 05:12 sisong

Loading an 8 MB dictionary to compress a single 2 KB block is a disproportionate effort, especially when using very high compression levels. Compared to loading such a large dictionary, the 2 KB block compression workload is negligible: all the CPU cost goes into loading the dictionary.

For comparison, zlib uses only a 32 KB dictionary, so that's considerably smaller, hence considerably less energy.

You could start at comparable levels, by using smaller dictionaries, and testing size increases up to the moment it feels uncomfortable.

Another complementary strategy is to employ lower compression levels. --ultra levels develop dynamic binary trees while loading dictionaries, which is extremely expensive. It's worth it for a dictionary which is re-used multiple times, but when it's used only once, that's a huge cost. If you were to employ something simpler, like a hash row (used in lazy2 strategy), loading the dictionary would become considerably faster. Try smaller compression levels, you are likely going to experience a cliff at strategy changes. Notable cliffs should be expected at level 15 (btlazy2) and 12 (lazy2) for dictionaries > 256 KB.
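
For example, instead of relying on the level table, the strategy can be pinned explicitly through the advanced parameters (a sketch; the level value here is only illustrative):

ZSTD_CCtx_setParameter(cctx,ZSTD_c_compressionLevel,19); //other settings still derive from the level
ZSTD_CCtx_setParameter(cctx,ZSTD_c_strategy,ZSTD_lazy2); //hash-row search: no binary tree is built while loading the 8 MB prefix
ZSTD_CCtx_setParameter(cctx,ZSTD_c_windowLog,23);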

If you want to keep your compression strategy as described, all speed optimizations should be focused on dictionary loading.

Note that there are other possible strategies, such as using the --patch-from mode of zstd, which finds and compresses only the difference between 2 versions, assuming both encoding and decoding sides have access to the old version. Note that its managed history window is limited to < 2 GB for the time being, but for small enough data that fit into the window, it can be an interesting reference to compare to.
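
For reference, a sketch of how the --patch-from idea maps onto the library API (the buffer names are illustrative, and both sides must keep the old version available):

ZSTD_CCtx* cctx=ZSTD_createCCtx();
ZSTD_CCtx_setParameter(cctx,ZSTD_c_compressionLevel,19);
ZSTD_CCtx_setParameter(cctx,ZSTD_c_enableLongDistanceMatching,1);
ZSTD_CCtx_setParameter(cctx,ZSTD_c_windowLog,27); //the window must cover oldSize
ZSTD_CCtx_refPrefix(cctx,oldBuf,oldSize); //the old version acts as the dictionary
size_t patchSize=ZSTD_compress2(cctx,patchBuf,patchCapacity,newBuf,newSize);

ZSTD_DCtx* dctx=ZSTD_createDCtx();
ZSTD_DCtx_setParameter(dctx,ZSTD_d_windowLogMax,27); //accept the large window
ZSTD_DCtx_refPrefix(dctx,oldBuf,oldSize);
size_t newSizeOut=ZSTD_decompressDCtx(dctx,newBuf,newCapacity,patchBuf,patchSize);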

Cyan4973 avatar Dec 13 '22 08:12 Cyan4973

Thank you for your reply and advice. @Cyan4973
Those options have a cost, though, such as lowering the compression ratio or using a large amount of memory when decompressing.

And zlib spends O(2K) time compressing each block, not O(dictSize) time.

I have thought of two ways:

  1. Building the new dictionary tree is currently an O(dictSize) operation, which is very expensive.
    A more fine-grained API could update 2 KB of data on an already built dictionary tree in O(2K) time. For example:
prefixDict=ZSTD_build_prefixDict(windowLog,init_prefix_datas);  //0<=sizeof(init_prefix_datas)<=(1<<windowLog) 
ZSTD_CCtx_refPrefix(cctx,prefixDict);
......
ZSTD_update_prefixDict(prefixDict,next_2k_prefix_datas);
  2. Use the current ZSTD_compressStream2, with no additional dict or prefix, but provide a new API for decompression:
//update the internal dict as if this data had been decompressed by dctx (into `dctx` history)
size_t ZSTD_DCtx_insertDecompressed(ZSTD_DCtx* dctx,const byte* decompressed, size_t decompressedSize); 

sisong avatar Dec 13 '22 12:12 sisong

I tested zlib and zstd refPrefix compression speed.

test file linux_5.19.9.src.tar, file size 1269637120 bytes, compressed in 4 KB blocks:

zlib  level  6  dict 16k, compressed size 233999388 byte  time    25.5s
zlib  level  6  dict 32k, compressed size 220927951 byte  time    28.9s
zlib  level  9  dict 32k, compressed size 217322763 byte  time    49.2s
zstd  level 13  dict 16k, compressed size 232242432 byte  time   194.3s
zstd  level 13  dict 32k, compressed size 219546349 byte  time   368.9s
zstd  level 22  dict 32k, compressed size 208475681 byte  time   725.3s
zstd  level  3  dict  8m, compressed size 215998750 byte  time  1982.4s
zstd  level 13  dict  1m, compressed size 176035926 byte  time 17581.4s

sisong avatar Dec 19 '22 02:12 sisong

With a 32 KB dictionary size, I would expect the compression levels to use this table. In which case, level 13 is still using a very slow binary tree.

Assuming compression levels are indeed using this table, I would expect a first noticeable impact at level 12, with btlazy2, which is still a binary tree, but with a lazy resolution, which should load dictionary faster. Then, the next large performance impact would happen at level 10, with lazy2, which totally gets rid of the binary tree.

Since there is some uncertainty as to which level uses what, I would recommend testing multiple compression levels, to detect interesting inflection points.

Cyan4973 avatar Dec 19 '22 03:12 Cyan4973

The point is not the level but the complexity of the algorithm; O(dictSize) is not acceptable.
I had to drop support for zstd in my project.

sisong avatar Dec 19 '22 04:12 sisong

  1. If speed is important to you, you should avoid using very high compression levels.
  2. To be clear, zlib is also O(dictSize), they're just limited to very small dictionaries.

You should try an apples-to-apples comparison, using the same compression level and the same dictionary size between zstd and zlib.

I'm also curious how you've built/integrated zstd. Are you building it yourself? Have you built it with -O3 -DNDEBUG?

felixhandte avatar Dec 19 '22 18:12 felixhandte

The source code of both was compiled at the same time with VC++ 2019, with /O2 & NDEBUG.

test file linux_5.19.9.src.tar, file size 1269637120 bytes, compressed in 2 KB blocks:
zlib  level  6  dict 16k, compressed size 243128143 byte  time 28.6s
zlib  level  6  dict 32k, compressed size 229640798 byte  time 30.7s
zstd  level 10  dict 16k, compressed size 243642768 byte  time 40.4s
zstd  level 10  dict 32k, compressed size 230432524 byte  time 86.3s

My project is a sync project; I want the server to compress the data with a small blockSize, a large dictSize & level, and only once.
All client users download from the server only the blocks that are missing locally; the client inserts local uncompressed prefix data into the dctx in order to decompress each downloaded compressed block.

(Disregarding level & blockSize for now:)
The time complexity of running this compression test is O(fileSize * dictSize) for zstd, but O(fileSize) for zlib,
because zlib supports inflateSetDictionary(), which can insert uncompressed data into the decompressor's history.

I'm now trying to implement a similar function in zstd: size_t ZSTD_DCtx_insertDecompressed(ZSTD_DCtx* dctx,const byte* decompressed, size_t decompressedSize);
(I did an experiment earlier: I compressed the local data with the same parameters and fed the result to the dctx, and that worked fine, but decompression failed after I lowered the compression level.)

sisong avatar Dec 20 '22 01:12 sisong

How can a functionality which is only active during decompression make any difference to compression speed ?

Cyan4973 avatar Dec 20 '22 02:12 Cyan4973

Because the closest existing function in zstd is ZSTD_DCtx_refPrefix(dctx,prefix,dictSize),
which requires the data to have been compressed with a matching ZSTD_CCtx_refPrefix(cctx,prefix,dictSize); the time complexity of ZSTD_CCtx_refPrefix is O(dictSize), so the total compression time is O(fileSize * dictSize).
I want a ZSTD_DCtx_insertDecompressed, which would only need a matching ZSTD_compressStream2(cctx,out,inBlockData,ZSTD_e_flush) on the compression side, so the total time is the normal O(fileSize).

sisong avatar Dec 20 '22 03:12 sisong

OK, so this is back to employing streaming + flushing on the compression side.

Faking a ZSTD_DCtx_insertDecompressed() method could be quite easy, just prefix the blocks with a fake block header signifying that it's an uncompressed block, and it should be good to go.
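
For instance, a sketch of such a fake header, following the 3-byte block header layout of RFC 8878 (bit 0 = Last_Block, bits 1-2 = Block_Type where 0 means Raw_Block, bits 3-23 = Block_Size), assuming a frame is already open on the decompression stream:

//prepend these 3 bytes plus the blockSize bytes of local uncompressed data,
//and feed them to ZSTD_decompressStream() in place of a downloaded compressed block
unsigned char hdr[3];
unsigned v=((unsigned)blockSize<<3) | (0u<<1) /*Raw_Block*/ | (isLastBlock?1u:0u);
hdr[0]=(unsigned char)(v);
hdr[1]=(unsigned char)(v>>8);
hdr[2]=(unsigned char)(v>>16);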

Employing this method, the main difference is that it produces blocks, instead of producing frames. Blocks are dependent on previous blocks by default, while frames are independent by default.

Unfortunately, dependency extends to a few more concepts :

  • History Window, the most obvious
  • RepCodes, that may be employed during the first 3 sequences
  • Literals and Sequences statistics, that may be re-employed

And, while it's theoretically possible to take care of all these dependencies and ensure that a newly produced block only depends on History Window, this is not what is currently happening in the reference implementation, and requires some work to make it happen.

Cyan4973 avatar Dec 20 '22 16:12 Cyan4973

@Cyan4973, yeah this is a topic that's come up at least once before that I can recall. This is of course an abuse of the zstd frame format... but it doesn't seem like it would be that hard to implement.

felixhandte avatar Dec 20 '22 17:12 felixhandte

This could indeed be done through advanced parameters. Invalidating previous statistics would be easy.

Invalidating repcodes is more complex though. It has been done for the multi-threading mode, so there is precedent, but the current implementation has restrictions (only at the start of a frame, and only in "single contiguous buffer" mode).

Cyan4973 avatar Dec 20 '22 17:12 Cyan4973

You could also implement it by inspecting the first few sequences after the matchfinding has run.

felixhandte avatar Dec 20 '22 18:12 felixhandte

@Cyan4973 @felixhandte Thank you for your prompt and detailed response.
I may have solved the issue, assuming no new bugs turn up.
I first tried constructing fake compressed data from the local uncompressed content and handing it to the decompressor; that works correctly, but when decompressing the downloaded compressed data, most of the blocks fail, and the failure usually occurs after part of the data has been decompressed. This means there are encoding dependencies between adjacent blocks.

I then tried another idea, to amortize the construction cost of the dictionary. Referring to ZSTD_createCDict_byReference(), I implemented these functions:

ZSTD_CDict* ZSTD_createCDict_delta(const void* dictBuffer,size_t dictSize,size_t allDeltaSize,
                                   int compressionLevel,unsigned long long pledgedSrcSize);
size_t ZSTD_updateCDict_delta(ZSTD_CDict* cdict,const void* dictDelta, size_t dictDeltaSize);

For an 8 MB dictionary,
I construct it by submitting only 7 MB: call ZSTD_createCDict_delta(dictBuffer,8M,1M,..);
then, after compressing each block, I submit another 2 KB: call ZSTD_updateCDict_delta(cdict,block,2K), at most 512 times.
After that the dictionary has to be recreated (it must be recreated because I don't know how to remove the earliest submitted data). All the source code is at https://github.com/sisong/zstd/commit/56326dd6477dc283da875f7827e8712cabdf40b0

sisong avatar Dec 22 '22 01:12 sisong

Sharing the dictionary initialization cost over a larger batch of blocks is certainly a good strategy, that will help compression speed.

It seems your selected strategy depends on the existence of new methods, notably ZSTD_updateCDict_delta(). As mentioned, we don't plan to support mutable or multi-step dictionary loaders in the general libzstd. Of course, you are totally free to fork zstd on your side to make it happen nonetheless.

Cyan4973 avatar Jan 07 '23 00:01 Cyan4973