
Switch from DataDog/zstd to valyala/gozstd in tests and benchmarks

Open valyala opened this issue 5 years ago • 14 comments

DataDog/zstd has poor support for streaming, has potential streaming bugs, and is less optimized than valyala/gozstd. So I'd recommend switching to valyala/gozstd in tests and benchmarks.

Also valyala/gozstd vendors the latest upstream zstd release without any modifications.

valyala avatar Jun 02 '19 22:06 valyala

I wouldn't mind, especially since it performs measurably worse than DataDog's - both in terms of compression and speed ;)

Before I add it, I would just like to make sure that this is expected. I have:

file	out	level	insize	outsize	millis	mb/s
enwik9	zskp	1	1000000000	348027537	7499	127.16
enwik9	zstd	1	1000000000	358512784	5527	172.55
enwik9	zstd	2	1000000000	332265719	6497	146.79

gob-stream	zskp	1	1911399616	272529084	6956	262.02
gob-stream	zstd	1	1911399616	249054924	5069	359.54
gob-stream	zstd	2	1911399616	236984664	5297	344.06

adresser.001	zskp	1	1073741824	18235536	1211	845.58
adresser.001	zstd	1	1073741824	16758988	885	1155.76
adresser.001	zstd	2	1073741824	16920671	1092	937.73

rawstudio-mint14.tar	zskp	1	8558382592	3800966951	43398	188.07
rawstudio-mint14.tar	zstd	1	8558382592	3609530535	31977	255.24
rawstudio-mint14.tar	zstd	2	8558382592	3504208748	36672	222.57

10gb.tar	zskp	1	10065157632	5001038195	58193	164.95
10gb.tar	zstd	1	10065157632	4941154808	43283	221.77
10gb.tar	zstd	2	10065157632	4856098868	45106	212.81

I think only gob-stream level 1 comes out smaller; the rest are larger, and the speed is quite a bit lower than https://github.com/klauspost/compress/tree/master/zstd#performance

Is that the cost of adding CRC (in terms of speed)?

Using master/f50fb7adf06ebfa200c0146f4de069f9f42b693d

In terms of benchmarking, I am not doing much more than io.Copy from file -> readahead -> compressor -> io.Discard. The file is loaded in RAM.
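In sketch form, it's something like this (not the actual benchmark tool; it assumes readahead.NewReader wraps the file, and enwik9 is just a placeholder input):

package main

import (
	"io"
	"log"
	"os"

	"github.com/klauspost/compress/zstd"
	"github.com/klauspost/readahead"
)

func main() {
	f, err := os.Open("enwik9") // placeholder test file, already in the OS cache
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Read ahead asynchronously so the compressor never waits on I/O.
	ra := readahead.NewReader(f)
	defer ra.Close()

	// Compress straight into io.Discard; only throughput is measured.
	enc, err := zstd.NewWriter(io.Discard)
	if err != nil {
		log.Fatal(err)
	}
	if _, err := io.Copy(enc, ra); err != nil {
		log.Fatal(err)
	}
	if err := enc.Close(); err != nil {
		log.Fatal(err)
	}
}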

Oh yeah, the Datadog version I am using is v1.4.0 / 809b919c325d7887bff7bd876162af73db53e878.

klauspost avatar Jun 03 '19 18:06 klauspost

The performance difference looks unexpected.

Is that the cost of adding CRC (in terms of speed)?

Probably this is related to the different API functions used for stream compression.

I'll try to avoid copying to the internal buffer in valyala/gozstd when the buffer passed to Write exceeds the internal buffer size. This should improve performance for the case where the data is fully loaded in RAM before being passed to Write.
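Roughly, the change would look like this (a hypothetical sketch with invented names; not the actual gozstd code):

package sketch

// streamWriter stands in for the gozstd stream compressor; compress
// wraps the cgo call that hands one block to libzstd.
type streamWriter struct {
	buf      []byte
	compress func(block []byte) error
}

func (w *streamWriter) Write(p []byte) (int, error) {
	// Fast path: nothing buffered and the chunk is at least one full
	// buffer's worth, so compress it directly and skip the copy.
	if len(w.buf) == 0 && len(p) >= cap(w.buf) {
		if err := w.compress(p); err != nil {
			return 0, err
		}
		return len(p), nil
	}
	// Slow path: fill the internal buffer as before.
	written := 0
	for len(p) > 0 {
		n := copy(w.buf[len(w.buf):cap(w.buf)], p)
		w.buf = w.buf[:len(w.buf)+n]
		p = p[n:]
		written += n
		if len(w.buf) == cap(w.buf) {
			if err := w.compress(w.buf); err != nil {
				return written, err
			}
			w.buf = w.buf[:0]
		}
	}
	return written, nil
}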

valyala avatar Jun 04 '19 12:06 valyala

I think I found the main issue that explains the performance and compressed-size differences: the readahead package implements the WriteTo method, which is short-circuited inside io.Copy, so it tends to pass big chunks of read-ahead data (up to 1MB by default) to compressor.Write. DataDog/zstd passes these chunks directly to C.ZSTD_compressContinue on each Write call, while valyala/gozstd copies them into a smaller internal buffer (its size is defined in the zstd sources as ZSTD_CStreamInSize() => ZSTD_BLOCKSIZE_MAX => 1<<17, i.e. 8 times smaller than the buffer passed to Write) and calls C.ZSTD_compressStream on the internal buffer. This results in 8x more cgo calls.
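For reference, io.Copy's short-circuit looks roughly like this (a simplified paraphrase of the standard library, not the real source):

package sketch

import "io"

func copySimplified(dst io.Writer, src io.Reader) (int64, error) {
	// If the source can write itself out, io.Copy delegates to it; the
	// readahead reader takes this path and hands ~1MB chunks to dst.Write.
	if wt, ok := src.(io.WriterTo); ok {
		return wt.WriteTo(dst)
	}
	// Otherwise io.Copy moves data through a small 32KB buffer.
	buf := make([]byte, 32*1024)
	var written int64
	for {
		n, rerr := src.Read(buf)
		if n > 0 {
			wn, werr := dst.Write(buf[:n])
			written += int64(wn)
			if werr != nil {
				return written, werr
			}
		}
		if rerr == io.EOF {
			return written, nil
		}
		if rerr != nil {
			return written, rerr
		}
	}
}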

I'm unsure about changing the internal buffer size inside valyala/gozstd, since the zstd docs say that ZSTD_CStreamInSize() returns the recommended size for the input buffer:

ZSTDLIB_API size_t ZSTD_CStreamInSize(void);    /**< recommended size for input buffer */

valyala avatar Jun 04 '19 15:06 valyala

Probably ZSTD_CStreamInSize() should return a bigger value, since a bigger buffer results in better compression speed and ratio? cc'ing @Cyan4973 for input.

valyala avatar Jun 04 '19 16:06 valyala

ZSTD_compressStream() makes all the necessary buffering internally. It will try to consume as much as it can, and produce as much as possible, then report these quantities to the caller. Whatever doesn't qualify as a full block just gets buffered internally, and waits for more input (or for flush()).

This is very different from ZSTD_compressContinue(), which acts immediately, consuming all the input, forcing the block limits at the immediate end of input, and failing with an error code if it can't find enough space for the output. It's also dependent on the caller correctly buffering past data, which is a major contract complexity. ZSTD_compressContinue() is designed for memory-constrained devices and systems, and will likely never reach "stable" API status.

ZSTD_compressStream() is the officially supported streaming API of Zstandard. It's compatible with advanced parameters. In this context, ZSTD_CStreamInSize() is just a recommended size. It's in no way required to follow it, the streaming API ZSTD_compressStream() accepts any input size, including very large ones.

The "recommended" size should be understood in the context of a C caller. It doesn't consider the cost of crossing dynamic language interfaces, such as cgo or jni. For such cases, a major performance rule is to reduce the number of domain traversal to a minimum, hence favor very large input and output buffers, as much as practical.

Cyan4973 avatar Jun 04 '19 16:06 Cyan4973

Yann, thanks for the explanation! I'll add the ability to configure internal buffer size for streaming compression by providing new API in valyala/gozstd for this use case.

@klauspost , could you run fuzz tests for valyala/gozstd like you did for DataDog/zstd before starting to use it?

valyala avatar Jun 04 '19 16:06 valyala

I seem to be getting this error when I try to fuzz it:

e:\gopath\src\github.com\klauspost\compress\zstd>go-fuzz-build -tags=decompress github.com/klauspost/compress/zstd
failed to execute go build: exit status 2
# github.com/valyala/gozstd
gcc: error: e:/temp/wintemp/go-fuzz-build292968055/gopath/src/github.com/valyala/gozstd/libzstd_windows_amd64.a: No such file or directory

Before I start digging - do you know how I fix this?

klauspost avatar Jun 04 '19 16:06 klauspost

It looks like Windows support is broken, since nobody has used it since it was added :( @zhuyie , could you look into Windows support for github.com/valyala/gozstd?

In the meantime, try running make clean libzstd.a from the root of the repo. It should build the missing file. See the FAQ for details.

valyala avatar Jun 04 '19 17:06 valyala

I think it is specific to go-fuzz (notice how it moves the sources); the ordinary build process works fine.

It is basically running this script:

https://github.com/klauspost/compress/blob/master/zstd/testdata/fuzz-decompress.cmd

Here is the fuzzing function:

https://github.com/klauspost/compress/blob/master/zstd/fuzz_decompress.go
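In outline it does something like this (a sketch, not the linked file; it assumes the one-shot APIs gozstd.Decompress and (*zstd.Decoder).DecodeAll):

package fuzz

import (
	"bytes"

	"github.com/klauspost/compress/zstd"
	"github.com/valyala/gozstd"
)

// A shared decoder; NewReader(nil) is the documented pattern for
// DecodeAll-only use in this package.
var dec, _ = zstd.NewReader(nil)

func Fuzz(data []byte) int {
	got, gotErr := dec.DecodeAll(data, nil)
	want, wantErr := gozstd.Decompress(nil, data)
	// Strictness differs between decoders on some inputs, so a real
	// target may need to loosen these checks.
	if (gotErr == nil) != (wantErr == nil) {
		panic("decoders disagree on validity")
	}
	if gotErr != nil {
		return 0 // invalid frame: not interesting for the corpus
	}
	if !bytes.Equal(got, want) {
		panic("decoders disagree on output")
	}
	return 1 // valid frame: prioritize for further mutation
}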

Just replace the import and you should be good to go. You can also add extra fuzz inputs: the corpora from https://github.com/facebook/zstd/releases/tag/fuzz-corpora go in https://github.com/klauspost/compress/tree/master/zstd/testdata/fuzz-decompress/corpus

klauspost avatar Jun 04 '19 17:06 klauspost

Testing the compressor is kinda pointless, as you might as well do that from your own package (and go-fuzz doesn't help much since it is C code, so it cannot track coverage).

Comparing decompressors, there will be differences. Some values are up to the implementation, and I am not as strict on some other values (which I should fix, but it's not super important).

klauspost avatar Jun 04 '19 17:06 klauspost

@klauspost , could you re-run benchmarks against gozstd v1.5.1? It contains an updated libzstd_windows_amd64.a file built from zstd 1.4.0. Previously this file was built from year-old zstd code, which may explain the performance difference.

Additionally, could you provide a guideline on how to run these benchmarks, so next time I could run them myself?

valyala avatar Jun 23 '19 10:06 valyala

@valyala With 1.5.1 I see this package mostly delivering the same performance as "valyala/gozstd", except in highly repetitive loads for streams.

zskp: this package. Level 1 = Fastest, Level 2 = Default
zstd: "valyala/gozstd". Level passed as-is

file	out	level	insize	outsize	millis	mb/s
enwik9	zskp	1	1000000000	343833033	5968	159.80
enwik9	zskp	2	1000000000	317838745	8252	115.57
enwik9	zstd	1	1000000000	358072021	5646	168.91
enwik9	zstd	3	1000000000	313734672	8297	114.94

file	out	level	insize	outsize	millis	mb/s
gob-stream	zskp	1	1911399616	234981983	5275	345.50
gob-stream	zskp	2	1911399616	208669457	6829	266.89
gob-stream	zstd	1	1911399616	249810424	4968	366.85
gob-stream	zstd	3	1911399616	208192146	6234	292.40

file	out	level	insize	outsize	millis	mb/s
consensus.db.10gb	zskp	1	10737418240	4076120264	43277	236.62
consensus.db.10gb	zskp	2	10737418240	3992225762	71119	143.98
consensus.db.10gb	zstd	1	10737418240	4167622462	43778	233.90
consensus.db.10gb	zstd	3	10737418240	3962744141	74509	137.43

file	out	level	insize	outsize	millis	mb/s
adresser.json	zskp	1	7983034785	228519697	12536	607.26
adresser.json	zskp	2	7983034785	253567362	15446	492.89
adresser.json	zstd	1	7983034785	218077590	8297	917.59
adresser.json	zstd	3	7983034785	239094031	10702	711.38

I am using this for testing (no warranty) https://gist.github.com/klauspost/d98da4c469ac30c47c3b0405d84e3828

I use it in a script like this:

go build compress.go
compress -in=%1 -out=* -stats -header=false -w="gzkp" -l=0

compress -in=%1 -out=* -stats -header=true -w="zskp" -l=1 >>results.txt
compress -in=%1 -out=* -stats -header=false -w="zskp" -l=2 >>results.txt
compress -in=%1 -out=* -stats -header=false -w="zstd" -l=1 >>results.txt
compress -in=%1 -out=* -stats -header=false -w="zstd" -l=3 >>results.txt

The first line is to load the file into cache.

klauspost avatar Jul 11 '19 09:07 klauspost

Actually, testing with DataDog/zstd also yielded similar (lower) results:

file	out	level	insize	outsize	millis	mb/s
enwik9	zskp	1	1000000000	343833033	5972	159.69
enwik9	zskp	2	1000000000	317838745	8185	116.51
enwik9	zstd	1	1000000000	392429215	6403	148.94
enwik9	zstd	3	1000000000	368760828	7754	122.99

This seems to be down to this addition, which removes the io.WriterTo interface from the input.

So it seems that when the input doesn't supply io.WriterTo and io.Copy is used, the performance of the DataDog package takes a big hit. In most cases, then, this package performs similarly to the cgo packages. I will re-run benchmarks on your package and post them below.
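To reproduce that effect, one can hide a reader's io.WriterTo from io.Copy by wrapping it (a standard trick; the wrapper only exposes Read):

package sketch

import "io"

// onlyReader hides any io.WriterTo on the wrapped reader, so io.Copy
// falls back to its internal 32KB buffer instead of short-circuiting.
type onlyReader struct{ io.Reader }

// Usage: io.Copy(compressor, onlyReader{src}) instead of io.Copy(compressor, src).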

Edit: It doesn't seem like your package gains the same speedup. Re-adding the interface doesn't provide any speedup:

file	out	level	insize	outsize	millis	mb/s
enwik9	zskp	1	1000000000	343833033	6008	158.71
enwik9	zskp	2	1000000000	317838745	8242	115.71
enwik9	zstd	1	1000000000	358072021	5661	168.46
enwik9	zstd	3	1000000000	313734672	8533	111.76

file	out	level	insize	outsize	millis	mb/s
gob-stream	zskp	1	1911399616	234981983	5329	342.06
gob-stream	zskp	2	1911399616	208669457	6865	265.53
gob-stream	zstd	1	1911399616	249810424	5014	363.48
gob-stream	zstd	3	1911399616	208192146	6421	283.89

file	out	level	insize	outsize	millis	mb/s
consensus.db.10gb	zskp	1	10737418240	4076120261	43303	236.47
consensus.db.10gb	zskp	2	10737418240	3992225759	70478	145.29
consensus.db.10gb	zstd	1	10737418240	4167622462	43689	234.38
consensus.db.10gb	zstd	3	10737418240	3962744141	90530	113.11

Links to the test files are here: https://github.com/klauspost/compress/pull/134

If you have a test set (preferably a few GB), either as a single stream or as a collection of files, I could test against your specific payload type. You can upload files here: https://goobox.io/#/upload

klauspost avatar Jul 11 '19 10:07 klauspost

Here's another reason not to use cgo by default for builds: the playground won't work, since gcc isn't installed:

https://play.golang.org/p/AYmVJDSWqUV

# github.com/DataDog/zstd
exec: "gcc": executable file not found in $PATH

This is unfortunate, since the playground now has proper module support, so it could be used to share example programs with badger.

mvdan avatar Jun 30 '20 17:06 mvdan