go-snappystream
Potential benchmark data source: Silesia corpus
I was reading about lz4 and noticed they use a specific dataset for the published benchmarks.
It looks like a good set of data files to benchmark with, but it's probably too big to include in the repository, and some of the files may have licensing restrictions.
The snappy-go benchmarks use the flag package and download their benchmark files. We could do the same thing here with the Silesia corpus.
What do you think, @mreiferson?
Hmmmm, I'm not sure I see much benefit to using a different dataset than snappy-go, as really all we want to test in this package is how much additional overhead we're creating, right?
Those seem like a decent set of files too. They are fairly limited in size; all less than 1 MB, it seems. IIRC there's a hard max on the size of a snappy block (aside from framing format and machine limitations). Small files are good for repo size and data transfer. But in the last comment I read from you, you mentioned something about long streams...
Comparing overhead seems a bit like apples and oranges to me. A framed/streaming compression format can encode things a block-compression format cannot. The comparison would be interesting, but to me it doesn't seem like it would help much with performance optimization. I could very well be missing something, but in my mind data variety is more important.