go-snappystream

Potential benchmark data source: Silesia corpus

Open bmatsuo opened this issue 10 years ago • 2 comments

I was reading about lz4 and noticed they use a specific dataset for the published benchmarks.

It looks like a good set of data files to benchmark with, but it's probably too big to include in the repository, and some of the files may have licensing restrictions.

The snappy-go benchmarks use the flag package and download their benchmark files. We could do the same thing here with the Silesia corpus.

What do you think, @mreiferson?

bmatsuo, Sep 07 '14 15:09

Hmmmm, I'm not sure I see much benefit to using a different dataset than snappy-go, since really all we want to test in this package is how much additional overhead we're creating, right?

mreiferson, Sep 08 '14 16:09

Those seem like a decent set of files too. They are fairly limited in size; all less than 1MB, it seems. IIRC there's a hard max on the size of a snappy block (aside from framing format and machine limitations). Small files are good for repo size and data transfer. But in the last comment I read from you, you mentioned something about long streams...

Comparing overhead seems a bit like apples and oranges to me. A framed/streaming compression format can encode things a block-compression format cannot. It would be interesting to measure, but to me it doesn't seem to offer much benefit for performance optimization. I could very well be missing something, but in my mind data variety is more important.

bmatsuo, Sep 09 '14 05:09