fst Benchmark on well-known data sets


And add a section on benchmarking on https://fstpackage.github.io

Jan 12 '17 19:01 MarcusKlik

I believe the best way to approach an extensive benchmark suite would be to publish code to generate a number of datasets that differ in their characteristics. Each code snippet should generate a single-column dataset. Examples of datasets that differ significantly in the resulting serialization speed:

random integer column
sequential integer column (e.g. 1:1000000)
integer column with many NA values
integer column in limited range (e.g. all values between -100 and 100)
only positive random integers
doubles generated with runif
doubles representing monetary values (e.g. generated with sample(1:x, n) / 100)
double column that only has limited number of distinct values
character column with limited number of distinct values (e.g. 'TRUE', 'FALSE', 'NA')
character column with short / medium / long strings
character vector with special UTF8 characters
logicals with 90 percent TRUE and 10 percent FALSE
random logicals

and many more. Performance varies a lot between all of these types. When measuring the compression and serialization performance of particular software, the first question that should be answered is 'what data are you actually compressing / serializing?'. Standard 'text-oriented' benchmark datasets like the Silesia compression corpus are not very relevant to data science and would not accurately depict the performance of packages like fst, feather or data.table (fread / fwrite).
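To make the list above concrete, here is a rough sketch in base R of what such single-column generators could look like. All names (generators, random_strings, the column labels) are made up for illustration, and the value ranges, string lengths and the upper bound standing in for x are assumptions rather than anything the suite prescribes.

```r
# Sketch only: one generator per column type; every name here is hypothetical.
set.seed(42)  # fixed seed so repeated benchmark runs see the same data

int_max <- .Machine$integer.max

# Helper (hypothetical): n random lowercase strings of a fixed length.
random_strings <- function(n, len) {
  vapply(seq_len(n),
         function(i) paste(sample(letters, len, replace = TRUE), collapse = ""),
         character(1))
}

generators <- list(
  # random integers over (almost) the full signed 32-bit range
  int_random       = function(n) as.integer(runif(n, -int_max, int_max)),
  # sequential integers, e.g. 1:1000000
  int_sequential   = function(n) seq_len(n),
  # integers where half of the values are NA (the fraction is an assumption)
  int_with_na      = function(n) {
    x <- sample.int(1000L, n, replace = TRUE)
    x[sample.int(n, n %/% 2)] <- NA
    x
  },
  # integers in a limited range
  int_small_range  = function(n) sample(-100:100, n, replace = TRUE),
  # only positive random integers
  int_positive     = function(n) sample.int(int_max, n, replace = TRUE),
  # doubles from runif
  dbl_uniform      = function(n) runif(n),
  # monetary-style doubles with two decimals (1e6 stands in for x)
  dbl_monetary     = function(n) sample(1:1e6, n, replace = TRUE) / 100,
  # doubles with only a few distinct values
  dbl_few_distinct = function(n) sample(c(0.5, 1.5, 2.5), n, replace = TRUE),
  # character column with a few distinct values
  chr_few_distinct = function(n) sample(c("TRUE", "FALSE", "NA"), n, replace = TRUE),
  # short / medium / long strings (lengths are assumptions)
  chr_short        = function(n) random_strings(n, 3),
  chr_medium       = function(n) random_strings(n, 20),
  chr_long         = function(n) random_strings(n, 100),
  # strings with non-ASCII UTF-8 characters, written as escapes for portability
  chr_utf8         = function(n) sample(c("\u00e9", "\u00fc", "\u20ac", "\u65e5\u672c\u8a9e"),
                                        n, replace = TRUE),
  # logicals with 90 percent TRUE and 10 percent FALSE
  lgl_skewed       = function(n) sample(c(TRUE, FALSE), n, replace = TRUE, prob = c(0.9, 0.1)),
  # random logicals
  lgl_random       = function(n) sample(c(TRUE, FALSE), n, replace = TRUE)
)

# Generate every column at a small size and inspect the storage types.
columns <- lapply(generators, function(gen) gen(1e4))
print(vapply(columns, typeof, character(1)))
```

Each vector could then be wrapped in a one-column data.frame and timed separately, for example with fst's write_fst and read_fst, so that results are reported per column type rather than as one aggregate number.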

May 29 '17 21:05 MarcusKlik

I will definitely use fst to construct the benchmark suite that I am building for Julia!!

https://github.com/xiaodaigh/data_manipulation_benchmarks

Oct 12 '17 03:10 xiaodaigh

Hi @xiaodaigh, it would be great to have a fst port for Julia (and Python). The core of fst is now C++ only, so it should be straightforward to write a wrapper for other platforms. For example, I'm using a pure C++ wrapper around fst for testing purposes.

But at the moment I'm concentrating on getting the fst file format stable and ready for future expansions (like data hashes, key tables, and row and column binding), so I won't be spending time on ports to other languages just yet.

Your benchmark suite looks very interesting; it would be great to have fst compared with other (cross-language) packages. If you need any help with that, please let me know!

Oct 12 '17 21:10 MarcusKlik
