db-benchmark
new task: read
A data-reading benchmark is on the roadmap. It should cover:
- reading CSV: the most portable tabular data format, to cover transferring data between different solutions
- reading binary formats: mostly solution-specific formats, to cover transferring data within the same solution
- data of numeric fields only (integers and floats)
- data with 50% categorical fields (integers, floats and categoricals)
- character fields
- date, time and datetime fields
Ideas for testing particular features (maybe as advanced questions?):
- top N rows
- particular rows
- particular columns
- ?
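To make the three feature ideas above concrete, here is a minimal sketch using only Python's standard `csv` module. The `SAMPLE` data and the helper names (`read_top_n`, `read_rows`, `read_columns`) are hypothetical and only illustrate what each question would measure; an actual benchmark would call each solution's own reader instead.

```python
import csv
import io
from itertools import islice

# Hypothetical in-memory CSV standing in for a benchmark file.
SAMPLE = "id,x,y\n1,0.5,a\n2,1.5,b\n3,2.5,c\n4,3.5,d\n"


def read_top_n(text, n):
    """Top N rows: stop parsing after the first n data rows."""
    reader = csv.DictReader(io.StringIO(text))
    return list(islice(reader, n))


def read_rows(text, wanted):
    """Particular rows: keep data rows whose 0-based position is in `wanted`."""
    wanted = set(wanted)
    reader = csv.DictReader(io.StringIO(text))
    return [row for i, row in enumerate(reader) if i in wanted]


def read_columns(text, columns):
    """Particular columns: keep only the named fields of every row."""
    reader = csv.DictReader(io.StringIO(text))
    return [{c: row[c] for c in columns} for row in reader]
```

Note that this naive version still parses every field of every row for the row and column selections; whether a solution can skip that work is exactly what such questions would expose.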
feedback welcome
I collected some feedback about this task from our internal discussion.
Initially I will focus only on reading CSV, not binary formats.
For real-world data, NYT will be a good first case; we should probably find one more popular dataset, so that we have two real-world datasets.
For simulated data:
- shape: long, wide, long and wide (fixed rows*cols?)
- types separately (3 columns of each type): int, double, char, factor, date, datetime
- types mixed (one column of each type)
- cardinality (count of unique values)
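A rough sketch of how such simulated files could be generated, again in plain Python. Everything here is an assumption for illustration: the function names (`make_value`, `simulate_csv`), the parameters (`per_kind`, `cardinality`, `seed`), the value ranges and the column naming are not taken from db-benchmark.

```python
import csv
import io
import random
from datetime import date, datetime, timedelta


def make_value(kind, rng, levels):
    """Draw one random value of the requested column type."""
    if kind == "int":
        return rng.randrange(1_000_000)
    if kind == "double":
        return rng.random()
    if kind == "char":
        return "".join(rng.choice("abcdefghijklmnopqrstuvwxyz") for _ in range(8))
    if kind == "factor":
        # Cardinality is controlled by the number of distinct levels.
        return rng.choice(levels)
    if kind == "date":
        return (date(2000, 1, 1) + timedelta(days=rng.randrange(3650))).isoformat()
    if kind == "datetime":
        start = datetime(2000, 1, 1)
        return (start + timedelta(seconds=rng.randrange(315_360_000))).isoformat()
    raise ValueError(f"unknown column type: {kind}")


def simulate_csv(n_rows, kinds, per_kind=1, cardinality=10, seed=42):
    """Return CSV text with `per_kind` columns for each type in `kinds`."""
    rng = random.Random(seed)  # fixed seed so every solution reads the same data
    levels = [f"lvl{i}" for i in range(cardinality)]
    columns = [(f"{k}{j + 1}", k) for k in kinds for j in range(per_kind)]
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow([name for name, _ in columns])
    for _ in range(n_rows):
        writer.writerow([make_value(k, rng, levels) for _, k in columns])
    return out.getvalue()
```

For example, `simulate_csv(1000, ["int"], per_kind=3)` would correspond to the "types separately" case, and passing all six types with `per_kind=1` to the "types mixed" case; long versus wide shapes follow from `n_rows` and the number of columns.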
relevant issue https://github.com/Rdatatable/data.table/issues/2634