db-benchmark new task: read

Open jangorecki opened this issue 5 years ago • 2 comments

A data-reading benchmark is on the roadmap. It should cover:

reading csv: the most portable tabular data format, to cover transferring data between different solutions
reading binary formats: mostly solution-specific formats, to cover transferring data within the same solution
data of numeric fields only (integers and floats)
data of 50% categorical fields (integers, floats and categoricals)
character fields
date, time and datetime fields
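
As a rough illustration of the csv versus binary comparison listed above, here is a minimal timing sketch using only the Python standard library. Everything in it is an assumption made for illustration: the helper names (`time_read`, `read_csv`, `read_pickle`), the file names, the tiny numeric-only table, and the use of pickle as a stand-in for a solution-specific binary format. It is not the harness db-benchmark uses.

```python
import csv
import os
import pickle
import tempfile
import time


def time_read(read_fn, path, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` calls of read_fn(path)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        read_fn(path)
        best = min(best, time.perf_counter() - start)
    return best


def read_csv(path):
    # Portable text format: every field comes back as a string.
    with open(path, newline="") as f:
        return list(csv.reader(f))


def read_pickle(path):
    # Stand-in for a solution-specific binary format (assumption).
    with open(path, "rb") as f:
        return pickle.load(f)


# Tiny numeric-only table (integers and floats), written in both formats.
rows = [[i, i * 0.5] for i in range(1000)]
tmpdir = tempfile.mkdtemp()
csv_path = os.path.join(tmpdir, "numeric.csv")
bin_path = os.path.join(tmpdir, "numeric.pkl")

with open(csv_path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "value"])
    writer.writerows(rows)
with open(bin_path, "wb") as f:
    pickle.dump(rows, f)

timings = {
    "csv": time_read(read_csv, csv_path),
    "binary": time_read(read_pickle, bin_path),
}
print(timings)
```

Taking the best of several repeats is one common way to reduce noise from disk caching; a real benchmark would also need much larger files and a decision about cold versus warm reads.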

ideas for testing particular features (maybe advanced questions?)

feedback welcome

Feb 01 '20 09:02 jangorecki

I collected some feedback about this task from our internal discussion.

Initially I will focus only on reading csv, not binary formats.

For real-world data, NYT will be a good first case; we should probably find one more popular dataset, so that we have two real-world datasets.

For simulated data:

shape: long, wide, long and wide (fixed rows*cols?)
types separately (3 columns of each type): int, double, char, factor, date, datetime
types mixed (one column of each type)
cardinality (count of unique values)
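
To make the dimensions above concrete, here is a minimal sketch of a generator for simulated csv data, standard library only. The function names (`make_column`, `simulate_csv`), the column naming scheme, the value pools and the sizes are all hypothetical choices for illustration. Since csv has no factor type, the "factor" column is written as plain text here and differs from "char" only by its label prefix.

```python
import csv
import io
import random
from datetime import date, datetime, timedelta


def make_column(kind, nrow, cardinality, rng):
    """Return nrow values of `kind`, sampled from a pool of `cardinality` distinct values."""
    if kind == "int":
        pool = list(range(cardinality))
    elif kind == "double":
        pool = [i + 0.5 for i in range(cardinality)]
    elif kind == "char":
        pool = [f"id{i:06d}" for i in range(cardinality)]
    elif kind == "factor":
        # csv cannot mark a column as categorical; this is just text.
        pool = [f"lvl{i}" for i in range(cardinality)]
    elif kind == "date":
        pool = [(date(2020, 1, 1) + timedelta(days=i)).isoformat()
                for i in range(cardinality)]
    elif kind == "datetime":
        pool = [(datetime(2020, 1, 1) + timedelta(seconds=i)).isoformat()
                for i in range(cardinality)]
    else:
        raise ValueError(f"unknown column kind: {kind}")
    # Sampling with replacement: at most `cardinality` unique values appear.
    return [rng.choice(pool) for _ in range(nrow)]


def simulate_csv(kinds, nrow, cardinality, seed=0):
    """Return csv text with one column per entry in `kinds` and `nrow` data rows."""
    rng = random.Random(seed)  # fixed seed so the same file can be regenerated
    columns = [make_column(k, nrow, cardinality, rng) for k in kinds]
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow([f"{k}{i}" for i, k in enumerate(kinds, start=1)])
    writer.writerows(zip(*columns))
    return buf.getvalue()


TYPES = ["int", "double", "char", "factor", "date", "datetime"]

# "types mixed": one column of each type.
mixed = simulate_csv(TYPES, nrow=100, cardinality=10)

# "types separately": three columns of a single type.
separate_int = simulate_csv(["int"] * 3, nrow=100, cardinality=10)
```

Shape (long, wide, or both) would then be controlled through `nrow` and the length of `kinds`, and cardinality through the size of the value pool.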

Feb 11 '21 10:02 jangorecki

relevant issue https://github.com/Rdatatable/data.table/issues/2634

Mar 06 '21 11:03 MichaelChirico