new task: read

Open jangorecki opened this issue 5 years ago • 2 comments

A data-reading benchmark is on the roadmap. It should cover:

  • reading CSV, the most portable tabular data format, to cover transferring data between different solutions
  • reading binary formats, which are mostly solution-specific, to cover transferring data within the same solution
  • data with numeric fields only (integers and floats)
  • data with 50% categorical fields (integers, floats and categoricals)
  • character fields
  • date, time and datetime fields

Ideas for testing particular features (maybe as advanced questions?):

  • top N rows
  • particular rows
  • particular columns
  • ?
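
To make the three partial-read ideas above concrete, here is a minimal sketch using only Python's standard `csv` module. It is not tied to any solution in the benchmark; the helper names (`read_top_n`, `read_rows`, `read_columns`) and the sample data are hypothetical and exist only for illustration.

```python
import csv
import io
from itertools import islice


def read_top_n(f, n):
    """Return the header and the first n data rows."""
    reader = csv.reader(f)
    header = next(reader)
    return header, list(islice(reader, n))


def read_rows(f, row_indices):
    """Return the data rows at the given 0-based positions."""
    wanted = set(row_indices)
    reader = csv.reader(f)
    next(reader)  # skip the header line
    return [row for i, row in enumerate(reader) if i in wanted]


def read_columns(f, names):
    """Return only the named columns, one list per data row."""
    reader = csv.DictReader(f)
    return [[row[name] for name in names] for row in reader]


# Hypothetical sample data, not taken from any benchmark dataset.
SAMPLE = "id,x,y\n1,a,0.5\n2,b,1.5\n3,c,2.5\n4,d,3.5\n"

header, top2 = read_top_n(io.StringIO(SAMPLE), 2)
picked = read_rows(io.StringIO(SAMPLE), [0, 3])
cols = read_columns(io.StringIO(SAMPLE), ["id", "y"])
```

Only the top-N read can stop early here. The row and column selections still scan the whole file, which is exactly the kind of difference between solutions that such questions would expose.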

Feedback welcome.

jangorecki avatar Feb 01 '20 09:02 jangorecki

I collected some feedback about this task from our internal discussion.

Initially I will focus only on reading CSV, not binary formats.

For real-world data, NYT will be a good first case; we should probably find one more popular dataset so that we have two real-world datasets.

For simulated data:

  • shape: long, wide, long and wide (fixed rows*cols?)
  • types separately (3 columns of each type): int, double, char, factor, date, datetime
  • types mixed (one column of each type)
  • cardinality (count of unique values)
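
As a rough illustration of how the simulated-data dimensions above (shape, column types, cardinality) could be parameterised, here is a sketch of a CSV generator using only the Python standard library. The function names, value pools, and column naming are assumptions made for this example, not the benchmark's actual data generator. Since CSV has no factor type, a "factor" column is written here as plain strings.

```python
import csv
import io
import random
from datetime import date, datetime, timedelta


def make_column(kind, n_rows, cardinality, rng):
    """Draw n_rows values of one type from a pool of `cardinality` unique values."""
    if kind == "int":
        pool = list(range(cardinality))
    elif kind == "double":
        pool = [i + 0.5 for i in range(cardinality)]
    elif kind == "char":
        pool = [f"id{i:06d}" for i in range(cardinality)]
    elif kind == "factor":
        pool = [f"lvl{i}" for i in range(cardinality)]
    elif kind == "date":
        pool = [(date(2020, 1, 1) + timedelta(days=i)).isoformat()
                for i in range(cardinality)]
    elif kind == "datetime":
        pool = [(datetime(2020, 1, 1) + timedelta(seconds=i)).isoformat()
                for i in range(cardinality)]
    else:
        raise ValueError(f"unknown column type: {kind}")
    return [rng.choice(pool) for _ in range(n_rows)]


def write_csv(f, kinds, n_rows, cardinality, seed=0):
    """Write a CSV with one column per entry in `kinds`."""
    rng = random.Random(seed)  # fixed seed keeps the output reproducible
    columns = [make_column(k, n_rows, cardinality, rng) for k in kinds]
    writer = csv.writer(f)
    writer.writerow([f"{k}_{j}" for j, k in enumerate(kinds)])
    writer.writerows(zip(*columns))


# "Types mixed" case: one column of each type, 100 rows, 5 unique values each.
KINDS = ["int", "double", "char", "factor", "date", "datetime"]
buf = io.StringIO()
write_csv(buf, KINDS, n_rows=100, cardinality=5, seed=1)
rows = list(csv.reader(io.StringIO(buf.getvalue())))
```

The "types separately" case would pass three columns of a single type instead, for example `["int"] * 3`, and long versus wide shapes follow from varying `n_rows` against `len(kinds)`.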

jangorecki avatar Feb 11 '21 10:02 jangorecki

Relevant issue: https://github.com/Rdatatable/data.table/issues/2634

MichaelChirico avatar Mar 06 '21 11:03 MichaelChirico