db-benchmark icon indicating copy to clipboard operation
db-benchmark copied to clipboard

advanced questions for `join` tests

Open jangorecki opened this issue 7 years ago • 5 comments

Presently join tests are made on 2 integer columns tables, equal size, inner join on single column. It is because it was difficult to achieve good random numbers for 1e10 datasets used before. Now we won't go beyond 1e9 so we can easily use another set of data. Based on the questions we want to answer in this tests we will pick/generate expected datasets. Initial list of queries we might want to test listed below. We need to chose those which we want to have included in first iteration, rest will be left for future extensions. My picks are as follows.

Types of queries:

  • [x] left/right outer
  • [x] inner
  • [ ] full outer
  • [x] join on multiple columns
  • [ ] non-equi join
  • [x] lookup (update on join #24)
  • [ ] lookup from multiple tables
  • [ ] row explosion join (some multiple matches, partial cross join)
  • [ ] cross join (full cartesian product)
  • [ ] temporal join

Types of fields:

  • [x] join on integer column
  • [x] join on factor column(s) #21

Sizes of datasets:

  • [x] big to big (1e9-1e9)
  • [x] big to medium (1e9-1e6)
  • [x] big to small (1e9-1e3)

Using different datasets will heavily complicate presenting benchmarks results (as this is another dimension to present on report). We can think how to overcome that. Also we need to wisely choose subset of queries/fields/sizes as my current selection 4*2*3 gives 24 different questions, this multiply by 3 (1e7, 1e8, 1e9) and we have 72 tests. While current groupby tests has only 5*3 = 15 tests. @mattdowle

jangorecki avatar Aug 07 '18 06:08 jangorecki

We will also need a join on multiple columns (similar to multi-column group and sort).

st-pasha avatar Oct 24 '18 17:10 st-pasha

I pushed draft of join questions. Data is 3 id factor (2 unique, 1 dups), 3 id int (2 unique, 1 dups), 1 double. The list of initially discussed on H2O World:

  • [x] inner, singlecol, integer, big-big
  • [x] inner, singlecol, integer, big-medium
  • [x] inner, singlecol, integer, big-small
  • [x] outer, singlecol, integer, big-medium
  • [x] inner, singlecol, factor, big-medium
  • [x] inner, multicol, integer, big-medium
  • [x] inner, singlecol, integer, big-medium, update on join

The list did not covered the cardinality/duplicates. At the current moment all fields used in join have no duplicates. We should consider adding questions for joining on fields that contains duplicates. Data is ready for that. https://github.com/h2oai/db-benchmark/blob/00c8ae2bd8be75ab767fc315f92b2230e36cb62f/join-datagen.R#L26-L34

jangorecki avatar Feb 19 '19 13:02 jangorecki

From the 7 questions proposed above, 5 are going to be categorised as basic, testing mostly scalability, the rest plus 3 extra will be categorised advanced, testing features. For consistency with groupby task, and plotting results with benchplot.

# basic
join to small inner on int
join to medium inner on int
join to medium outer on int
join to medium inner on factor
join to big inner on int
# advanced
join to medium inner on int int
join to medium update on int
join to medium aggregate on int
join to medium rolling on int
something well stressing (row explosion join? non-equi join?)

jangorecki avatar May 03 '19 10:05 jangorecki

note to fix chk produced by spark, juliadf and maybe others. as of now they produce chk having 0 so answers-validation.R script solution_chk check is failing. Workaround has been introduced in https://github.com/h2oai/db-benchmark/blob/39fee2ffdd1ec6efcd3dc0f8ca375b61855f0979/report.R#L51 should be removed when chk amended.

jangorecki avatar Aug 21 '19 10:08 jangorecki

join task for 5 basic questions has been implemented. design of datasets for join is explained in https://github.com/h2oai/db-benchmark/issues/106 as of now join task was not yet added only for clickhouse. remaining items in scope of this issue:

  • [x] add clickhouse #137
  • [x] where possible (spark, pydatatable, dask?) and necessary (1e9) solution could use on-disk data
  • [ ] add 5 advanced questions
join,medium inner on int int,advanced
join,medium update on int,advanced
join,medium aggregate on int,advanced
join,medium rolling on int,advanced
join,big non-equi aggregate on int int int,advanced

jangorecki avatar Oct 31 '19 06:10 jangorecki