datatable
big to big join timings not stable
Pydatatable join can be very fast, but for big-to-big joins the variance in timing is very large. The numeric column headers in the table below are the unix epoch times of the benchmark runs. All timings were made on 1f81e5711b77f93494fa01379d8dd242e4b45cea. The 1e9 timings are on-disk, while the others are in-memory. Numbers are in seconds. (A minimal sketch of the kind of query being timed follows the table.)
|    | in_rows | question               | 1572674172 | 1573178513 | 1573180283 |
|---:|:--------|:-----------------------|-----------:|-----------:|-----------:|
|  1 | 1e7     | small inner on int     |      0.253 |      0.237 |      0.195 |
|  2 | 1e7     | medium inner on int    |      0.291 |      0.286 |      0.292 |
|  3 | 1e7     | medium outer on int    |      0.099 |      0.105 |      0.100 |
|  4 | 1e7     | medium inner on factor |      0.355 |      0.329 |      0.354 |
|  5 | 1e7     | big inner on int       |     12.246 |      4.596 |     11.247 |
|  6 | 1e8     | small inner on int     |      2.051 |      2.009 |      1.982 |
|  7 | 1e8     | medium inner on int    |      3.426 |      3.057 |      3.165 |
|  8 | 1e8     | medium outer on int    |      1.297 |      1.386 |      1.287 |
|  9 | 1e8     | medium inner on factor |      4.132 |      4.226 |      4.226 |
| 10 | 1e8     | big inner on int       |     91.243 |     40.386 |     58.109 |
| 11 | 1e9     | small inner on int     |     35.511 |     36.573 |     36.716 |
| 12 | 1e9     | medium inner on int    |     44.874 |     40.499 |     45.474 |
| 13 | 1e9     | medium outer on int    |     15.163 |     15.463 |     16.067 |
| 14 | 1e9     | medium inner on factor |    170.026 |    168.346 |    165.552 |
| 15 | 1e9     | big inner on int       |         NA |         NA |         NA |
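For context, here is a minimal sketch of the kind of join being timed above, using Python datatable's join API. This is not the benchmark's actual query, and the frame and column names are illustrative only; datatable's `join()` requires the right-hand frame to be keyed and performs a left join.

```python
# Minimal illustration of the pydatatable join API; NOT the db-benchmark
# query itself. Frame and column names here are made up for illustration.
import datatable as dt
from datatable import join

x = dt.Frame(id=[1, 2, 3, 4, 5], v1=[0.1, 0.2, 0.3, 0.4, 0.5])
y = dt.Frame(id=[2, 3, 5], v2=[10, 20, 30])
y.key = "id"                 # join() requires the right frame to be keyed

# datatable's join() is a left join: x rows with no match in y get NA
# in y's columns; an inner-style result would additionally drop them.
ans = x[:, :, join(y)]
print(ans)
```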
I don't think we have to do anything about this, because even when it is slower it is still quite fast, but I am reporting it so it is known and documented in the project repo.
Hmm, it looks like the biggest variation is in the "big inner on int" tests (rows 5 and 10).
Yes, it is a big-to-big join: we join a table of the same size, and 90% of the rows match.
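Roughly, this is the scenario below: a hypothetical, scaled-down sketch (not the benchmark's actual data generator), assuming keys are drawn from a pool about 10% larger than the row count so that roughly 90% of rows find a match, with the join repeated a few times to show the spread of timings.

```python
# Hypothetical, scaled-down reproduction sketch; NOT db-benchmark's data
# generator. Two frames of equal size whose integer keys overlap on
# roughly 90% of rows, joined repeatedly to watch the timing variance.
import random
import time
import datatable as dt
from datatable import join

n = 10**6                                  # scaled down from 1e7/1e8/1e9
pool = range(int(n * 1.1))                 # ~90% of keys end up matching
x = dt.Frame(id=random.sample(pool, n), v1=[1.0] * n)
y = dt.Frame(id=random.sample(pool, n), v2=[2.0] * n)
y.key = "id"                               # right frame must be keyed

for run in range(3):
    t0 = time.time()
    ans = x[:, :, join(y)]
    print(f"run {run + 1}: {time.time() - t0:.3f}s")
```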
Other join queries now also have very unstable timings, possibly caused by #2775.
For example, q2 "medium inner on int":
On 1e9, one run gave 622.36 and 687.774, another 1592.488 and 1306.6.
On 1e8, one run gave 152.617 and 138.237, another 505.987 and 449.31.
Both used the same source (b4f78fbbb7aeee1d22b56cc33f994b7b48d23765).