db-benchmark

reshape task (pivot, unpivot)

Open jangorecki opened this issue 4 years ago • 4 comments

Nice examples by @grantmcdermott can be found at https://grantmcdermott.com/reshape-benchmarks/ and https://grantmcdermott.com/even-more-reshape/

jangorecki avatar Dec 30 '20 07:12 jangorecki

Agree: I think that a reshaping benchmark is an important addition to the list. Happy to add a PR with my examples as-is if that helps? Some quick thoughts/issues:

  • I'd drop the Stata runs — I'm guessing you don't have the license reqs — leaving DT, dplyr, pandas, and DataFrames.jl implementations.
  • My examples are only wide-to-long, but easy enough to add a complement going long-to-wide (i.e. back to the original dataset).
  • My dataset is deliberately sparse (lots of missing obs). Would you want the same thing for this benchmark?
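To make the wide-to-long and long-to-wide round trip on a sparse dataset concrete, here is a minimal sketch in plain Python. It is only an illustration of the semantics, not the code from the linked posts or from any of the benchmarked libraries. The helper names (`make_sparse_wide`, `wide_to_long`, `long_to_wide`), the column names, and the row counts are all hypothetical.

```python
import random

def make_sparse_wide(n_rows, n_measures, missing_frac, seed=0):
    """Build a wide table as a list of dicts; None marks a missing observation."""
    rng = random.Random(seed)
    rows = []
    for i in range(n_rows):
        row = {"id": i}
        for j in range(n_measures):
            is_missing = rng.random() < missing_frac
            row[f"v{j}"] = None if is_missing else rng.randint(0, 99)
        rows.append(row)
    return rows

def wide_to_long(rows, id_col, drop_missing=True):
    """Unpivot: one output record per (id, measure column) pair."""
    long_rows = []
    for row in rows:
        for key, value in row.items():
            if key == id_col:
                continue
            if drop_missing and value is None:
                continue
            long_rows.append({id_col: row[id_col], "variable": key, "value": value})
    return long_rows

def long_to_wide(long_rows, id_col, ids, measure_cols):
    """Pivot back: missing (id, variable) pairs are filled with None."""
    wide = {i: {id_col: i, **{c: None for c in measure_cols}} for i in ids}
    for rec in long_rows:
        wide[rec[id_col]][rec["variable"]] = rec["value"]
    return [wide[i] for i in ids]

# Hypothetical sizes: 1000 rows, 5 measure columns, about 95% missing.
measure_cols = [f"v{j}" for j in range(5)]
wide = make_sparse_wide(1000, 5, missing_frac=0.95)
long_rows = wide_to_long(wide, "id")
back = long_to_wide(long_rows, "id", [r["id"] for r in wide], measure_cols)
```

With missing observations dropped during the unpivot, the long table holds only the observed values, and pivoting it back reproduces the original wide table.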

grantmcdermott avatar Dec 30 '20 20:12 grantmcdermott

@grantmcdermott Thank you for your comment. Yes, Stata needs to be dropped; we stick to open source software. No need for a PR, but some assistance in reviewing the design may eventually be useful.

Ideally, the reshape task should test:

  • melt
  • dcast
  • 95%, 5%, 0% missing
  • different functions applied during dcast
  • multiple columns on id side
  • multiple columns on measure side
  • probably quite a few other features (need to look at common usage patterns on SO)
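As a rough illustration of two items from the list above (different functions applied during dcast, and multiple columns on the id side), here is a minimal plain-Python sketch. It is not data.table's `dcast` or any other library's implementation. The function signature, the `fill` argument, and the sample columns `id1`, `id2`, `variable`, `value` are assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean

def dcast(long_rows, id_cols, variable_col, value_col, agg=sum, fill=None):
    """Long-to-wide reshape that aggregates duplicate (ids, variable) cells.

    Hypothetical helper: `agg` receives the list of values for one cell,
    and `fill` is used for cells with no observations.
    """
    cells = defaultdict(list)
    ids = {}        # dicts are used as insertion-ordered sets
    variables = {}
    for rec in long_rows:
        key = tuple(rec[c] for c in id_cols)
        var = rec[variable_col]
        ids[key] = None
        variables[var] = None
        cells[(key, var)].append(rec[value_col])

    out = []
    for key in ids:
        row = dict(zip(id_cols, key))
        for var in variables:
            values = cells.get((key, var))
            row[var] = agg(values) if values else fill
        out.append(row)
    return out

# Two id columns, with a duplicated (id1, id2, variable) combination.
long_rows = [
    {"id1": "a", "id2": 1, "variable": "x", "value": 1},
    {"id1": "a", "id2": 1, "variable": "x", "value": 3},
    {"id1": "a", "id2": 1, "variable": "y", "value": 10},
    {"id1": "b", "id2": 2, "variable": "x", "value": 5},
]

by_sum = dcast(long_rows, ["id1", "id2"], "variable", "value", agg=sum)
by_mean = dcast(long_rows, ["id1", "id2"], "variable", "value", agg=mean)
by_count = dcast(long_rows, ["id1", "id2"], "variable", "value", agg=len, fill=0)
```

Swapping `agg` changes only how duplicate cells are collapsed, which is the axis a benchmark query with "different functions applied during dcast" would vary.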

All of that needs to be categorized into 5 "basic" and 5 "advanced" queries. So the scope will be much bigger than your posts, yet your posts are a very useful working example to start from.

jangorecki avatar Dec 31 '20 11:12 jangorecki

Sounds good. Lmk if and when you'd like someone to cast an extra eye over the tests.

grantmcdermott avatar Jan 07 '21 17:01 grantmcdermott

I came here to suggest that we also need to benchmark reshaping times. Glad to see others thought the same.

skanskan avatar Mar 20 '21 00:03 skanskan