arkouda
Implement functions required by h2oai db-benchmark
The database-like ops benchmark compares the performance of a dozen or so frameworks on common EDA tasks, all involving groupby or join. Because of its emphasis on groupby/join, this is the most realistic data science benchmark I've seen. Unfortunately, the benchmark runs only on a single node (40 cores, 128 GB), which is not a configuration that plays to arkouda's strengths.
Still, I think it would be good to support the benchmark operations in arkouda (using pandas syntax where possible). Based on some quick measurements I took, I believe arkouda would perform respectably, if it can stay under the memory ceiling. In any case, running this benchmark ourselves would at least suggest areas for us to improve performance and/or memory footprint. Eventually, we could probably submit a very competitive entry, even on a single node.
Even more importantly, I believe extending these benchmarks to a multi-node configuration would be a good basis for comparing arkouda against other distributed-memory frameworks like Spark, at scales where a single node would not suffice. This would, of course, be right in arkouda's wheelhouse.
Operations needed to support the benchmark (each should be a separate issue):
- [ ] #3328
  - e.g. `{col1: agg1, col2: [agg2a, agg2b], ...}`. See pandas usage. Only valid with `df.groupby()` result.
- [x] #1787
- [x] #1778
- [ ] `DataFrame.assign()`
  - takes keyword args, where each arg is a function that takes the dataframe as input and returns a new column, and the name of the keyword arg becomes the name of the new column in the dataframe.
- [ ] `GroupBy.head(n)`
  - takes the first `n` rows from each group in the original dataframe. Only valid with `df.groupby()` result.
- [x] `Series.isna()`
- [ ] Pearson correlation of two columns (not sure what syntax we want for this; see advanced Query 4)
- [x] https://github.com/Bears-R-Us/arkouda/issues/2716
  - `DataFrame.merge(other, on='col', how='left|inner')`
    - could wrap `ak.inner_join()` and `ak.lookup`.
- [ ] Bonus: `GroupBy.maxk(x, k)` and related methods (not strictly necessary, but would greatly simplify the syntax for one of the queries)
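
To illustrate why the bonus item would help, here is a rough pandas sketch of a "top k per group" query. It assumes `maxk(x, k)` means the `k` largest values of column `x` within each group, which this issue does not spell out. The toy data and column names are made up for illustration and are not taken from the benchmark.

```python
import pandas as pd

# Toy data; "id" and "x" are illustrative column names.
df = pd.DataFrame({
    "id": ["a", "a", "a", "b", "b"],
    "x": [5, 1, 9, 4, 7],
})

k = 2

# Without a dedicated method: sort the whole frame descending,
# then keep the first k rows of each group.
top_sorted = (
    df.sort_values("x", ascending=False)
      .groupby("id")
      .head(k)
)

# pandas also offers SeriesGroupBy.nlargest, which returns the same
# values indexed by (group key, original row label).
top_nlargest = df.groupby("id")["x"].nlargest(k)

print(top_sorted)
print(top_nlargest)
```

A `GroupBy.maxk`-style method would presumably let this be written without the global sort, which matters more in a distributed setting than it does in pandas.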
Pandas syntax for each query can be found in the benchmark source code. Feel free to add anything I missed.
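
As a quick reference for the pandas behavior the checklist items refer to, the sketch below runs each operation on a toy frame. It is plain pandas, not arkouda, and the data and column names are invented for illustration; the actual benchmark queries may differ.

```python
import pandas as pd

# Toy data; column names are illustrative, not taken from the benchmark.
df = pd.DataFrame({
    "id": ["a", "a", "b", "b", "b"],
    "v1": [1, 2, 3, 4, 5],
    "v2": [2.0, 4.0, 6.0, 9.0, None],
})

# Dict-style aggregation: one aggregate for v1, a list of aggregates for v2.
# The result has two-level column labels such as ("v2", "max").
agg = df.groupby("id").agg({"v1": "sum", "v2": ["min", "max"]})

# assign(): each keyword names a new column; each value is a function
# that receives the dataframe and returns the column.
assigned = df.assign(v3=lambda d: d["v1"] * 2)

# GroupBy.head(n): the first n rows of each group, in original row order.
first_two = df.groupby("id").head(2)

# Series.isna(): boolean mask of missing values.
missing = df["v2"].isna()

# Pearson correlation of two columns (rows with a missing value are skipped).
r = df["v1"].corr(df["v2"])

# merge(): left and inner joins on a key column.
lookup = pd.DataFrame({"id": ["a", "c"], "label": ["x", "y"]})
left = df.merge(lookup, on="id", how="left")
inner = df.merge(lookup, on="id", how="inner")
```

The left join keeps all five rows of `df` and leaves `label` missing for the `"b"` rows, while the inner join keeps only the two `"a"` rows.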