Relevant dataframe libraries

Open rgommers opened this issue 5 years ago • 7 comments

This issue is meant to collect libraries that we should be aware of and perhaps take into account (data on how their API looks, impact of choices on those libraries, etc.).

See https://github.com/pydata-apis/array-api/issues/3 for relevant array libraries.

rgommers avatar May 21 '20 10:05 rgommers

Added mars and staticframe.

TomAugspurger avatar May 26 '20 20:05 TomAugspurger

Added dexplo and datatable.

dexplo is an interesting one because it's a minimalist design and already adheres to some of the API requirements we've discussed, such as requiring column labels to be strings and unique within a given DataFrame.
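The label requirement mentioned above (column labels must be strings and unique within a DataFrame) can be sketched as a small check. This is only an illustration: `validate_column_labels` is a hypothetical helper, not part of dexplo or of any API discussed in this issue.

```python
from typing import Iterable, List


def validate_column_labels(labels: Iterable[object]) -> List[str]:
    """Hypothetical helper: return the labels as a list if every label is a
    string and no label repeats; raise otherwise."""
    checked: List[str] = []
    seen = set()
    for label in labels:
        # Requirement 1: labels must be strings.
        if not isinstance(label, str):
            raise TypeError(f"column label {label!r} is not a string")
        # Requirement 2: labels must be unique within one DataFrame.
        if label in seen:
            raise ValueError(f"duplicate column label {label!r}")
        seen.add(label)
        checked.append(label)
    return checked
```

For example, `validate_column_labels(["a", "b"])` returns the labels unchanged, while `["a", "a"]` raises `ValueError` and `["a", 1]` raises `TypeError`.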

datatable is aiming to be a Python implementation of the R data.table library.

jack-pappas avatar Jun 16 '20 16:06 jack-pappas

@rgommers, I think the libraries whose implemented methods would be worth comparing are:

  • pandas
  • Dask
  • cuDF
  • Modin
  • Vaex
  • Koalas
  • Mars
  • dexplo
  • Eland

datapythonista avatar Jul 10 '20 09:07 datapythonista

@datapythonista thanks. Could you add some rationale? Why are Mars, dexplo and Eland interesting and some of the other listed libraries not? I think they're all quite small, and at least dexplo and eland seem to be very young with almost no usage and few contributors. So I'd think the main focus should be on the first six libraries in your list?

rgommers avatar Jul 10 '20 14:07 rgommers

That's a good point. What I had in mind was a comparison of what the developers of libraries that copy the pandas API have actually implemented. So I excluded the ones that don't aim for a pandas-like API, and didn't consider their popularity.

I'm not sure whether the outcome will tell us more about how important the developers considered a feature to be, or about how easy it was to implement. But since I expect all the libraries in the list to use the same naming as pandas, I think the comparison should be easy to generate.
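A rough sketch of how such a comparison could be generated, assuming it is enough to compare public attribute names on the DataFrame classes. The helpers `public_api` and `api_coverage` and the two toy classes are hypothetical stand-ins; in practice the reference would be something like `pandas.DataFrame` and the candidates the DataFrame classes of the other libraries.

```python
from typing import Dict, Set


def public_api(cls: type) -> Set[str]:
    """Names of public attributes (no leading underscore) exposed by a class."""
    return {name for name in dir(cls) if not name.startswith("_")}


def api_coverage(reference: type, candidates: Dict[str, type]) -> Dict[str, float]:
    """Fraction of the reference's public names that each candidate also exposes."""
    ref = public_api(reference)
    if not ref:
        raise ValueError("reference class has no public attributes")
    return {
        name: len(public_api(cls) & ref) / len(ref)
        for name, cls in candidates.items()
    }


# Toy stand-ins for illustration only; these are not real library classes.
class ReferenceFrame:
    def head(self): ...
    def merge(self): ...
    def groupby(self): ...
    def pivot(self): ...


class PartialFrame:
    def head(self): ...
    def groupby(self): ...


coverage = api_coverage(ReferenceFrame, {"partial": PartialFrame})
missing = sorted(public_api(ReferenceFrame) - public_api(PartialFrame))
```

Matching on names only says nothing about whether signatures or semantics agree, so a result like this would be a starting point rather than a verdict.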

For Eland, since it's backed by Elastic, there are some things that I would expect to be missing. If we consider that the dataframe API could be used for Ibis-like projects, backed by databases, then there could be some valuable information there.

But in any case, it's no problem at all to leave the last three out. I see value in including them if extracting their APIs isn't too much effort, but the others alone are surely good enough.

datapythonista avatar Jul 10 '20 14:07 datapythonista

scipp, which is conceptually most similar to xarray (with some extra features).

SimonHeybrock avatar Sep 06 '21 13:09 SimonHeybrock

Polars: https://github.com/pola-rs/polars

kkraus14 avatar Sep 06 '21 22:09 kkraus14