
[ENH] Extending pyjanitor to support other data-handling frameworks

Open zbarry opened this issue 6 years ago • 3 comments

Potential candidates include (but would be in no way limited to):

  • xarray
  • vaex
  • Dask

The idea here is that data cleaning and higher-level functions to manipulate data structures are pretty universal endeavors in any kind of analysis. A clean API could be very beneficial for many packages.

xarray already has a pandas-flavor-like accessor interface (which is used by packages such as hvPlot):

http://xarray.pydata.org/en/stable/generated/xarray.register_dataset_accessor.html

This probably needs a little more discussion, especially as it relates to DRY and the modularity of pyjanitor.

zbarry avatar Jul 12 '19 19:07 zbarry

This is definitely a great idea, but we'd really have to see if all the functions would be applicable to parallel computing situations. That's the reason not every Pandas function is implemented in Dask dataframes. For example, median is difficult to compute when split across machines.
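To illustrate the median point, here is a minimal pure-Python sketch (not Dask code). The `partitions` list is made-up data standing in for rows split across machines. It shows that a mean can be rebuilt exactly from small per-partition summaries, while naively combining per-partition medians gives the wrong answer:

```python
from statistics import mean, median

# Hypothetical data, split into three partitions as if on three machines.
partitions = [[1, 2, 3], [4, 5, 100], [6, 7, 8, 9, 10, 11]]
all_values = [x for part in partitions for x in part]

# Mean: each partition only needs to report (sum, count),
# and the partial results combine exactly.
partials = [(sum(p), len(p)) for p in partitions]
total, count = map(sum, zip(*partials))
distributed_mean = total / count

# Median: the median of per-partition medians is NOT the global median.
median_of_medians = median(median(p) for p in partitions)
true_median = median(all_values)

print(distributed_mean, mean(all_values))  # the two means agree
print(median_of_medians, true_median)      # 5 vs 6.5: they disagree
```

An exact median needs either all the data in one place or a multi-pass selection algorithm, which is why distributed frameworks often offer approximate quantiles instead.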

Nothing immediately comes to mind as being problematic in pyjanitor, but we should still consider how each function would be performed in a distributed manner.

szuckerman avatar Jul 12 '19 20:07 szuckerman

Very much agreed for the parallel stuff.

zbarry avatar Jul 12 '19 22:07 zbarry

This is how hvplot modifies Dask dataframes:

https://github.com/pyviz/hvplot/blob/master/hvplot/dask.py

Looks like it’s as simple as a monkey patch to add the .hvplot attribute, as far as I can tell.
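A rough sketch of that kind of monkey patch, in plain Python. Everything here is hypothetical: `FakeDataFrame` stands in for a third-party dataframe class, and `JanitorAccessor` and `clean_names` are illustrative names only. This is not hvplot's actual code, just the general pattern of attaching an accessor attribute to an existing class:

```python
class FakeDataFrame:
    """Hypothetical stand-in for a third-party dataframe class."""

    def __init__(self, columns):
        self.columns = list(columns)


class JanitorAccessor:
    """Hypothetical accessor that groups cleaning methods under one attribute."""

    def __init__(self, df):
        self._df = df

    def clean_names(self):
        # Return a new object with normalized column names.
        cleaned = [c.strip().lower().replace(" ", "_") for c in self._df.columns]
        return type(self._df)(cleaned)


# The monkey patch: attach the accessor as a property on the existing class,
# so every instance gains a `.janitor` attribute built on demand.
FakeDataFrame.janitor = property(JanitorAccessor)

df = FakeDataFrame(["First Name", " Last Name "])
print(df.janitor.clean_names().columns)  # ['first_name', 'last_name']
```

Keeping the methods on a separate accessor class, rather than patching each one onto the dataframe directly, limits the patch to a single attribute and would let the same accessor logic be reused across frameworks.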

zbarry avatar Oct 20 '19 20:10 zbarry