[ENH] Extending pyjanitor to support other data-handling frameworks
Potential candidates include (but would be in no way limited to):
- xarray
- vaex
- Dask
The idea here is that data cleaning and higher-level functions to manipulate data structures are pretty universal endeavors in any kind of analysis. A clean API could be very beneficial for many packages.
xarray already has a pandas-flavor-like accessor interface (used by packages such as hvPlot):
http://xarray.pydata.org/en/stable/generated/xarray.register_dataset_accessor.html
This probably needs a little more discussion, especially regarding DRY and how modular pyjanitor would have to be.
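To make the xarray accessor interface linked above concrete, here is a minimal sketch of registering one. `xr.register_dataset_accessor` is the real xarray decorator; the accessor name `janitor_demo`, the class `JanitorDemoAccessor`, and its `clean_names` method are hypothetical names invented for illustration and are not part of pyjanitor or xarray.

```python
import xarray as xr


# "janitor_demo" is a hypothetical accessor name chosen for this sketch.
@xr.register_dataset_accessor("janitor_demo")
class JanitorDemoAccessor:
    """Illustrative accessor; not an existing pyjanitor API."""

    def __init__(self, xarray_obj):
        # xarray passes the Dataset the accessor is attached to.
        self._obj = xarray_obj

    def clean_names(self):
        """Return a new Dataset with lowercased, underscore-separated variable names."""
        mapping = {}
        for name in self._obj.data_vars:
            new_name = str(name).strip().lower().replace(" ", "_")
            if new_name != name:
                mapping[name] = new_name
        return self._obj.rename(mapping)


ds = xr.Dataset(
    {
        "Air Temp": ("x", [1.0, 2.0]),
        "pressure": ("x", [3.0, 4.0]),
    }
)
cleaned = ds.janitor_demo.clean_names()
```

Whether pyjanitor's functions could share one implementation behind accessors like this, rather than being rewritten per framework, is exactly the DRY question raised above.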
This is definitely a great idea, but we'd really have to check whether every function is applicable in a parallel computing setting. That's the reason not every pandas function is implemented for Dask dataframes. For example, a median is difficult to compute when the data is split across machines, because it cannot be assembled from per-partition results alone.
Nothing in pyjanitor immediately comes to mind as being problematic, but we should still consider how each function would be performed in a distributed manner.
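A small standard-library illustration of the median point, assuming two made-up partitions that stand in for data held on separate machines: a mean can be rebuilt from per-partition sums and counts, but combining per-partition medians does not give the global median.

```python
from statistics import mean, median

# Made-up partitions standing in for chunks of data on separate workers.
partitions = [[1, 2, 3], [4, 5, 6, 7, 100]]
all_values = [value for part in partitions for value in part]

# The mean combines cleanly: each worker only reports (sum, count).
partials = [(sum(part), len(part)) for part in partitions]
combined_mean = sum(s for s, _ in partials) / sum(n for _, n in partials)

# The median does not: the median of per-partition medians is not,
# in general, the median of the full data.
global_median = median(all_values)
median_of_medians = median([median(part) for part in partitions])

print(combined_mean == mean(all_values))   # True
print(global_median, median_of_medians)    # 4.5 4.0
```

This is only meant to show why each function needs a look before assuming it carries over to a partitioned backend.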
Very much agreed for the parallel stuff.
This is how hvplot modifies Dask dataframes:
https://github.com/pyviz/hvplot/blob/master/hvplot/dask.py
Looks like it’s as simple as a monkey patch to add the .hvplot attribute, as far as I can tell.
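For reference, a minimal sketch of that kind of monkey patch, using only the standard library. `FakeFrame` is a hypothetical stand-in for a third-party dataframe class, and `JanitorNamespace`, `patch`, and the `janitor` attribute name are likewise assumptions made for illustration; this is not hvPlot's actual code.

```python
class FakeFrame:
    """Hypothetical stand-in for a third-party dataframe class."""

    def __init__(self, columns):
        self.columns = list(columns)


class JanitorNamespace:
    """Groups cleaning methods under one attribute on the patched class."""

    def __init__(self, frame):
        self._frame = frame

    def clean_names(self):
        # Return a new frame rather than mutating the original.
        cleaned = [c.strip().lower().replace(" ", "_") for c in self._frame.columns]
        return type(self._frame)(cleaned)


def patch(cls, name="janitor"):
    """Attach a property to `cls` that builds the namespace on access."""

    def _get_namespace(self):
        return JanitorNamespace(self)

    setattr(cls, name, property(_get_namespace))


patch(FakeFrame)
frame = FakeFrame([" First Name", "AGE "])
print(frame.janitor.clean_names().columns)  # ['first_name', 'age']
```

The patch itself is one `setattr` call; the harder part is making sure each method behind the namespace behaves sensibly on the target framework.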