pyjanitor
[ENH] Investigate potential integration with modin project
Brief Description
I would like to investigate whether janitor would work well with modin as a backend.
Modin is a (currently Linux/OSX only) replacement for Pandas that purports to be much faster. It has most, but not all, of the pandas API, but for what it has it should be a one-for-one replacement.
This would be just some investigative work to see if I could even get this to work in the first place.
Worthwhile to point this to: #394
Also: https://modin.readthedocs.io/en/latest/architecture.html
This may or may not introduce a complication:
Currently, each partition’s memory format is a pandas DataFrame. In the future, we will support additional in-memory formats for the backend, namely Arrow tables.
Hey @anzelpwj! Greetings from chilly Cambridge :smile:. Wanted to know where you’re at with this issue? I’m just reviewing old issues from the issue tracker at the moment.
Apologies @ericmjl - kid started daycare late August and we've been dealing with the daycare plague and other sundries for the past several months. I'd like to get started again contributing to PyJanitor (was inspired by going to PyDataLA and doing a tiny sprint there), but would definitely understand if someone else is raring to take this ticket. Should hopefully have some time clear up after next week though.
Okay, so I've done a review of the modin documentation, with a particular eye to which methods they state don't have full support. None of our code uses methods that are completely unsupported, which is good. This leaves two other categories:
- Code that "defaults to Pandas" (modin dataframe is converted to pandas dataframe, routine is run, dataframe is converted back to modin. Code will work, but there will be some overhead).
- APIs with partial implementation.
As for the DataFrame APIs we call:
Partial support:
- `append`: "Not fully optimized" (so probably not a concern).
- `fillna`: If a dataframe is given for the fill value, this becomes a default-to-pandas routine (unlikely to be a big problem).
- `query`: Local variables not supported. As best I can tell, since `query` is kind of an `eval` function, it means that if we define a variable in the string it's an issue. Maybe we might want to throw up a warning for people running `filter_on`?
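To make the `query` concern concrete, here is a minimal sketch in plain pandas, assuming that "local variables" refers to pandas' `@name` syntax inside the query string. The helper `warn_on_local_variables` is hypothetical (it is not part of pyjanitor or modin) and only does a crude check for an `@` character.

```python
import warnings

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [10, 20, 30, 40]})
threshold = 2

# pandas resolves `@threshold` from the calling scope.
with_local = df.query("a > @threshold")

# Same filter with the value written into the string itself,
# so no local-variable lookup is needed.
with_literal = df.query(f"a > {threshold}")


def warn_on_local_variables(criteria):
    """Hypothetical helper: warn if a query string seems to use `@name`.

    Crude on purpose: an `@` inside a quoted string literal would also
    trigger the warning.
    """
    has_local = "@" in criteria
    if has_local:
        warnings.warn(
            "Query string references a local variable via '@'; "
            "this may not be supported on a modin backend.",
            UserWarning,
        )
    return has_local
```

Both queries select the same rows in pandas; the second form sidesteps the lookup at the cost of formatting the value into the string by hand.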
Defaults to pandas:
- `assign`
- `combine_first`
- `drop_duplicates`
- `duplicated`
- `to_csv` (and only run in a few tests anyway...)
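As a rough illustration of what "defaults to pandas" means in terms of overhead, here is a toy sketch that uses only pandas. It is not modin's implementation: the "distributed" frame is just a list of pandas row blocks, and `split_rows` and `default_to_pandas` are made-up names for this example.

```python
import pandas as pd


def split_rows(df, n_parts):
    """Toy partitioning: cut a pandas DataFrame into row blocks."""
    size = max(1, -(-len(df) // n_parts))  # ceiling division
    return [df.iloc[i:i + size] for i in range(0, len(df), size)]


def default_to_pandas(partitions, method_name, *args, **kwargs):
    """Gather all partitions, run the pandas method, then split again."""
    full = pd.concat(partitions)                           # convert to pandas
    result = getattr(full, method_name)(*args, **kwargs)   # run the routine
    return split_rows(result, len(partitions))             # convert back


df = pd.DataFrame({"key": [1, 2, 1, 2, 3, 3],
                   "val": ["a", "b", "a", "b", "c", "c"]})
parts = split_rows(df, 3)

# Running the method on each partition separately misses duplicates
# that live in different partitions.
per_partition = [p.drop_duplicates() for p in parts]

# The fallback sees the whole frame, so the result is correct, but every
# row has to be gathered and re-split along the way.
deduped = default_to_pandas(parts, "drop_duplicates")
```

The code still works, as noted above; the cost is the gather and re-split around each such call.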
The biggest difficulty I'm having is seeing if we can create an equivalent to `pandas_flavor` for modin. I may ping the modin project folks on this front...
Okay, sent them a message to their listserv, will see what I hear back.
Got a response!
We do have ways to create dataframe methods, but it's in early stages and still kind of hacky. There's a tutorial here: https://github.com/ucbrise/risecamp/blob/risecamp2019/modin/tutorial_notebooks/exercise_3.ipynb. Let me know if it works for you.
Not sure if it's going to be easier to try and create a pandas/pandas-flavor like fix for Modin or create something on our end now.
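For reference, a "something on our end" approach could look roughly like the sketch below: a decorator that attaches a function as a method on whatever DataFrame class it is given. This is an assumption-laden illustration, not how `pandas_flavor` or modin actually do it; `register_method` and `shout_columns` are hypothetical names, and whether passing `modin.pandas.DataFrame` here would behave correctly is untested.

```python
import pandas as pd


def register_method(df_class):
    """Hypothetical decorator factory: attach a function to `df_class`."""
    def decorator(func):
        # After this, `frame.<func name>(...)` calls `func(frame, ...)`.
        setattr(df_class, func.__name__, func)
        return func
    return decorator


# Registered on pandas here; in principle the same decorator could be
# pointed at another DataFrame class (untested assumption).
@register_method(pd.DataFrame)
def shout_columns(df):
    """Toy method: return a copy with upper-cased column names."""
    return df.rename(columns=str.upper)


df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
result = df.shout_columns()
```

The appeal of parameterizing on the class is that the same function body could be registered for more than one backend, provided the backend's DataFrame supports the pandas calls the function makes.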
Hi everyone, is this issue still in demand, or has it become less relevant for whatever reason? I am trying to understand (a) whether it makes sense to play with modin and (b) whether it makes sense to target pyjanitor in the process of playing with modin.