
[ENH] Investigate potential integration with modin project

Open anzelpwj opened this issue 5 years ago • 7 comments

Brief Description

I would like to investigate whether janitor would work well with modin as a backend.

Modin is a (currently Linux/OSX only) replacement for pandas that purports to be much faster. It implements most, but not all, of the pandas API, and what it does implement should be a one-for-one replacement.

This would be just some investigative work to see if I could even get this to work in the first place.

anzelpwj avatar Aug 20 '19 21:08 anzelpwj

Worthwhile to point this to: #394

Also: https://modin.readthedocs.io/en/latest/architecture.html

This may or may not introduce a complication:

Currently, each partition’s memory format is a pandas DataFrame. In the future, we will support additional in-memory formats for the backend, namely Arrow tables.

zbarry avatar Oct 12 '19 15:10 zbarry

Hey @anzelpwj! Greetings from chilly Cambridge :smile:. Wanted to know where you’re at with this issue? I’m just reviewing old issues from the issue tracker at the moment.

ericmjl avatar Dec 21 '19 00:12 ericmjl

Apologies @ericmjl - kid started daycare late August and we've been dealing with the daycare plague and other sundries for the past several months. I'd like to get started again contributing to PyJanitor (was inspired by going to PyDataLA and doing a tiny sprint there), but would definitely understand if someone else is raring to take this ticket. Should hopefully have some time clear up after next week though.

anzelpwj avatar Dec 21 '19 02:12 anzelpwj

Okay, so I've reviewed the modin documentation, with a particular eye to which methods they state don't have full support (link). None of our code uses methods that are completely unsupported, which is good. This leaves two other categories:

  • Code that "defaults to pandas": the modin dataframe is converted to a pandas dataframe, the routine is run, and the result is converted back to modin. The code will work, but with some overhead.
  • APIs with partial implementation.

As for the DataFrame APIs we call:

Partial support:

  • append: "Not fully optimized" (so probably not a concern).
  • fillna: If a dataframe is given for the fill value, this becomes a default-to-pandas routine (unlikely to be a big problem).
  • query: Local variables not supported. As best I can tell, since query is a kind of eval function, this means that defining a variable inside the query string is an issue. Maybe we should raise a warning for people running filter_on?
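One way the warning idea for query strings could look is sketched below. It assumes that "local variables" refers to pandas' `@name` syntax for pulling a Python variable into a query string, which the thread does not confirm. The helper names `find_local_variables` and `warn_on_local_variables` are hypothetical and not part of pyjanitor or modin.

```python
import re
import warnings
from typing import List

# Matches pandas-style local-variable references such as "@threshold".
# Simplification: "@" characters inside quoted string literals are not skipped.
_LOCAL_VAR = re.compile(r"@([A-Za-z_][A-Za-z0-9_]*)")


def find_local_variables(query: str) -> List[str]:
    """Return the names referenced with the "@name" syntax in a query string."""
    return _LOCAL_VAR.findall(query)


def warn_on_local_variables(query: str) -> bool:
    """Emit a UserWarning if the query uses "@name" references.

    Returns True when at least one reference was found (hypothetical helper).
    """
    names = find_local_variables(query)
    if names:
        warnings.warn(
            "Query references local variables "
            f"({', '.join(names)}); these may not be supported on a modin backend.",
            UserWarning,
            stacklevel=2,
        )
    return bool(names)
```

A string-filtering function could call `warn_on_local_variables(query)` before handing the string to `DataFrame.query`, so that pandas users see no change while modin users get an early hint.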

Defaults to pandas:

  • assign
  • combine_first
  • drop_duplicates
  • duplicated
  • to_csv (and only run in a few tests anyway...)
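The two lists above can be captured in a small lookup table, which makes it easy to check any set of DataFrame methods against them. This is only a sketch: the table is a snapshot of what this comment reports and may be out of date, and the name `group_by_support` and the "not_flagged" label are assumptions made for illustration.

```python
from typing import Dict, Iterable, List

# Support levels as reported in this comment (a snapshot of the modin
# documentation at the time; not verified against current modin releases).
MODIN_SUPPORT: Dict[str, str] = {
    "append": "partial",
    "fillna": "partial",
    "query": "partial",
    "assign": "defaults_to_pandas",
    "combine_first": "defaults_to_pandas",
    "drop_duplicates": "defaults_to_pandas",
    "duplicated": "defaults_to_pandas",
    "to_csv": "defaults_to_pandas",
}


def group_by_support(methods: Iterable[str]) -> Dict[str, List[str]]:
    """Group method names by reported support level.

    Names missing from the table are labelled "not_flagged" (an assumption:
    absence from the table does not prove full support).
    """
    groups: Dict[str, List[str]] = {}
    for name in methods:
        level = MODIN_SUPPORT.get(name, "not_flagged")
        groups.setdefault(level, []).append(name)
    return groups
```

For example, `group_by_support(["assign", "query", "rename"])` separates the one default-to-pandas call and the one partially supported call from the method that was not flagged.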

The biggest difficulty I'm having is working out whether we can create an equivalent to pandas_flavor for modin. I may ping the modin project folks on this front...
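For context, the general idea behind registering a function as a dataframe method can be sketched with plain Python. This is not how pandas_flavor or modin implement it; it is simple attribute patching on a class, and `FrameLike`, `register_method`, and `upper_names` are hypothetical names used only for illustration.

```python
from functools import wraps
from typing import Callable, Iterable


class FrameLike:
    """Hypothetical stand-in for a DataFrame class (pandas or modin)."""

    def __init__(self, columns: Iterable[str]):
        self.columns = list(columns)


def register_method(cls: type) -> Callable:
    """Return a decorator that attaches a function to `cls` as a method."""

    def decorator(func: Callable) -> Callable:
        @wraps(func)
        def method(self, *args, **kwargs):
            # The instance is passed as the first argument, so the same
            # function works both as `func(df)` and as `df.func()`.
            return func(self, *args, **kwargs)

        setattr(cls, func.__name__, method)
        return func

    return decorator


@register_method(FrameLike)
def upper_names(df: FrameLike) -> FrameLike:
    """Toy cleaning step: return a new frame with upper-cased column names."""
    return FrameLike(name.upper() for name in df.columns)
```

With this in place, `FrameLike(["a", "b"]).upper_names()` and `upper_names(FrameLike(["a", "b"]))` behave the same way. The open question in this comment is whether modin's dataframe class can be extended by a comparable mechanism.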

anzelpwj avatar Jan 05 '20 03:01 anzelpwj

Okay, sent a message to their listserv; will see what I hear back.

anzelpwj avatar Jan 05 '20 03:01 anzelpwj

Got a response!

We do have ways to create dataframe methods, but it's in early stages and still kind of hacky. There's a tutorial here: https://github.com/ucbrise/risecamp/blob/risecamp2019/modin/tutorial_notebooks/exercise_3.ipynb. Let me know if it works for you.

Not sure whether it's going to be easier to try to create a pandas/pandas-flavor-like fix for Modin or to create something on our end for now.

anzelpwj avatar Jan 06 '20 03:01 anzelpwj

Hi everyone, is this issue still in demand, or has it become less relevant for whatever reason? I am trying to understand (a) whether it makes sense to play with modin, and (b) whether it makes sense to target pyjanitor in the process of playing with modin.

asmirnov69 avatar Oct 06 '22 06:10 asmirnov69