Modin integration
Is your feature request related to a problem? Please describe.
Modin (https://github.com/modin-project/modin) also enables scaling pandas computation. Since we have ray, dask, and koalas, why not add Modin?
Describe the solution you'd like
Modin requires a replacement of the pandas import in user code to work. We would need to think about how to do this:
- Do we get people to import "pandas" from hamilton, and we can then control which pandas is actually imported?
- Do we require users then to assume modin, by changing the pandas import themselves when defining their hamilton python functions?
- Or is there some other way to integrate? E.g. a graph adapter
Additional context
N/A
Hi @skrawcz, I'm happy to help support this idea.
- Would (1) require Modin to be a hard dependency? Would your users want that type of behavior from hamilton?
- 2 seems much more natural to me. Modin users can be hamilton users in that case, right?
What is required from the Modin side for an integration?
@devin-petersohn
In my ideal world, we could enable people to maintain https://github.com/stitchfix/hamilton/blob/main/examples/hello_world/my_functions.py without touching it.
Reasoning:
- that code is reusable in different contexts very easily.
- if someone wanted to switch to dask, ray, koalas, they "technically" wouldn't have to change that code. Maybe a platform team wants to control this, and the user shouldn't need to?
idea 1
Add this to the top of the file, if people want
from hamilton.augment import pandas as pd
And if modin was detected as being installed, it would be returned, otherwise vanilla pandas would be.
idea 2
We would require users to "hard code" modin -- i.e. replace the pandas import in https://github.com/stitchfix/hamilton/blob/main/examples/hello_world/my_functions.py with modin.
Or do some if else based on an environment variable, or something like that.
Essentially like idea (1) -- but the user controls it. The downside is that they have to know about modin.
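The environment-variable variant of idea 2 might look roughly like the sketch below. The variable name HAMILTON_PANDAS_BACKEND and both helper functions are made up for illustration; nothing like them exists in Hamilton or Modin.

```python
import importlib
import os

# Hypothetical variable name, chosen only for this sketch.
BACKEND_ENV_VAR = "HAMILTON_PANDAS_BACKEND"


def pandas_module_name(environ=None) -> str:
    """Return the name of the module to import as `pd`, based on an environment variable."""
    if environ is None:
        environ = os.environ
    backend = environ.get(BACKEND_ENV_VAR, "pandas").strip().lower()
    if backend == "modin":
        return "modin.pandas"
    # Anything else (including an unset variable) falls back to vanilla pandas.
    return "pandas"


def load_pandas(environ=None):
    """Import and return whichever pandas-compatible module was selected."""
    return importlib.import_module(pandas_module_name(environ))


# In a functions module this would replace `import pandas as pd`:
# pd = load_pandas()
```

The user still has to add this to their functions module, which is the downside noted above: they have to know that modin is an option.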
idea 3
Is there some other python duck typing way (this might be considered hacky), that "driver" or "graph adapter" code would own?
Either one could be OK I think.
That said, I want to try something like the following, in the driver:
sys.modules['pandas'] = modin.pandas
So long as it's the first thing executed (big IF, but it should be), then this should work...
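Mechanically, a sys.modules swap could be wrapped in a small context manager so that the original entry is restored afterwards. This is only a sketch: module_alias is a made-up helper, and a stand-in module takes the place of modin.pandas so the snippet runs without either library installed. Note that further down this thread a plain swap turns out not to be enough for Modin, because Modin itself still needs access to the real pandas.

```python
import contextlib
import importlib
import sys
import types


@contextlib.contextmanager
def module_alias(name: str, replacement: types.ModuleType):
    """Temporarily make `import <name>` resolve to `replacement`."""
    missing = object()
    previous = sys.modules.get(name, missing)
    sys.modules[name] = replacement
    try:
        yield
    finally:
        # Put back whatever was registered before the swap.
        if previous is missing:
            sys.modules.pop(name, None)
        else:
            sys.modules[name] = previous


# Stand-in for modin.pandas; the name "pandas_like" is hypothetical.
fake = types.ModuleType("fake_pandas")
fake.BACKEND = "stand-in"

with module_alias("pandas_like", fake):
    # Imports inside the block see the replacement module.
    imported = importlib.import_module("pandas_like")

swapped_ok = imported is fake
restored_ok = "pandas_like" not in sys.modules
```

Modules that were imported before the swap keep their reference to the original module, which is why the swap would have to be the first thing executed.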
Hmm, that could work. Will have to prototype and see how the ergonomics feel.
And if modin was detected as being installed, it would be returned, otherwise vanilla pandas would be.
Just putting it in here, that we could use https://docs.python.org/3/library/importlib.html#checking-if-a-module-can-be-imported or something like that to check if modin is installed.
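A minimal sketch of how idea 1 could use that check, assuming a hypothetical hamilton.augment module. The helper names is_installed and first_available are invented here; only importlib.util.find_spec and importlib.import_module are real standard-library calls.

```python
import importlib
import importlib.util


def is_installed(module_name: str) -> bool:
    """True if the top-level package of `module_name` can be found, without importing it."""
    # Check only the top-level package: find_spec on a dotted name would
    # import the parent and raise if the parent is missing.
    top_level = module_name.partition(".")[0]
    return importlib.util.find_spec(top_level) is not None


def first_available(*candidates: str):
    """Import and return the first candidate whose top-level package is installed."""
    for name in candidates:
        if is_installed(name):
            return importlib.import_module(name)
    raise ImportError(f"none of {candidates} are installed")


# What a hypothetical hamilton/augment.py could contain:
# pandas = first_available("modin.pandas", "pandas")
```

With that in place, `from hamilton.augment import pandas as pd` would hand back modin.pandas when modin is installed and vanilla pandas otherwise, without modin becoming a hard dependency.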
Thanks @skrawcz and @elijahbenizzy! Is there some way we can help to support this on the Modin side?
Typically the approach I have taken with Modin is that the choice should be the user's and that users should be aware that Modin is being used. We make it easy to not only move to Modin, but also back to pandas from Modin if you choose.
Replacing sys.modules could work, but there may be some considerations here:
- When would you change sys.modules?
  - If it happens on import, and the user doesn't end up using Hamilton after import, they may still end up using Modin thinking it is pandas (weird edge case)
- Would users be able to opt-out of this (or opt-in)?
- If Modin had a configuration that did this for you, could that help?
Thanks @devin-petersohn, comments inline:
When would you change sys.modules?
- If it happens on import, and the user doesn't end up using Hamilton after import, they may still end up using Modin thinking it is pandas (weird edge case)
No, it wouldn't happen on import of hamilton. It would be part of a script/flow of execution:
from hamilton import driver, switch_modin_for_pandas, switch_pandas_for_modin
# do switch here
switch_modin_for_pandas()
# have to import after doing the switch
import func_module
dr = driver.Driver({}, func_module)
df = dr.execute(['a', 'b', ...])
# switch it back
switch_pandas_for_modin()
save_df(df)
Would users be able to opt-out of this (or opt-in)?
The idea is that they opt-in to this, but ideally they don't have to change any of their logic to do so.
If Modin had a configuration that did this for you, could that help?
Potentially. It would enable us to depend on it, rather than having to write and maintain it ourselves.
Ok, this should actually be pretty straightforward as an opt-in utility. I think we can probably start with a hacky approach in Hamilton and then, as it matures, merge it into Modin as a configuration (i.e. modin.config.OverrideAllPandasCalls.enable()). I like the idea of the config, but the thinking here is that we should first make sure it solves the use case. Would that process make sense to you @skrawcz and @elijahbenizzy?
@devin-petersohn sounds good.
Do you happen by chance to have that hacky incantation?
@devin-petersohn okay, so we can't play with sys.modules, because modin still requires access to pandas. So you'd have to provide a means to do it. Created https://github.com/modin-project/modin/issues/4488 to track.
Otherwise I am going to prototype idea 1 and see how that feels.
We are moving repositories! Please see the new version of this issue at https://github.com/DAGWorks-Inc/hamilton/issues/22. Also, please give us a star/update any of your internal links.