pymarginaleffects
pymarginaleffects copied to clipboard
`narwhals` to support pandas as input
Hi @vincentarelbundock - @juanitorduz pointed me towards @MarcoGorelli's narwhals project, which solves the "two DataFrame APIs" problem for developers by allowing to define APIs that are agnostic to the input data frame type.
I.e. one can do things as
import narwhals as nw
import pandas as pd
import polars as pl
def func(df_any):
df = nw.from_native(df_any)
df = df.select(
a_sum=nw.col('a').sum(),
a_mean=nw.col('a').mean(),
a_std=nw.col('a').std(),
)
return nw.to_native(df)
df = pd.DataFrame({'a': [1, 2, 3, 4, 5]})
func(df) # returns pandas DataFrame
df = pl.DataFrame({'a': [1, 2, 3, 4, 5]})
func(df) # returns polars DataFrame
In other words - your polars code will work on pandas inputs, without requiring pandas as a dependency! In fact, narwhals even allows you to drop the polars dependency if you wanted to.
Happy to try myself at a PR for this once I find the time =)
Ooooh that sounds like exactly what we need.
I won't have much (if any) time to write code for this in the next couple months, but I'd be more than happy to review a PR, especially if the changes are pretty minimal.
Thanks for raising this issue!
@artiom-matvei you could take a look at this if you have time. Seems simple enough, and may not require that much internal change.
@vincentarelbundock From what I understand, narwhals is used as an interface that translates operations to the underlying functions of polars or pandas.
Currently, I think we convert pandas to polars (through sanitize_newdata()) and then we work with polars everywhere.
The way I see we could apply narwhals, would be by using narwhals.DataFrame instead of polars.DataFrame everywhere which seems like needing quite some refactoring. Is this the way to go? Do you have a suggestion for a different way of using narwhals?
Yes, I think that's basically it.
IIUC, there wouldn't be that much refactoring to do, because the Narwhals API is extremely similar to Polars, which we already use. So it would mostly be a matter of changing:
pl.DataFrame() -> nw.DataFrame()
pl.col() -> nw.col()
I think what I'd like to see is a proof of concept on a very small part of the code base. Converting a couple functions to accept any kind of data frame as input, and returning the original type.
If that passes the tests, we will have a better idea of what it would take to refactor everything. A proof of concept before committing many hours.
@s3alfisc do you have experience doing this? If so, how would you recommend we proceed?
I think what I'd like to see is a proof of concept on a very small part of the code base. Converting a couple functions to accept any kind of data frame as input, and returning the original type.
This sounds like the right strategy to me! I also think that replacing the pl. suffix with nw. should get you at least to 90%.
Unfortunately, I haven't yet started the narwhals migration of pyfixest, so still somewhat inexperienced. But I think that @MarcoGorelli has by now accompanied many such migrations (for example for altair) so he might have some tips / insights?
I opened an Issue with narwhals as I cannot make it apply the scipy.stats.t.cdf function inside the pipe function. https://github.com/narwhals-dev/narwhals/issues/1225
@vincentarelbundock if you have any suggestions
Thanks for opening this issue. It looks detailed and helpful. Hopefully someone will respond with useful info.
But I think that @MarcoGorelli has by now accompanied many such migrations (for example for altair) so he might have some tips / insights?
thanks for the ping! happy to help out along the way (though I wouldn't have capacity right now to lead the effort) - i'd suggest to start small and see if there's any parts of the codebase which it's possible to narwhalify whilst keeping the test suite passing. then, gradually increase
for any questions / help, i'd suggest:
- joining the narwhals discord (link on the project's README)
- attending our community call if you'd like to chat about anything (also linked to on the README)
- i also have a calendly link if the time of the above doesn't work for you
happy to see what we can do here!
marginaleffects version 0.0.14 now uses Narwhals to ingest data frames in many different formats. We still load Polars and do all internal transformations with it, but wrote a single Narwhals function to convert user-supplied data frames on entry.
Thanks to every who contributed to the discussion, and especially to Artiom for the implementation!