pymarginaleffects icon indicating copy to clipboard operation
pymarginaleffects copied to clipboard

`narwhals` to support pandas as input

Open s3alfisc opened this issue 1 year ago • 2 comments

Hi @vincentarelbundock - @juanitorduz pointed me towards @MarcoGorelli's narwhals project, which solves the "two DataFrame APIs" problem for developers by allowing to define APIs that are agnostic to the input data frame type.

I.e. one can do things as

import narwhals as nw
import pandas as pd 
import polars as pl

def func(df_any):
    df = nw.from_native(df_any)
    df = df.select(
        a_sum=nw.col('a').sum(),
        a_mean=nw.col('a').mean(),
        a_std=nw.col('a').std(),
    )
    return nw.to_native(df)

df = pd.DataFrame({'a': [1, 2, 3, 4, 5]})
func(df) # returns pandas DataFrame

df = pl.DataFrame({'a': [1, 2, 3, 4, 5]})
func(df) # returns polars DataFrame

In other words - your polars code will work on pandas inputs, without requiring pandas as a dependency! In fact, narwhals even allows you to drop the polars dependency if you wanted to.

Happy to try myself at a PR for this once I find the time =)

s3alfisc avatar Jul 02 '24 19:07 s3alfisc

Ooooh that sounds like exactly what we need.

I won't have much (if any) time to write code for this in the next couple months, but I'd be more than happy to review a PR, especially if the changes are pretty minimal.

Thanks for raising this issue!

vincentarelbundock avatar Jul 02 '24 19:07 vincentarelbundock

@artiom-matvei you could take a look at this if you have time. Seems simple enough, and may not require that much internal change.

vincentarelbundock avatar Oct 09 '24 21:10 vincentarelbundock

@vincentarelbundock From what I understand, narwhals is used as an interface that translates operations to the underlying functions of polars or pandas.

Currently, I think we convert pandas to polars (through sanitize_newdata()) and then we work with polars everywhere.

The way I see we could apply narwhals, would be by using narwhals.DataFrame instead of polars.DataFrame everywhere which seems like needing quite some refactoring. Is this the way to go? Do you have a suggestion for a different way of using narwhals?

artiom-matvei avatar Oct 18 '24 17:10 artiom-matvei

Yes, I think that's basically it.

IIUC, there wouldn't be that much refactoring to do, because the Narwhals API is extremely similar to Polars, which we already use. So it would mostly be a matter of changing:

pl.DataFrame() -> nw.DataFrame()
pl.col() -> nw.col()

I think what I'd like to see is a proof of concept on a very small part of the code base. Converting a couple functions to accept any kind of data frame as input, and returning the original type.

If that passes the tests, we will have a better idea of what it would take to refactor everything. A proof of concept before committing many hours.

@s3alfisc do you have experience doing this? If so, how would you recommend we proceed?

vincentarelbundock avatar Oct 18 '24 17:10 vincentarelbundock

I think what I'd like to see is a proof of concept on a very small part of the code base. Converting a couple functions to accept any kind of data frame as input, and returning the original type.

This sounds like the right strategy to me! I also think that replacing the pl. suffix with nw. should get you at least to 90%.

Unfortunately, I haven't yet started the narwhals migration of pyfixest, so still somewhat inexperienced. But I think that @MarcoGorelli has by now accompanied many such migrations (for example for altair) so he might have some tips / insights?

s3alfisc avatar Oct 18 '24 21:10 s3alfisc

I opened an Issue with narwhals as I cannot make it apply the scipy.stats.t.cdf function inside the pipe function. https://github.com/narwhals-dev/narwhals/issues/1225 @vincentarelbundock if you have any suggestions

artiom-matvei avatar Oct 18 '24 23:10 artiom-matvei

Thanks for opening this issue. It looks detailed and helpful. Hopefully someone will respond with useful info.

vincentarelbundock avatar Oct 18 '24 23:10 vincentarelbundock

But I think that @MarcoGorelli has by now accompanied many such migrations (for example for altair) so he might have some tips / insights?

thanks for the ping! happy to help out along the way (though I wouldn't have capacity right now to lead the effort) - i'd suggest to start small and see if there's any parts of the codebase which it's possible to narwhalify whilst keeping the test suite passing. then, gradually increase

for any questions / help, i'd suggest:

  • joining the narwhals discord (link on the project's README)
  • attending our community call if you'd like to chat about anything (also linked to on the README)
  • i also have a calendly link if the time of the above doesn't work for you

happy to see what we can do here!

MarcoGorelli avatar Oct 19 '24 20:10 MarcoGorelli

marginaleffects version 0.0.14 now uses Narwhals to ingest data frames in many different formats. We still load Polars and do all internal transformations with it, but wrote a single Narwhals function to convert user-supplied data frames on entry.

Thanks to every who contributed to the discussion, and especially to Artiom for the implementation!

vincentarelbundock avatar Oct 27 '24 11:10 vincentarelbundock