Support polars

Open deanm0000 opened this issue 1 year ago • 5 comments

Polars is a (relatively) new dataframe library that is gaining popularity and blows pandas away in performance, using Arrow memory in the backend.

deanm0000 avatar Oct 20 '23 19:10 deanm0000

Hi @deanm0000 !

Thanks for the suggestion. There's a trivial way to implement support by converting to pandas (which is how we currently support pyarrow Tables), or a more complicated integration that fully adds support for polars arrays everywhere, perhaps via just using Arrow arrays. The framework itself doesn't care about the datatypes, but some of the transforms do... and that will be the bulk of the work.

Of course, to get the performance benefits, converting everything to pandas defeats the purpose.

Do you have any instances where you are bottlenecked on performance? Or is this more of a quality-of-life feature request?

matthewwardrop avatar Nov 03 '23 18:11 matthewwardrop

I guess, in those terms, it's a quality-of-life improvement. From a pure usability perspective it isn't hard to convert to pandas. I didn't realize that the pyarrow input just converted to pandas under the hood. I poked around really quickly and couldn't find where in the code the transformations happen. Could you point me to that, for example for a formula like Y~X+I(X^2)?

deanm0000 avatar Nov 03 '23 21:11 deanm0000

The lazy arrow -> pandas conversion happens here: https://github.com/matthewwardrop/formulaic/blob/main/formulaic/materializers/arrow.py . In practice, under the hood, the data can sometimes pass through this conversion uncopied, but compute is then done on numpy arrays or pandas Series depending on the transform. Again, the framework is datatype agnostic, so it is happy with other types... but we'd need to go through and update the transforms (like contrast encodings) to make sure they have implementations for these types.

matthewwardrop avatar Nov 03 '23 22:11 matthewwardrop
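
To illustrate the point above about transforms needing an implementation per data type, here is a minimal sketch using only the Python standard library. It is not formulaic's actual code: the `center` transform is hypothetical, and `list` and `array.array` merely stand in for container types such as pandas Series or polars Series.

```python
from array import array
from functools import singledispatch
from statistics import fmean


@singledispatch
def center(values):
    """Hypothetical transform: subtract the mean from every value.

    The fallback raises, so an unsupported container fails loudly
    instead of being silently mishandled.
    """
    raise TypeError(f"center() has no implementation for {type(values).__name__}")


@center.register
def _(values: list) -> list:
    # Implementation for plain Python lists.
    mean = fmean(values)
    return [v - mean for v in values]


@center.register
def _(values: array) -> array:
    # Implementation for typed arrays; the result is a new array of doubles.
    mean = fmean(values)
    return array("d", (v - mean for v in values))
```

In this sketch, calling `center` on an unregistered type such as a tuple raises `TypeError`. Supporting a new dataframe library would amount to registering one more implementation for each transform, which is roughly the kind of per-transform work described above.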

Maybe one thing to consider here is the effort to come up with a DataFrame API: https://data-apis.org/dataframe-api/draft/

It could be handy to write DataFrame agnostic code.

glemaitre avatar Nov 22 '23 10:11 glemaitre

Hi @matthewwardrop - would you be open to using Narwhals for this? Altair recently adopted it for this purpose (https://github.com/vega/altair/pull/3452), as did scikit-lego.

Happy to put up a POC if you'd be interested (just checking first!)

MarcoGorelli avatar Jul 17 '24 09:07 MarcoGorelli