dataframe-api Add standard unit of measure support

I don't know if it's possible, but having a standard way to thread through units of measure would be great.

Ideally you could implement something like pint-pandas but instead as pint-dataframe and it would interop seamlessly with all dataframe libraries.

Jul 13 '23 22:07 kszlim

I don't think pint would go into the standard itself - but hopefully the standard would enable someone to write a library-agnostic version of pint-pandas!

Jul 14 '23 14:07 MarcoGorelli

Yep, that's what I mean, it'd be good for the dataframe-api to specify a standard mechanism for transmitting unit of measure data (and/or a mechanism for transmitting metadata + a mechanism that determines how that metadata can change across operations on dfs).

Jul 14 '23 17:07 kszlim

and/or a mechanism for transmitting metadata + a mechanism that determines how that metadata can change across operations on dfs

It seems to me like this is related to gh-40, which discussed adding a way to incorporate any kind of metadata beyond what was standardized in the interchange protocol.

The transmitting or storing part is fairly clear, I think. The second part of your suggestion here is less clear to me @kszlim. It seems to suggest some kind of hook that any dataframe library must call after each method it calls. That could be quite expensive, and there may be other/simpler alternatives (if the dataframe object lives in a pint-dataframe type package, I'd expect all the methods and logic to live there too, and wrap a "base dataframe object" somehow).

Jul 20 '23 13:07 rgommers

Hmm, I see. I'm not sure how a pint-dataframe package would work, would it require wrapping every dataframe library manually or do you see a way that it could work agnostically?

I guess it's pretty hard if not impossible to make it work agnostically without defining a huge space of operations on the dataframe api itself (which I think you guys are trying to avoid?).

Jul 20 '23 18:07 kszlim

Hmm, I see. I'm not sure how a pint-dataframe package would work, would it require wrapping every dataframe library manually or do you see a way that it could work agnostically?

All "base" dataframe objects have the same API, so I imagine you could store it as a private attribute. Something like:

from typing import Any

class PintDataFrame:
    # StandardDataFrame stands for any standard-compliant "base" dataframe type.
    def __init__(self, base_dataframe: "StandardDataFrame", units_metadata: Any) -> None:
        self._df = base_dataframe
        self.units_metadata = units_metadata  # exact type left open

    def sum(self, *, skip_nulls: bool = True) -> "PintDataFrame":
        """Reduction returns a 1-row DataFrame."""
        result = self._df.sum(skip_nulls=skip_nulls)
        # If needed, manipulate units metadata here
        result_metadata = self.units_metadata  # or some transformation
        return PintDataFrame(result, units_metadata=result_metadata)

For all methods that don't actually change the units, I imagine there's a way to handle them in an automated/streamlined fashion. And for the ones that do, the custom logic needs to be written once and is independent of what library the base dataframe comes from.
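As a rough sketch of what that could look like (not a definitive design): unit-preserving methods can be forwarded generically via `__getattr__`, while a unit-changing method such as a variance gets explicit logic. Everything here is an assumption made for illustration: `ToyFrame` is a hypothetical stand-in for a standard-compliant base dataframe, `UnitsFrame` and `_UNIT_PRESERVING` are made-up names, and units are plain strings rather than pint objects.

```python
import statistics


class ToyFrame:
    """Hypothetical stand-in for a standard-compliant base dataframe."""

    def __init__(self, columns):
        self.columns = columns

    def _reduce(self, fn):
        # Apply fn to the non-null values of each column; the result has one row.
        return ToyFrame(
            {k: [fn([x for x in v if x is not None])] for k, v in self.columns.items()}
        )

    # skip_nulls is accepted for API parity but this toy always skips nulls.
    def sum(self, *, skip_nulls=True):
        return self._reduce(sum)

    def max(self, *, skip_nulls=True):
        return self._reduce(max)

    def var(self, *, skip_nulls=True):
        return self._reduce(statistics.pvariance)


class UnitsFrame:
    # Illustrative list of methods assumed to leave units unchanged.
    _UNIT_PRESERVING = frozenset({"sum", "max"})

    def __init__(self, base_dataframe, units_metadata):
        self._df = base_dataframe
        self.units_metadata = dict(units_metadata)

    def __getattr__(self, name):
        # Only reached for attributes not defined on UnitsFrame itself.
        if name not in self._UNIT_PRESERVING:
            raise AttributeError(name)
        method = getattr(self._df, name)

        def wrapper(*args, **kwargs):
            # Delegate to the base dataframe and carry the units over unchanged.
            return UnitsFrame(method(*args, **kwargs), self.units_metadata)

        return wrapper

    def var(self, *, skip_nulls=True):
        # Custom logic, written once: a variance squares the units.
        result = self._df.var(skip_nulls=skip_nulls)
        squared = {col: f"({unit})**2" for col, unit in self.units_metadata.items()}
        return UnitsFrame(result, squared)


df = UnitsFrame(ToyFrame({"length": [1.0, 3.0]}), {"length": "m"})
total = df.sum()   # forwarded generically, units stay "m"
spread = df.var()  # explicit method, units become "(m)**2"
```

Using an allow-list rather than forwarding every attribute means an operation that nobody has classified raises `AttributeError` instead of silently returning wrong units.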

Jul 21 '23 11:07 rgommers

I see. I guess this will still require a bunch of custom implementations if there are operations that don't delegate to "base" dataframe methods, but I suppose that's probably impossible to avoid altogether.

Jul 23 '23 00:07 kszlim