Add standard unit of measure support
I don't know if it's possible, but having a standard way to thread units of measure through would be great.
Ideally you could implement something like pint-pandas but instead as pint-dataframe and it would interop seamlessly with all dataframe libraries.
I don't think pint would go into the standard itself - but hopefully the standard would enable someone to write a library-agnostic version of pint-pandas!
Yep, that's what I mean, it'd be good for the dataframe-api to specify a standard mechanism for transmitting unit of measure data (and/or a mechanism for transmitting metadata + a mechanism that determines how that metadata can change across operations on dfs).
> and/or a mechanism for transmitting metadata + a mechanism that determines how that metadata can change across operations on dfs
It seems to me like this is related to gh-40, which discussed adding a way to incorporate any kind of metadata beyond what was standardized in the interchange protocol.
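To make the "transmitting" half concrete, here is a minimal sketch of how a producer might attach unit metadata under a namespaced key in a `metadata` dict, similar in spirit to the convention in the interchange protocol. The class and the `"pint.units"` key are assumptions for illustration, not anything the standard currently defines:

```python
# Hypothetical sketch: a producer attaches unit metadata under a
# namespaced key in a metadata dict. The class name and the
# "pint.units" key are invented for illustration.

class UnitsAwareDataFrame:
    """Minimal stand-in for a dataframe exposing a metadata dict."""

    def __init__(self, columns, metadata=None):
        self.columns = columns
        self.metadata = metadata or {}

df = UnitsAwareDataFrame(
    columns={"distance": [1.0, 2.0], "time": [0.5, 1.0]},
    metadata={"pint.units": {"distance": "metre", "time": "second"}},
)

# A consumer library can look the units up without knowing the producer:
units = df.metadata.get("pint.units", {})
print(units["distance"])  # metre
```

The point is only that a namespaced metadata dict is enough for the "storing and transmitting" part; what happens to those entries across operations is the harder question discussed below.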
The transmitting or storing part is fairly clear, I think. The second part of your suggestion is less clear to me @kszlim. It seems to imply some kind of hook that every dataframe library must call after each method invocation. That could be quite expensive, and there may be simpler alternatives: if the dataframe object lives in a pint-dataframe type package, I'd expect all the methods and logic to live there too, wrapping a "base dataframe object" somehow.
Hmm, I see. I'm not sure how a pint-dataframe package would work, would it require wrapping every dataframe library manually or do you see a way that it could work agnostically?
I guess it's pretty hard if not impossible to make it work agnostically without defining a huge space of operations on the dataframe api itself (which I think you guys are trying to avoid?).
> Hmm, I see. I'm not sure how a pint-dataframe package would work, would it require wrapping every dataframe library manually or do you see a way that it could work agnostically?
All "base" dataframe objects have the same API, so I imagine you could store it as a private attribute. Something like:
```python
class PintDataFrame:
    def __init__(self, base_dataframe: StandardDataFrame, units_metadata) -> None:
        self._df = base_dataframe
        self.units_metadata = units_metadata

    def sum(self, *, skip_nulls: bool = True) -> PintDataFrame:
        """Reduction returns a 1-row DataFrame."""
        result = self._df.sum(skip_nulls=skip_nulls)
        # If needed, manipulate units metadata here
        result_metadata = self.units_metadata  # or some transformation
        return PintDataFrame(result, units_metadata=result_metadata)
```
For all methods that don't actually change the units, I imagine there's a way to handle them in an automated/streamlined fashion. And for the ones that do, the custom logic needs to be written once and is independent of what library the base dataframe comes from.
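One way the "automated/streamlined fashion" could look is attribute-level delegation: forward unit-preserving methods to the base dataframe and rewrap the result, so only unit-changing operations need hand-written logic. This is a sketch under assumptions; the method names in `UNIT_PRESERVING` and the `FakeDF` stand-in are invented, and a real implementation would need a vetted list per method:

```python
from __future__ import annotations
from typing import Any

# Assumed set of methods that never change units (illustrative only).
UNIT_PRESERVING = {"head", "sort_values", "rename"}

class PintDataFrame:
    def __init__(self, base_dataframe: Any, units_metadata: dict) -> None:
        self._df = base_dataframe
        self._units = units_metadata

    def __getattr__(self, name: str):
        # Automatically forward unit-preserving methods to the base
        # dataframe and rewrap the result with unchanged units.
        if name in UNIT_PRESERVING:
            def method(*args, **kwargs):
                result = getattr(self._df, name)(*args, **kwargs)
                return PintDataFrame(result, dict(self._units))
            return method
        raise AttributeError(name)

# Tiny stand-in for a standard-compliant base dataframe:
class FakeDF:
    def head(self):
        return FakeDF()

wrapped = PintDataFrame(FakeDF(), {"x": "metre"})
out = wrapped.head()
print(out._units)  # {'x': 'metre'}
```

Because the wrapper only touches the standard API, the same delegation logic would work regardless of which library produced the base dataframe.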
I see. I guess this will still require a bunch of custom implementations for operations that don't delegate to "base" dataframe methods, but I suppose that's probably impossible to avoid altogether.