Extension mechanism to support quantities (pint, astropy, ...)
Is your feature request related to a problem? Please describe.
I'd like to be able to use quantities (astropy, pint) in my dataframe in an efficient manner.
Describe the solution you'd like
I'd like to be able to extend Series to give it context information and to add custom behavior.
Here is an example of what I would like to do:
Quantities can be represented as a value (int, float) with a unit (Hz, meter, ...). Instead of having a Series containing many quantities with the same unit, it could be interesting to have a Series containing only values, with the unit as metadata. I'll refer to such a series as a QuantitySeries from now on.
In case operations are applied on the elements of the same QuantitySeries, there is no problem as they share the same unit. Operations can be done directly on values.
However, with an operation `op` working on a QuantitySeries `qs` and a Series `s`, there are a few things to do:
- If `s` is not a `QuantitySeries`, just do `op(qs, s)`, using the values of `qs`.
- If `s` is a `QuantitySeries`:
  - Check if `s.unit` is compatible with `qs.unit`, or raise an error.
  - Create a temporary `QuantitySeries` `s'` from `s` using the same unit as `qs`.
  - Apply `op(qs, s')`.
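The steps above can be sketched in plain Python. This is only an illustration of the dispatch logic, not a proposal for the actual API: `QuantitySeries`, `apply_op`, and the `UNITS` table are hypothetical names, and a real implementation would delegate unit handling to pint or astropy rather than a hand-written table.

```python
import operator
from dataclasses import dataclass
from typing import Callable, List, Union

# Hypothetical unit table: unit name -> (dimension, factor to the base unit).
# Assumption for illustration only; pint or astropy would supply this.
UNITS = {
    "m": ("length", 1.0),
    "km": ("length", 1000.0),
    "s": ("time", 1.0),
    "ms": ("time", 0.001),
}


@dataclass
class QuantitySeries:
    """Plain values plus a single unit stored once as metadata."""
    values: List[float]
    unit: str

    def to(self, unit: str) -> "QuantitySeries":
        """Return a copy expressed in `unit`, or raise if incompatible."""
        src_dim, src_factor = UNITS[self.unit]
        dst_dim, dst_factor = UNITS[unit]
        if src_dim != dst_dim:
            raise ValueError(f"cannot convert {self.unit!r} to {unit!r}")
        converted = [v * src_factor / dst_factor for v in self.values]
        return QuantitySeries(converted, unit)


def apply_op(
    op: Callable[[float, float], float],
    qs: QuantitySeries,
    s: Union[QuantitySeries, List[float]],
) -> QuantitySeries:
    """Apply a binary op element-wise between qs and s.

    The result keeps qs.unit, which is only correct for additive
    operations (add, sub); multiplication or division would need to
    derive a new unit.
    """
    if isinstance(s, QuantitySeries):
        # Compatibility check and temporary conversion to qs.unit.
        other = s.to(qs.unit).values
    else:
        # Plain series: operate directly on the values of qs.
        other = s
    return QuantitySeries([op(a, b) for a, b in zip(qs.values, other)], qs.unit)


distances = QuantitySeries([1.0, 2.0], "km")
offsets = QuantitySeries([500.0, 1000.0], "m")
total = apply_op(operator.add, distances, offsets)
print(total.unit, total.values)
```

The point of the design is that the unit check and the conversion factor are computed once per series rather than once per element, which is where the efficiency gain over a Series of individual quantity objects would come from.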
I don't know if implementing such a behavior directly from Python would be possible, or if it's the way to go. I'm interested in this feature because of quantities, but I think this extension mechanism could be interesting for other use cases as well.
Describe alternatives you've considered
I considered using :
However, none of these scale to my workload.
Hi @valgai! This is really interesting.
Daft doesn't currently have any concept of quantities, and it might be tricky to implement this in core Daft as a first-class concept.
However, each Field in the Schema actually does have metadata (which I think we don't currently expose in our Python API yet). It could be interesting to see how we could leverage that metadata to do some manipulations of our expression tree:
col("weight_grams") * col("joules_per_kilograms") ==> (col("weight_grams") / 1000) * col("joules_per_kilograms")
In order to do this, Daft needs to understand the quantities and transformations required to modify the expression tree to match the intended effect. Definitely seems a little tricky, and maybe a little too magical though 😛
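As a rough sketch of what such a rewrite pass could look like, here is a toy expression tree in plain Python. None of this is Daft's actual API: `Col`, `Lit`, `BinOp`, `rewrite`, and the `GRAMS_PER` table are hypothetical names, and the unit metadata is assumed to be attached directly to each column node.

```python
from dataclasses import dataclass
from typing import Optional, Union

# Hypothetical mass-unit table: grams per unit.
GRAMS_PER = {"g": 1.0, "kg": 1000.0}


@dataclass(frozen=True)
class Col:
    name: str
    unit: Optional[str] = None      # unit of the values, e.g. "g"
    per_unit: Optional[str] = None  # denominator unit, e.g. "kg" for J/kg


@dataclass(frozen=True)
class Lit:
    value: float


@dataclass(frozen=True)
class BinOp:
    op: str
    left: "Expr"
    right: "Expr"


Expr = Union[Col, Lit, BinOp]


def rewrite(expr: Expr) -> Expr:
    """Insert a scaling division where a mass column is multiplied by a
    per-mass column expressed in a different mass unit."""
    if not isinstance(expr, BinOp):
        return expr
    left, right = rewrite(expr.left), rewrite(expr.right)
    if (
        expr.op == "*"
        and isinstance(left, Col)
        and isinstance(right, Col)
        and left.unit in GRAMS_PER
        and right.per_unit in GRAMS_PER
        and left.unit != right.per_unit
    ):
        # Convert the left column into the unit the right column expects.
        factor = GRAMS_PER[right.per_unit] / GRAMS_PER[left.unit]
        left = BinOp("/", left, Lit(factor))
    return BinOp(expr.op, left, right)


weight = Col("weight_grams", unit="g")
energy = Col("joules_per_kilograms", per_unit="kg")
print(rewrite(BinOp("*", weight, energy)))
```

This only handles the single multiplication pattern from the example above. A general version would need a full unit algebra, which is the part that makes the idea tricky.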