
Extension mechanism to support quantities (pint, astropy, ...)

Open valgai opened this issue 1 year ago • 1 comments

Is your feature request related to a problem? Please describe.

I'd like to be able to use quantities (astropy, pint) in my dataframe in an efficient manner.

Describe the solution you'd like

I'd like to be able to extend Series to give it context information and to add custom behavior.

Here is an example of what I would like to do:

Quantities can be represented as a value (int, float) with a unit (Hz, meter, ...). Instead of having a Series containing many quantities with the same unit, it could be interesting to have a Series containing only the values, with the unit stored as metadata. I'll refer to such a series as a QuantitySeries from now on.

In case operations are applied on the elements of the same QuantitySeries, there is no problem as they share the same unit. Operations can be done directly on values.

However, with an operation op working on a QuantitySeries qs and a Series s, there is some extra work to do:

  • If s is not a QuantitySeries, just do op(qs, s), using the values of qs.
  • If s is a QuantitySeries:
    • Check if s.unit is compatible with qs.unit, or raise an error.
    • Create a temporary QuantitySeries s' from s, converted to the same unit as qs.
    • Apply op(qs, s').
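The steps above can be sketched in plain Python. This is only an illustration and not part of Daft: the `UNITS` table, the `to` method and the `apply_op` helper are hypothetical names, values are held in a plain list, and the sketch only covers element-wise operations that keep the unit of qs (such as addition or subtraction). A real implementation would presumably delegate unit handling to pint or astropy.

```python
import operator
from dataclasses import dataclass

# Hypothetical conversion table: unit -> (dimension, factor to the base unit).
# Stands in for what pint or astropy would provide.
UNITS = {
    "m": ("length", 1.0),
    "km": ("length", 1000.0),
    "Hz": ("frequency", 1.0),
    "kHz": ("frequency", 1000.0),
}


@dataclass
class QuantitySeries:
    """Bare values plus a single unit shared by the whole series."""

    values: list
    unit: str

    def to(self, unit):
        """Return a copy expressed in `unit`; raise if the units are incompatible."""
        src_dim, src_factor = UNITS[self.unit]
        dst_dim, dst_factor = UNITS[unit]
        if src_dim != dst_dim:
            raise ValueError(f"cannot convert {self.unit} to {unit}")
        converted = [v * src_factor / dst_factor for v in self.values]
        return QuantitySeries(converted, unit)


def apply_op(op, qs, s):
    """Apply an element-wise, unit-preserving op between qs and s."""
    if isinstance(s, QuantitySeries):
        # Build the temporary s' in the unit of qs, then work on bare values.
        s = s.to(qs.unit).values
    return QuantitySeries([op(a, b) for a, b in zip(qs.values, s)], qs.unit)


distances = QuantitySeries([1.0, 2.0], "km")
offsets = QuantitySeries([500.0, 250.0], "m")
total = apply_op(operator.add, distances, offsets)  # result stays in km
```

The unit check runs once per series rather than once per element, which is the efficiency argument for storing the unit as metadata.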

I don't know if implementing such behavior directly from Python would be possible, or if it's the way to go. I'm interested in this feature because of quantities, but I think this extension mechanism could be useful for other use cases as well.

Describe alternatives you've considered

I considered using :

However, none of these scale to my workload.

valgai avatar Mar 13 '24 13:03 valgai

Hi @valgai! This is really interesting.

Daft doesn't currently have any concept of quantities, and it might be tricky to implement this in core Daft as a first-class concept.

However, each Field in the Schema actually does have metadata (which I think we don't currently expose in our Python API). It could be interesting to see how we could leverage that metadata to do some manipulations of our expression tree:

col("weight_grams") * col("joules_per_kilograms") ==> (col("weight_grams") / 1000) * col("joules_per_kilograms")

In order to do this, Daft needs to understand the quantities and transformations required to modify the expression tree to match the intended effect. Definitely seems a little tricky, and maybe a little too magical though 😛
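As a rough illustration of the kind of rewrite described above, here is a toy expression tree in plain Python. None of this is Daft's actual expression API: `Col`, `Lit`, `BinOp`, the `UNITS` metadata table and the `rewrite` rule are all hypothetical, and the rule only handles the single grams-times-per-kilogram case from the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Col:
    name: str


@dataclass(frozen=True)
class Lit:
    value: float


@dataclass(frozen=True)
class BinOp:
    op: str  # "*" or "/"
    left: object
    right: object


# Hypothetical unit metadata attached to each column's field.
UNITS = {"weight_grams": "g", "joules_per_kilograms": "J/kg"}


def rewrite(expr):
    """Insert a grams -> kilograms conversion when a "g" column is
    multiplied by a "J/kg" column; leave every other expression as is."""
    if (
        isinstance(expr, BinOp)
        and expr.op == "*"
        and isinstance(expr.left, Col)
        and isinstance(expr.right, Col)
        and UNITS.get(expr.left.name) == "g"
        and UNITS.get(expr.right.name) == "J/kg"
    ):
        return BinOp("*", BinOp("/", expr.left, Lit(1000)), expr.right)
    return expr


def evaluate(expr, row):
    """Evaluate the toy tree against one row given as a dict."""
    if isinstance(expr, Col):
        return row[expr.name]
    if isinstance(expr, Lit):
        return expr.value
    left, right = evaluate(expr.left, row), evaluate(expr.right, row)
    return left * right if expr.op == "*" else left / right


plan = BinOp("*", Col("weight_grams"), Col("joules_per_kilograms"))
rewritten = rewrite(plan)
joules = evaluate(rewritten, {"weight_grams": 2000.0, "joules_per_kilograms": 3.0})
```

Even in this toy form, the rule has to know both the units involved and where the conversion belongs in the tree, which is the part that would need to generalize.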

jaychia avatar Mar 20 '24 19:03 jaychia