datafusion icon indicating copy to clipboard operation
datafusion copied to clipboard

feat: Support decimal for variance

Open kumarUjjawal opened this issue 1 month ago • 6 comments

Which issue does this PR close?

  • Part of #1545.

Rationale for this change

What changes are included in this PR?

  • widen the variance UDF signature to allow decimal arguments while keeping Float64 outputs, and add unit coverage that feeds decimal batches through both population and sample accumulators
  • extend aggregate.slt with decimal variance queries and align the expected strings with the Float64 results emitted by the engine

Are these changes tested?

Are there any user-facing changes?

kumarUjjawal avatar Nov 26 '25 05:11 kumarUjjawal

I don't think the changes are doing what you expect here; I believe decimals are still being converted to float for these calculations. What we are looking to do is to perform the calculations natively on decimals without casting them to floats. The function already supports decimal inputs by casting them to floats.

Thanks for pointing that. I just checked out the PR #1408 for the guidance and updated the code. I also used LLM for this.(disclosure)

kumarUjjawal avatar Nov 26 '25 09:11 kumarUjjawal

One of the key requirements that we want here is that decimal inputs are not cast to floats, and the return type remains decimal. I think if you check the tests introduced here you'll see that is not the case currently, hence the code for native decimal implementation isn't actually being used.

Jefffrey avatar Nov 26 '25 12:11 Jefffrey

Unfortunately I don't currently have the capacity to try guide this PR along; I feel some of my previous comments have still not been addressed and a lot of the code introduced here seems hard to follow along & understand.

Hopefully I'll be able to circle back to this in a few weeks.

Jefffrey avatar Dec 09 '25 08:12 Jefffrey

Thanks for being patient with this one. This PR has been particularly tricky for me. I appreciate your feedbacks as always and try to incorporate. Let me know whenenver you get the time for moving forward with this, I will try to follow along with your suggestions.

kumarUjjawal avatar Dec 09 '25 08:12 kumarUjjawal

I took another quick look, and I think some of my prior comments were inaccurate, I apologize for that. I was mainly thrown by some parts of the code which lead me down the wrong line of thinking; for example, this signature:

fn variance_signature() -> Signature {
    Signature::one_of(
        vec![
            TypeSignature::Numeric(1),
            TypeSignature::Coercible(vec![Coercion::new_exact(
                TypeSignatureClass::Decimal,
            )]),
        ],
        Volatility::Immutable,
    )
}

The coercible part there is redundant since Numeric(1) should already cover decimals; I was mistaken in thinking the new code paths introduced in this PR were not being in effect because of this misconception.

My high level thoughts on this PR now are:

  • Would benefit from cleaning up/refactoring code to be more concise
    • For example, the new is_numeric_or_decimal() function seems extremely redundant considering a decimal type is already numeric; having code like this is quite confusing and makes it harder to understand what's actually important in this code that we should be reviewing more carefully
  • Seems odd there is a manual implementation of casting decimal to float here (is that what is happening?)
  • Is it possible to maintain the return type as decimal instead of casting to float, or does that introduced precision loss?

Jefffrey avatar Dec 10 '25 14:12 Jefffrey

I am taking another crack at it and try to refactor the code along with removing the redundant code. The code changes I'm doing still pretty complex to me at the moment and could be adding more complexity. Let's see how it goes. I will push the changes for you to take a look, if it still seems not upto the mark. I can make this as draft and tackle this later on when I have more confidence on the implementation.

kumarUjjawal avatar Dec 10 '25 18:12 kumarUjjawal