datafusion icon indicating copy to clipboard operation
datafusion copied to clipboard

Support bounds evaluation for temporal data types

Open ch-sc opened this issue 11 months ago • 17 comments

Which issue does this PR close?

As discussed in 14237 temporal data types should be supported in bounds evaluation.

  • Closes #14237.

Rationale for this change

We want to extend the number of expressions that can evaluate bounds to calculate statistics and enable better/more query plan optimisations.

What changes are included in this PR?

  • Additional data types are supported: Timestamp, Date, Time, Duration, Interval
  • Additional binary operators are supported: ~NotEq~, IsDistinctFrom, IsNotDistinctFrom
  • Whether an expression supports bounds evaluation or not is now part of the PhysicalExpr trait and therefore decided by every expression type itself. This will also enable UDFs to implement bounds evaluation.

Are these changes tested?

yes

Are there any user-facing changes?

PhysicalExpr and ScalarUDFImpl change, but both provide default implementations for additional functions.

ch-sc avatar Feb 06 '25 11:02 ch-sc

Thank you @ch-sc for working on this. When you need a review, I can do that if you ping me.

berkaysynnada avatar Feb 07 '25 07:02 berkaysynnada

Thanks @berkaysynnada! Yeah I'd like to get a review.

There is still a bug, though, that I don't understand yet. Sometimes the sort operator gets removed as can be seen in the repartition_scan.slt test.

ch-sc avatar Feb 07 '25 12:02 ch-sc

Sometimes the sort operator gets removed as can be seen in the repartition_scan.slt test.

Seems like you're losing the order requirement somehow

berkaysynnada avatar Feb 07 '25 12:02 berkaysynnada

@berkaysynnada do you have time to take another look? :)

NotEq leads to the removal of the sort operator.

I debugged into this and noticed that the EnforceSorting optimiser removes SortExec. enforce_sorting/mod.rs:415 However, prior to that the order requirement somehow is lost and I don't understand why that happens.

I removed NotEq again from the supported binary operators to move this forward. We might want to look into this in a follow-up PR. WDYT?

ch-sc avatar Feb 12 '25 13:02 ch-sc

@berkaysynnada do you have time to take another look? :)

Thanks again for driving this forward :) I'm a bit busy these days, but I'd love to go over it. If it's not urgent for you, would you mind waiting until the weekend?

I removed NotEq again from the supported binary operators to move this forward. We might want to look into this in a follow-up PR. WDYT?

Thanks for the heads-up. I try to add support for NotEq. If I find a solution, I can fix it within this PR itself—if that's okay with you.

I took a quick look, and you can use the union logic there: https://github.com/Fly-Style/datafusion/blob/fee3023a63227f0a22ac2da1d040d373cc028c34/datafusion/expr-common/src/interval_arithmetic.rs#L667 It seems to follow a better style, IMO.

berkaysynnada avatar Feb 12 '25 14:02 berkaysynnada

I try to add support for NotEq. If I find a solution, I can fix it within this PR itself—if that's okay with you.

Sure, feel free to make any adjustment as you see fit.

ch-sc avatar Feb 13 '25 12:02 ch-sc

I took a quick look, and you can use the union logic there

I thought a bit more about the union logic and came up with another solution.

Sometimes intervals are not overlapping and sometimes have no bound in one direction (NULL). Typically the union of non-overlapping intervals would result in a set of intervals, but there is no support of this.

Here are some examples of what I think should be correct, without interval sets support:

(1, 2)    ∪ (3,4)        = (1,4)
(1, NULL) ∪ (NULL, NULL) = (1, NULL)
(NULL, 1) ∪ (NULL, NULL) = (NULL, 1)
(3, NULL) ∪ (NULL, 1)    = (NULL, NULL)
(5, NULL) ∪ (1, 2)       = (1, NULL)

I adapted the union logic accordingly and added tests. Happy for your feedback!

ch-sc avatar Feb 14 '25 19:02 ch-sc

Looks like I confused the meaning of nulls in intervals. I switched to the union logic @berkaysynnada suggested.

ch-sc avatar Feb 18 '25 14:02 ch-sc

@berkaysynnada thank you for your review 🙂

Rather than introducing new supports_() API's, should we force the users to infer the support by the evaluate API?

I agree, something like that would be ideal. I'll look into it today

ch-sc avatar Feb 25 '25 13:02 ch-sc

Hi @ch-sc. Do you plan to complete the work here? (as it's so almost)

berkaysynnada avatar Mar 16 '25 14:03 berkaysynnada

Hi @berkaysynnada, I should be able to spend some time on this at the end of this week.

ch-sc avatar Mar 17 '25 07:03 ch-sc

Hi @berkaysynnada, can you take another look to move this forward? :)

ch-sc avatar Apr 09 '25 11:04 ch-sc

Hi @berkaysynnada, can you take another look to move this forward? :)

I will do but some tests are failing. I'll have some time probably in the evening or tomorrow morning

berkaysynnada avatar Apr 10 '25 11:04 berkaysynnada

Hi @ch-sc. Reviewing while tests are failing is not very efficient. Are you dealing with them?

berkaysynnada avatar Apr 11 '25 08:04 berkaysynnada

Sorry @berkaysynnada, I got side-tracked from this yesterday. The test is fixed.

ch-sc avatar Apr 11 '25 14:04 ch-sc

Thank you @ch-sc. I will review it today

berkaysynnada avatar Apr 17 '25 07:04 berkaysynnada

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

github-actions[bot] avatar Jun 17 '25 02:06 github-actions[bot]