datafusion icon indicating copy to clipboard operation
datafusion copied to clipboard

dynamic filter refactor

Open jayzhan211 opened this issue 9 months ago • 4 comments

Which issue does this PR close?

  • Closes #.

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

jayzhan211 avatar Apr 11 '25 13:04 jayzhan211

https://github.com/apache/datafusion/pull/15568#discussion_r2038773841

why the change is equivalent to yours in the high level idea.

  1. DynamicFilterPhysicalExpr gets initialized at planning time with a known set of children but a placeholder expression (lit(true))

The same.

  1. with_new_children is called making a new DynamicFilterPhysicalExpr but with the children replaced (let's ignore how that happens internally for now)

We need to replace children, and we can achieve this and get the result by snapshot

  1. update is called on the original reference with an expression that references the original children. This is propagated to all references, including those with new children, because of the Arc<RwLock<...>>.

We can keep Arc<RwLock<T>> and update the inner or even create another new source filter.

  1. evaluate is called on one of the references that previously had with_new_children called on it. Since update was called, which swapped out inner, the children of this new inner need to be remapped to the children that we currently expose externally.

We can call evaluate on snapshot because snapshot is already remapped. Your version need to remap for each evaluate called, but snapshot in my version done it once, and we evaluate on it without remap.

The improvement of this chanage

  1. We have correct with_new_children because we update the source filter now.
  2. DynamicFilterPhysicalExpr is basically filter expression: Arc<dyn PhysicalExpr>>. We have a simple interface with the same capability now.

Concern

  1. reassign_predicate_columns is replaced by snapshot but you think we can't do this kind of change because of API churn. I think this is not an issue because only DatafusionArrowPredicate is used.
  2. Do we need Lock for source filter? I think we can create another new DynamicFilterPhysicalExpr at all. But maybe there is some reasons we can't, we can discuss further on this.

jayzhan211 avatar Apr 12 '25 02:04 jayzhan211

Hey @jayzhan211 thank you for putting the work into trying to clarify this.

At this point I think it would be best to wait for #15566 or a PR that replaces it to be merged so that we can work against an actual use case / implementation of these dynamic filters. Otherwise I think it's a bit hard to communicate in such abstract terms. Once we're looking at a concrete use case it will be easier to make a PR to replace this implementation.

Would it be okay with you to wait until that happens to continue this discussion?

Sorry if merging a PR with a bad implementation becomes problematic... luckily it's problematic for us, not end users, since this is all private implementations.

adriangb avatar Apr 12 '25 05:04 adriangb

Blockers are gone. I think we can focus on this now

berkaysynnada avatar Apr 17 '25 15:04 berkaysynnada

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

github-actions[bot] avatar Jun 18 '25 02:06 github-actions[bot]