datafusion icon indicating copy to clipboard operation
datafusion copied to clipboard

feat: support for AnyExpression

Open ovr opened this issue 2 years ago • 10 comments

Which issue does this PR close?

https://github.com/apache/arrow-datafusion/issues/2548

Rationale for this change

This PR implements partial support for ANY operator, only for = & <>.

image

Are there any user-facing changes?

This PR doesn't introduce any breaking changes, only new functionality

ovr avatar May 16 '22 16:05 ovr

I added the api change label since this introduces a new AnyExpr variant in the Expr enum, which is a breaking change.

andygrove avatar May 16 '22 16:05 andygrove

@alamb @andygrove is it required to add a new expression for Ballista? I did that in 218a917, but I don't use it. Thanks

ovr avatar May 16 '22 18:05 ovr

@alamb @andygrove is it required to add a new expression for Ballista? I did that in 218a917, but I don't use it. Thanks

I assume you had to add that to fix compilation issues because ballista has exhaustive matching of all expressions? The changes look reasonable to me.

Note that we also have a datafusion.proto (in fact, we have two copies, as noted in https://github.com/apache/arrow-datafusion/issues/2514) that we should ideally also add new expressions to (but we have not been doing this consistently).

andygrove avatar May 17 '22 19:05 andygrove

I plan to review this carefully today -- thanks @ovr

alamb avatar May 18 '22 09:05 alamb

@ovr could you merge latest from master into this branch to pick up https://github.com/apache/arrow-datafusion/pull/2567

andygrove avatar May 18 '22 16:05 andygrove

Thanks -- I'll try and find some more time to review this PR.

On Wed, May 18, 2022 at 5:54 PM Dmitry Patsura @.***> wrote:

@.**** commented on this pull request.

In datafusion/physical-expr/src/expressions/any.rs https://github.com/apache/arrow-datafusion/pull/2549#discussion_r876413237 :

  • fn evaluate(&self, batch: &RecordBatch) -> Result<ColumnarValue> {
  •    let value = match self.value.evaluate(batch)? {
    
  •        ColumnarValue::Array(array) => array,
    
  •        ColumnarValue::Scalar(scalar) => scalar.to_array(),
    
  •    };
    

I haven't reviewed this code super carefully but I wonder what you think about reusing more of the code for InList?

Probably, the nearest operator will be ALL, but it's not implemented right now. I tried to reuse the code of InList, but come to the decision that it will be easier/correct to combine it with ANY in the future.

It seems like the implementations of x IN (, ) and x ANY () are almost exactly the same except for the fact that the the x = ANY() has the list as a single ?

Nope, because it supports more operators: <, <=, >, >= (it's not implemented in this PR, but anyway).

I mention this as it is likely the IN / NOT IN are more optimized (use a hashset for testing, for example) as well as already written.

Yes, but IN supports only the vector of scalars in DF. It's not the same as ANY, because it supports List (real column) or scalar.

For example SELECT left, values FROM table:

left | right LIst(Int64) - another logic is required

  1. | [1,2,3] - ArrayRef = PrimitiveArray<Int64> - another logic is requied

  2. |. [2,3,4]

  3. |. [4,5,6]

  4. | [4,5,6]

  5. |. [4,5,6]

In the case of using the whole column, it is required to use another handling of interacting on values ☝️ . Probably here, will be more correct to go by steps:

  • Support more operators
  • Introduce ALL operator + extract the same logic with ANY.

Thanks

— Reply to this email directly, view it on GitHub https://github.com/apache/arrow-datafusion/pull/2549#discussion_r876413237, or unsubscribe https://github.com/notifications/unsubscribe-auth/AADXZMLLPQHWVVSZ2KIMKULVKVRKVANCNFSM5WCC6J2Q . You are receiving this because you were mentioned.Message ID: @.***>

alamb avatar May 20 '22 12:05 alamb

Codecov Report

Merging #2549 (2d32cb2) into master (4b3eb1c) will decrease coverage by 0.09%. The diff coverage is 69.00%.

@@            Coverage Diff             @@
##           master    #2549      +/-   ##
==========================================
- Coverage   84.64%   84.54%   -0.10%     
==========================================
  Files         267      268       +1     
  Lines       46926    47239     +313     
==========================================
+ Hits        39719    39937     +218     
- Misses       7207     7302      +95     
Impacted Files Coverage Δ
datafusion/core/src/datasource/listing/helpers.rs 95.34% <ø> (ø)
...ion/core/src/optimizer/common_subexpr_eliminate.rs 87.81% <0.00%> (-0.74%) :arrow_down:
...afusion/core/src/optimizer/simplify_expressions.rs 81.90% <0.00%> (-0.10%) :arrow_down:
datafusion/core/src/optimizer/utils.rs 31.77% <0.00%> (-0.49%) :arrow_down:
datafusion/expr/src/utils.rs 91.79% <ø> (ø)
datafusion/physical-expr/src/expressions/mod.rs 100.00% <ø> (ø)
datafusion/proto/src/from_proto.rs 33.94% <0.00%> (-0.16%) :arrow_down:
datafusion/proto/src/to_proto.rs 55.74% <0.00%> (-0.59%) :arrow_down:
datafusion/sql/src/utils.rs 50.19% <0.00%> (-0.79%) :arrow_down:
datafusion/expr/src/expr_schema.rs 66.88% <16.66%> (-2.04%) :arrow_down:
... and 11 more

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 4b3eb1c...2d32cb2. Read the comment docs.

codecov-commenter avatar Jun 02 '22 16:06 codecov-commenter

Rebased ✅ cC @alamb @tustvold

ovr avatar Jun 02 '22 16:06 ovr

I reviewed the logical plan, SQL parsing, and optimizer changes and LGTM. I did not review the execution part.

It might make reviews easier for future work like this to have separate PRs for logical versus physical parts. I would have been more comfortable reviewing and approving the logical plan parts of this.

andygrove avatar Jun 03 '22 00:06 andygrove

Hi @ovr What do you think about breaking this up into smaller PRs and starting with the logical plan changes and SQL parsing? I think this might make it easier to start getting parts of this merged in.

andygrove avatar Jun 21 '22 23:06 andygrove

This PR is more than 6 month old, so closing it down for now to clean up the PR list. Please reopen if this is a mistake and you plan to work on it more

alamb avatar Jan 14 '23 11:01 alamb