datafusion-comet icon indicating copy to clipboard operation
datafusion-comet copied to clipboard

[EPIC] Add Spark expression coverage

Open viirya opened this issue 1 year ago • 11 comments

What is the problem the feature request solves?

This is an umbrella ticket for the list of unsupported Spark expressions. This is not necessary comprehensive list of all Spark expressions because they are too many. We can start from frequently used expressions.

Hash expressions #205

  • [ ] Md5: supported by DataFusion, but Comet disables it for now. DataFusion crypto_expressions includes blake3 which cannot be built on Mac platform. We cannot separately enable only md5 in DataFusion.
  • [x] Murmur3Hash
  • [x] XxHash64
  • [ ] HiveHash
  • [ ] Sha1
  • [ ] Sha2
  • [ ] Crc32

Datetime expressions

Conditional expressions

  • [x] If
  • [x] CaseWhen

Arithmetic expressions

  • [x] UnaryMinus
  • [ ] UnaryPositive
  • [x] Abs
  • [x] Add
  • [x] Subtract
  • [x] Multiply
  • [x] Divide
  • [ ] IntegralDivide
  • [x] Remainder
  • [ ] Pmod
  • [ ] Least
  • [ ] Greatest

Bitwise expressions

  • [x] BitwiseAnd
  • [x] BitwiseOr
  • [x] BitwiseXor
  • [x] BitwiseNot
  • [ ] BitwiseCount
  • [ ] BitwiseGet

String expressions

  • [x] Ascii
  • [x] BitLength
  • [x] StringInstr
  • [x] StringRepeat
  • [x] StringReplace
  • [x] StringTranslate
  • [x] StringTrim
  • [x] StringTrimLeft
  • [x] StringTrimRight
  • [x] StringTrimBoth
  • [x] Upper
  • [x] Lower
  • [x] Length
  • [x] InitCap
  • [x] Chr
  • [x] ConcatWs

Math expressions

  • [x] Acos
  • [x] Asin
  • [x] Atan
  • [x] Atan2
  • [x] Ceil
  • [x] Cos
  • [x] Exp
  • [x] Floor
  • [x] Log
  • [x] Log10
  • [x] Log2
  • [x] Pow
  • [x] Round (3.3+)
  • [x] Signum
  • [x] Sin
  • [x] Sqrt
  • [x] Tan
  • [ ] Asinh
  • [ ] Sinh
  • [ ] Csc
  • [ ] Rint
  • [ ] Log1p
  • [ ] Factorial
  • [ ] RoundFloor
  • [ ] Expm1
  • [ ] Conv
  • [ ] Acosh
  • [ ] Cosh
  • [ ] Sec
  • [ ] RoundCeil
  • [ ] Cbrt
  • [ ] Pi
  • [ ] EulerNumber
  • [ ] Cot
  • [ ] Tanh
  • [ ] Atanh
  • [ ] ToDegrees
  • [ ] ToRadians
  • [ ] Bin
  • [ ] Hex
  • [ ] Unhex
  • [x] ShiftLeft
  • [x] ShiftRight
  • [ ] ShiftRightUnsigned
  • [ ] Hypot
  • [ ] Logarithm
  • [ ] BRound
  • [ ] WidthBucket

Predicates

  • [x] In
  • [x] InSet
  • [x] Not
  • [ ] InSubquery
  • [x] And
  • [x] Or
  • [x] EqualTo
  • [x] EqualNullSafe
  • [ ] EqualNull
  • [x] LessThan
  • [x] LessThanOrEqual
  • [x] GreaterThan
  • [x] GreaterThanOrEqual

Null expressions

  • [x] Coalesce

Aggregate expressions

  • [x] Count
  • [x] Sum
  • [x] Max
  • [x] Min
  • [x] Avg
  • [x] First
  • [x] Last
  • [x] BitAnd
  • [x] BitOr
  • [x] BitXor
  • [ ] Covariance: #234

Others

  • [x] Cast

...

Describe the potential solution

No response

Additional context

No response

viirya avatar Mar 31 '24 06:03 viirya

The list of Spark expression can be found https://spark.apache.org/docs/latest/api/sql/index.html

comphead avatar Apr 04 '24 23:04 comphead

As an umbrella issue, If you are going to make a list of frequently used expressions, maybe you can add the hash expressions(which I created #205 earlier) as one of the categories/list.

advancedxy avatar Apr 07 '24 03:04 advancedxy

I added hash expressions. Free feel to edit the expression list in the issue description to add more expressions.

viirya avatar Apr 07 '24 05:04 viirya

I was thinking to add Spark Scan OneRowRelation scan support in addition to Parquet Scan. This will allow Comet be enabled when running queries like

select sqrt(2) from (select 1 union all select 2)

Once its done, we can just download all the queries from https://spark.apache.org/docs/latest/api/sql/index.html and run it automatically and see the coverage. How does it sound?

comphead avatar Apr 07 '24 17:04 comphead

I was thinking to add Spark Scan OneRowRelation scan support in addition to Parquet Scan.

I am adding RowToColumnar support in #206. Once it's done, I think it's trivial to add RDDScanExec(which OneRowRelation is translated to as PhysicalPlan) support.

we can just download all the queries from https://spark.apache.org/docs/latest/api/sql/index.html and run it automatically and see the coverage. How does it sound?

That sounds like a great idea.

advancedxy avatar Apr 08 '24 04:04 advancedxy

Another potential solution we can do is to transform OneRowRelation and to use DF PlaceholderRowExec. I'll check if its doable

comphead avatar Apr 09 '24 19:04 comphead

Another potential solution we can do is to transform OneRowRelation and to use DF PlaceholderRowExec

Of course, that would be more performant and straightforward.

advancedxy avatar Apr 10 '24 12:04 advancedxy

It sounds good if we can automatically test expression coverage, although I'm not sure if it is easy to do.

viirya avatar Apr 10 '24 17:04 viirya

I have an idea to do that. Planning to create a draft soon. Basic idea is to grab queries from https://spark.apache.org/docs/latest/api/sql/index.html and create a separate task in Comet which will check if the Comet being triggered.

It will be easier to do if Comet supported OneRowRelation but even without it there is a workaround. Once all builtinn function queries done there should be some HTML with total results

comphead avatar Apr 10 '24 17:04 comphead

Basic idea is to grab queries from https://spark.apache.org/docs/latest/api/sql/index.html and create a separate task in Comet which will check if the Comet being triggered.

Hmmm, I think you can get expression example usage directly from its annotated class. See 'org.apache.spark.sql.expressions.ExpressionInfoSuite' for how to get examples directly.

advancedxy avatar Apr 11 '24 03:04 advancedxy

Basic idea is to grab queries from https://spark.apache.org/docs/latest/api/sql/index.html and create a separate task in Comet which will check if the Comet being triggered.

Hmmm, I think you can get expression example usage directly from its annotated class. See 'org.apache.spark.sql.expressions.ExpressionInfoSuite' for how to get examples directly.

Great idea, I think I'm able to fetch it as here https://github.com/apache/spark/blob/6fdf9c9df545ed50acbce1ec874625baf03d4d2e/sql/core/src/test/scala/org/apache/spark/sql/expressions/ExpressionInfoSuite.scala#L166

comphead avatar Apr 11 '24 15:04 comphead