datafusion-comet icon indicating copy to clipboard operation
datafusion-comet copied to clipboard

Improve coverage of tests for array expressions

Open andygrove opened this issue 11 months ago • 4 comments

What is the problem the feature request solves?

I would like to start implementing some shared test code for array expressions to make it easier to test for edge cases such as null or empty arrays.

The following functions need improved tests:

  • [ ] array_remove
  • [ ] array_contains
  • [ ] array_append
  • [ ] array_intersect
  • [ ] array_join
  • [ ] array_repeat
  • [ ] array_compact
  • [ ] arrays_overlap

Describe the potential solution

No response

Additional context

No response

andygrove avatar Jan 12 '25 19:01 andygrove

We should also start testing with all supported data types for the array expressions.

andygrove avatar Jan 13 '25 15:01 andygrove

Here is an example of an issue that we should have been able to catch using fuzz testing:

https://github.com/apache/datafusion-comet/issues/1307

andygrove avatar Jan 19 '25 00:01 andygrove

Would it help to extend makeParquetFileAllTypes or create a new method which creates a parquet file that contains more complex structures like lists (with all datatypes and structs because the current tests usually test the implementation with arrays only containing 1-3 elements) and structs to improve the tests with additional test cases?

However I think that requires complex type support for the comet native parquet reader.

NoeB avatar Jan 19 '25 15:01 NoeB

Would it help to extend makeParquetFileAllTypes or create a new method which creates a parquet file that contains more complex structures like lists (with all datatypes and structs because the current tests usually test the implementation with arrays only containing 1-3 elements) and structs to improve the tests with additional test cases?

However I think that requires complex type support for the comet native parquet reader.

Yes, exactly. I have started down this path in https://github.com/apache/datafusion-comet/pull/1308

As you said, until we have complex type support we cannot generate Parquet files containing complex types. For now I had to work around this by generating a Parquet file with primitive types and then use the array expression in SQL to create arrays to pass into array_remove.

Complex type support for Parquet is coming very soon though.

I did try disabling native Parquet scans and use the COMET_CONVERT_FROM_PARQUET_ENABLED feature, but this failed with an error that I need to investigate.

andygrove avatar Jan 19 '25 16:01 andygrove