[EPIC] Complex Type Support

Open andygrove opened this issue 1 year ago • 0 comments

What is the problem the feature request solves?

We would like Comet to fully support complex types (arrays, structs, and maps). This issue is for tracking all of the individual issues.

Google doc: https://docs.google.com/document/d/1eiDFEScPjxBMahJW6lmBI8JjVlI6CwhiJgkTSsTvPVY/edit?usp=sharing

Reading Complex Types from Parquet/Iceberg, Part 1

We now have new native_datafusion and native_iceberg_compat scans that use DataFusion's ParquetExec, which already supports complex types. These new scans are not fully implemented yet, and the first step is to fix all tests that fail when these scans are made the default.
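To reproduce the failing tests locally, it helps to switch the scan implementation explicitly. The sketch below only builds the `--conf` arguments for `spark-submit`; the config key `spark.comet.scan.impl` and the helper name `scan_conf_args` are assumptions made for illustration, so check the Comet configuration guide for the exact key in your version.

```python
from typing import List

# Assumed config key for choosing the scan implementation (verify against the Comet docs).
SCAN_IMPL_KEY = "spark.comet.scan.impl"

# Scan names as they appear in this issue.
SCAN_IMPLS = ("native_comet", "native_datafusion", "native_iceberg_compat")


def scan_conf_args(scan_impl: str) -> List[str]:
    """Hypothetical helper: return spark-submit arguments selecting a scan."""
    if scan_impl not in SCAN_IMPLS:
        raise ValueError(f"unknown scan implementation: {scan_impl!r}")
    return ["--conf", f"{SCAN_IMPL_KEY}={scan_impl}"]


if __name__ == "__main__":
    # Print a command line that would run with the native_datafusion scan.
    print(" ".join(["spark-submit", *scan_conf_args("native_datafusion")]))
```

Rejecting unknown names up front avoids silently running a test suite against the default scan because of a typo.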

The goal is to complete this section for the 0.7.0 release, before the end of March.

  • [x] Reduce code duplication between native_datafusion and native_iceberg_compat
  • [ ] Add Parquet reader metrics for both paths
  • [ ] Schema adapter handling of timestamps (including int96, timestamp_ntz)
  • [ ] Schema adapter handling of decimals (decimal128 config)
  • [ ] https://github.com/apache/datafusion-comet/issues/1441

Reading Complex Types from Parquet/Iceberg, Part 2

Aiming for the 0.8.0 release.

  • Support reading complex types with native_datafusion scan
    • [x] Array
    • [x] Struct
    • [ ] Map
  • Support reading complex types with native_iceberg_compat scan
    • [x] Array
    • [x] Struct
    • [ ] Map
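The checklist above can be read as a support matrix. The following is a minimal sketch, not Comet code, of how a nested column type could be checked against that matrix: the tuple encoding of types, the name `can_read`, and the assumption that primitive types are always readable are all hypothetical. The `True`/`False` values simply mirror the ticked and unticked boxes at the time of writing.

```python
# Status mirrored from the Part 2 checklist: True means the box is ticked.
READ_SUPPORT = {
    "native_datafusion": {"array": True, "struct": True, "map": False},
    "native_iceberg_compat": {"array": True, "struct": True, "map": False},
}


def can_read(scan: str, dtype) -> bool:
    """Hypothetical check: can `scan` read a column of type `dtype`?

    Types are encoded as a string for primitives (for example "int") or a
    tuple for complex types: ("array", elem), ("map", key, value),
    ("struct", field_type, ...).
    """
    if isinstance(dtype, str):
        # Assumption: every primitive type is readable.
        return True
    kind, *children = dtype
    if not READ_SUPPORT[scan].get(kind, False):
        return False
    # A complex type is readable only if all nested types are readable too.
    return all(can_read(scan, child) for child in children)


if __name__ == "__main__":
    nested_ok = ("array", ("struct", "int", "string"))
    nested_map = ("array", ("map", "string", "int"))
    print(can_read("native_datafusion", nested_ok))
    print(can_read("native_datafusion", nested_map))
```

The recursion matters because a single unsupported map nested inside an otherwise supported array or struct is enough to rule out the whole column.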

Reading Complex Types from Parquet/Iceberg, Part 3

These items may not be relevant to all users, but some environments need more work before the new ParquetExec scans can be used. Comet's current default native_comet scan is JVM-based and leverages Hadoop data source functionality that is not available in DataFusion.

  • Wrap Hadoop file readers in JNI so that we can call from Rust, to support use cases such as encryption
  • https://github.com/apache/datafusion-comet/issues/1082
  • Custom Authentication for cloud storage
  • HDFS Support

Supporting expressions that operate on complex types

  • Expressions
    • [ ] Array
      • [ ] https://github.com/apache/datafusion-comet/issues/1042
      • [ ] Update to_json to support arrays
      • [ ] Implement CAST from array to string
    • [ ] Struct
      • [x] https://github.com/apache/datafusion-comet/issues/815
      • [x] https://github.com/apache/datafusion-comet/issues/814
      • [ ] https://github.com/apache/datafusion-comet/issues/813
    • [ ] Map
      • [ ] https://github.com/apache/datafusion-comet/issues/1044
      • [ ] Implement CAST from map to string
      • [ ] Update to_json to support maps
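For the CAST-to-string and to_json items, the target is whatever Spark itself produces. As a rough reference for writing expected values in tests, the sketch below mimics what I understand to be Spark's default (non-legacy) rendering in Spark 3.x: arrays as `[a, b]`, maps as `{k -> v}`, and nulls as `null`. The function names are hypothetical, and scalars are rendered with Python's `str`, which will not match Spark for every type (for example decimals or timestamps).

```python
import json


def cast_to_string(value) -> str:
    """Approximate Spark's CAST(<complex value> AS STRING) for lists and dicts."""
    if value is None:
        return "null"
    if isinstance(value, bool):
        # Spark prints booleans in lowercase.
        return "true" if value else "false"
    if isinstance(value, dict):
        entries = ", ".join(
            f"{cast_to_string(k)} -> {cast_to_string(v)}" for k, v in value.items()
        )
        return "{" + entries + "}"
    if isinstance(value, (list, tuple)):
        return "[" + ", ".join(cast_to_string(v) for v in value) + "]"
    # Assumption: Python's str() is close enough for simple scalars.
    return str(value)


def to_json_compact(value) -> str:
    """Approximate to_json output: compact JSON without spaces after separators."""
    return json.dumps(value, separators=(",", ":"))


if __name__ == "__main__":
    print(cast_to_string([1, 2, None]))          # [1, 2, null]
    print(cast_to_string({"a": 1, "b": None}))   # {a -> 1, b -> null}
    print(to_json_compact([1, 2, 3]))            # [1,2,3]
    print(to_json_compact({"a": [1, 2]}))        # {"a":[1,2]}
```

Any native implementation should still be validated against real Spark output, since this sketch only covers the simple cases shown.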

Performance

  • Create benchmarks for complex types

Testing

  • https://github.com/apache/datafusion-comet/issues/1486
  • https://github.com/apache/datafusion-comet/issues/1489
  • Fuzz testing
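One ingredient of fuzz testing complex types is a generator of random nested schemas. The following is a minimal sketch under stated assumptions, not the approach used in the linked issues: the function name `random_type`, the list of primitives, the 40% leaf probability, and the restriction of map keys to primitives are all illustrative choices. It emits Spark DDL-style type strings such as `array<struct<f0:int>>`.

```python
import random

# Illustrative subset of primitive types; extend as needed.
PRIMITIVES = ["int", "bigint", "double", "string", "boolean"]


def random_type(rng: random.Random, max_depth: int) -> str:
    """Return a random DDL-style type string nested at most `max_depth` levels."""
    # Stop at the depth limit, or early with 40% probability, to produce a leaf.
    if max_depth <= 0 or rng.random() < 0.4:
        return rng.choice(PRIMITIVES)
    kind = rng.choice(["array", "struct", "map"])
    if kind == "array":
        return f"array<{random_type(rng, max_depth - 1)}>"
    if kind == "map":
        # Simplification: keys are always primitive; only values nest further.
        key = rng.choice(PRIMITIVES)
        return f"map<{key},{random_type(rng, max_depth - 1)}>"
    num_fields = rng.randint(1, 3)
    fields = ",".join(
        f"f{i}:{random_type(rng, max_depth - 1)}" for i in range(num_fields)
    )
    return f"struct<{fields}>"


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so failures can be reproduced
    for _ in range(5):
        print(random_type(rng, max_depth=3))
```

Passing an explicit seeded `random.Random` keeps each run reproducible, which makes it easier to turn a fuzz failure into a regression test.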

Older / related issues:

  • https://github.com/apache/datafusion-comet/issues/1040
  • https://github.com/apache/datafusion-comet/issues/434

Describe the potential solution

No response

Additional context

No response

andygrove · Oct 30 '24 14:10