[EPIC] Improved support for nested / structured types (`Struct`, `List`, `ListArray`, and other Composite types)

Open alamb opened this issue 3 years ago • 10 comments

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

This ticket is designed to capture the work needed to properly support Arrow Struct types in DataFusion.

https://arrow.apache.org/datafusion/user-guide/sql/sql_status.html says that nested types are not supported. They are not fully supported, but parts of the support are already present, such as a way to serialize them via ArrowWriter and the field["nested_field"] access syntax.

Describe the solution you'd like

Research, then describe / implement what else remains for proper support.

Array (ListArray) support:

  • [ ] https://github.com/apache/arrow-datafusion/issues/6980
  • [ ] https://github.com/apache/arrow-datafusion/issues/6560
  • [x] https://github.com/apache/arrow-datafusion/issues/6555
  • [ ] #9252

Map (MapArray) support:

  • [x] https://github.com/apache/arrow-datafusion/issues/8262

Struct (StructArray) support:

  • https://github.com/apache/arrow-datafusion/issues/5861
  • https://github.com/apache/arrow-datafusion/issues/9820
  • #10207
  • https://github.com/apache/datafusion/issues/10264

Union (UnionArray) support:

  • #10206

Other

Known issues so far:

  • [x] https://github.com/apache/arrow-datafusion/issues/2179 from @Cheappie
  • [x] https://github.com/apache/arrow-datafusion/issues/2043 from @lquerel
  • [ ] https://github.com/apache/arrow-datafusion/issues/3617 from @kesavkolla
  • [ ] https://github.com/apache/arrow-datafusion/issues/6074
  • [ ] https://github.com/apache/arrow-datafusion/issues/1222
  • [ ] https://github.com/apache/arrow-datafusion/issues/2581
  • [x] https://github.com/apache/arrow-datafusion/discussions/6446
  • [x] https://github.com/apache/arrow-datafusion/issues/6075
  • [x] https://github.com/apache/arrow-datafusion/issues/6119
  • [x] https://github.com/apache/arrow-datafusion/issues/6561
  • [x] https://github.com/apache/arrow-datafusion/issues/6556
  • [x] https://github.com/apache/arrow-datafusion/issues/6557
  • [ ] https://github.com/apache/arrow-datafusion/issues/6559
  • [x] https://github.com/apache/arrow-datafusion/issues/6603
  • [ ] https://github.com/apache/arrow-datafusion/issues/6602
  • [x] https://github.com/apache/arrow-datafusion/issues/6598
  • [ ] https://github.com/apache/arrow-datafusion/issues/6631
  • [ ] https://github.com/apache/arrow-datafusion/issues/3617
  • [x] https://github.com/apache/arrow-datafusion/issues/6743
  • [ ] https://github.com/apache/arrow-datafusion/issues/7012
  • [ ] https://github.com/apache/arrow-datafusion/issues/8334

alamb avatar Apr 24 '22 12:04 alamb

This https://github.com/apache/arrow-datafusion/blob/master/datafusion/core/src/physical_plan/file_format/mod.rs#L238 is one reason for errors related to column projection. It compares the complete enum, so it fails when the fields appear in a different order.

Arrow has a method to compare data types (https://github.com/apache/arrow-rs/blob/master/arrow/src/datatypes/datatype.rs#L674). I think this method should be made public and used in the code above.

Currently datafusion uses match_field_names (default true), https://github.com/apache/arrow-rs/blob/master/arrow/src/record_batch.rs#L153, which causes the error.
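
To illustrate the difference between comparing the complete enum and comparing with field order ignored, here is a minimal sketch. The `DataType` and `Field` types below are simplified stand-ins written for this example, not the arrow-rs types, and `equals_ignoring_field_order` is a hypothetical helper name. It also assumes field names are unique within a struct.

```rust
// Stand-in types for illustration only; these are NOT arrow-rs's DataType/Field.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Boolean,
    Float64,
    Struct(Vec<Field>),
}

#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: DataType,
}

impl Field {
    fn new(name: &str, data_type: DataType) -> Self {
        Field { name: name.to_string(), data_type }
    }
}

/// Hypothetical helper: returns true when two types are equal up to the
/// order of struct fields. Assumes field names are unique within a struct.
fn equals_ignoring_field_order(a: &DataType, b: &DataType) -> bool {
    match (a, b) {
        (DataType::Struct(fields_a), DataType::Struct(fields_b)) => {
            // Same number of fields, and every field of `a` has a same-named
            // field in `b` whose type matches recursively.
            fields_a.len() == fields_b.len()
                && fields_a.iter().all(|fa| {
                    fields_b.iter().any(|fb| {
                        fa.name == fb.name
                            && equals_ignoring_field_order(&fa.data_type, &fb.data_type)
                    })
                })
        }
        // Non-struct types (or a struct vs. a non-struct) use plain equality.
        _ => a == b,
    }
}
```

With this sketch, two struct types holding the same fields in a different order are unequal under the derived `==` but equal under `equals_ignoring_field_order`, which is the behaviour the projection check would need.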

nl5887 avatar Jun 09 '22 20:06 nl5887

Thanks for the investigation @nl5887 -- that definitely sounds plausible. Feel free to file a PR with proposed changes -- we would love to review them.

alamb avatar Jun 10 '22 10:06 alamb

This one is also related: https://github.com/apache/arrow-datafusion/issues/2581

nl5887 avatar Jun 26 '22 13:06 nl5887

Reminder to write docs: #1222

tv42 avatar Feb 22 '23 22:02 tv42

Potential to add to list #7012

alexwilcoxson-rel avatar Aug 30 '23 22:08 alexwilcoxson-rel

We are starting to make progress on struct support --

There is a PR up to support named_struct https://github.com/apache/arrow-datafusion/pull/9743 and work afoot to support nicer literal syntax: https://github.com/apache/arrow-datafusion/issues/9820 🚀

alamb avatar Mar 27 '24 12:03 alamb

Hi, I think unnest support for structs could be an item in this epic, right?

toaiduongdh avatar Apr 27 '24 04:04 toaiduongdh

> Hi, I think unnest support for structs could be an item in this epic, right?

That would make sense to me -- is there a ticket that describes what this means?

alamb avatar Apr 27 '24 10:04 alamb

I created a ticket: https://github.com/apache/datafusion/issues/10264

duongcongtoai avatar Apr 27 '24 11:04 duongcongtoai

> I created a ticket: #10264

Thank you. I added this to the list in the ticket description.

alamb avatar Apr 30 '24 13:04 alamb

I added an issue to support recursive unnest: https://github.com/apache/datafusion/issues/10660. I think it should belong to this epic.

duongcongtoai avatar May 25 '24 06:05 duongcongtoai

> I added an issue to support recursive unnest: #10660. I think it should belong to this epic.

Added

alamb avatar May 25 '24 11:05 alamb

I added an issue to check for duplicate or null field names in structs: https://github.com/apache/datafusion/issues/11438

goldmedal avatar Jul 12 '24 15:07 goldmedal

I think #11445 is related to this epic

Throne3d avatar Jul 13 '24 19:07 Throne3d

> I think #11445 is related to this epic

Thank you -- added

alamb avatar Jul 14 '24 15:07 alamb

Right now DataFusion doesn't support struct evolution very well. Imagine you have a struct named customData with a field someOptionEnabled in one parquet file. Later down the line, you add a new field newAddedOption to the customData struct in another parquet file. Currently, when you try to SELECT * FROM table, you'll get this error:

{"message":"Failed to collect DataFrame batches: Plan(\"Cannot cast file schema field customData of type Struct([Field { name: \\\"someOptionEnabled\\\", data_type: Boolean, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }]) to table schema field of type Struct([Field { name: \\\"someOptionEnabled\\\", data_type: Boolean, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: \\\"newAddedOption\\\", data_type: Float64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }])\")","status":"error"}

Feels like we should handle this more gracefully. cc @alamb

I'm happy to make contributions if someone can point me to the right places to look.
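
For what it's worth, here is a rough sketch of the behaviour I'd expect instead: when a file's struct is missing a field that the table schema has, fill it with null rather than failing the cast. Everything below is a simplified, row-at-a-time stand-in written for illustration. `Value`, `StructRow`, and `adapt_to_table_schema` are hypothetical names, not DataFusion or Arrow APIs, and a real implementation would work on columnar arrays.

```rust
use std::collections::HashMap;

// Hypothetical stand-in for a single struct field value.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Boolean(bool),
    Float64(f64),
    Null,
}

/// One struct value as read from a file: field name -> value.
type StructRow = HashMap<String, Value>;

/// Hypothetical helper: lays out a file's struct row in the table schema's
/// field order, filling fields the file does not have with `Value::Null`.
/// Returns an error if the file has a field the table schema does not know.
fn adapt_to_table_schema(row: &StructRow, table_fields: &[&str]) -> Result<Vec<Value>, String> {
    for name in row.keys() {
        if !table_fields.iter().any(|f| *f == name.as_str()) {
            return Err(format!("file field {} is not in the table schema", name));
        }
    }
    Ok(table_fields
        .iter()
        .map(|name| row.get(*name).cloned().unwrap_or(Value::Null))
        .collect())
}
```

Under these assumptions, a row from the older file that only has someOptionEnabled would come out as the pair (someOptionEnabled, null) under the newer table schema, while a row from the newer file passes through unchanged.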

TheBuilderJR avatar Aug 13 '24 20:08 TheBuilderJR

> Right now datafusion doesn't support struct evolution very well. Imagine you have a struct named customData with field someOptionEnabled in one parquet file, later down the line you add a new field newAddedOption to the customData struct in another parquet file. Currently when you try and SELECT * FROM table you'll get this error:
>
> {"message":"Failed to collect DataFrame batches: Plan(\"Cannot cast file schema field customData of type Struct([Field { name: \\\"someOptionEnabled\\\", data_type: Boolean, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }]) to table schema field of type Struct([Field { name: \\\"someOptionEnabled\\\", data_type: Boolean, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: \\\"newAddedOption\\\", data_type: Float64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }])\")","status":"error"}
>
> Feels like we should handle this more gracefully. cc @alamb

I agree

> I'm happy to make contributions if someone can point me to the right places to look.

My suggestion is to start by filing a ticket with a self-contained reproducer (either Rust code or SQL) that shows what you are trying to do.

This would likely become part of the tests for any code improvement we make, and it would give other contributors enough detail to point you to the right place in the code.

alamb avatar Aug 15 '24 11:08 alamb