dlt icon indicating copy to clipboard operation
dlt copied to clipboard

support nested types for arrow / parquet

Open rudolfix opened this issue 7 months ago • 0 comments

Background Currently nested types (arrays, maps, records), if present in arrow schema, are mapped into complex dlt type. complex types are stored in JSON types in the destination (if available) or in STRING. We want to support nested types in dlt type system, convert from/to arrow schemas and to various SQL dialect. With such support in place, new json normalizer may be implemented which generates nested types instead of child tables.

Here's a list of proposed changes

  1. rename complex to json to clearly indicate what this type represents. leave complex as deprecated (we may need to add schema engine migrations)
  2. add new type called nested and fields hint that define arrow-like schema. here's some research needed (should we just adopt arrow schemas?). Our pyarrow helpers must both generate and parse fields. Optionally we may upgrade Pydantic helpers to generate nested types as well.
  3. extend type comparison (to detect if schema must evolve) to include fields. we'll need some canonical form for the schema (ie. serialized to json or one of SQL representations)
  4. Evolving nested types. Some destinations (BigQuery) allow to add new fields to nested fields. Most (probably) don't. Initially we may support only the latter and always create "variant" column when types differ (which we now support)
  5. We must extend type mappers to generate SQL representation of fields. Here we may consider sqlglot which works for at least some of our destinations. Initially we support: duckdb, bigquery and snowflake

rudolfix avatar Jul 14 '24 21:07 rudolfix