support nested types for arrow / parquet
Background
Currently nested types (arrays, maps, records), if present in an arrow schema, are mapped into the `complex` dlt type. `complex` types are stored in JSON types in the destination (if available) or in STRING.

We want to support nested types in the dlt type system, convert them from/to arrow schemas and to various SQL dialects. With such support in place, a new json normalizer may be implemented which generates nested types instead of child tables.
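As a minimal illustration (the table layout below is made up), this kind of arrow schema currently collapses all of its nested columns into `complex`:

```python
import pyarrow as pa

# an arrow schema with nested types: a struct, a list of structs and a map
schema = pa.schema([
    pa.field("id", pa.int64()),
    pa.field("address", pa.struct([
        pa.field("city", pa.string()),
        pa.field("zip", pa.string()),
    ])),
    pa.field("orders", pa.list_(pa.struct([
        pa.field("sku", pa.string()),
        pa.field("qty", pa.int32()),
    ]))),
    pa.field("labels", pa.map_(pa.string(), pa.string())),
])

# today the `address`, `orders` and `labels` columns all become the `complex`
# dlt type, i.e. they land in the destination as JSON (if supported) or STRING
```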
Here's a list of proposed changes:
- rename `complex` to `json` to clearly indicate what this type represents. Leave `complex` as a deprecated alias (we may need to add schema engine migrations)
- add a new type called `nested` and a `fields` hint that defines an arrow-like schema. Some research is needed here (should we just adopt arrow schemas?). Our `pyarrow` helpers must both generate and parse `fields` (see the first sketch after this list). Optionally we may upgrade the Pydantic helpers to generate nested types as well
- extend type comparison (to detect if the schema must evolve) to include `fields`. We'll need some canonical form for the schema (i.e. serialized to json or one of the SQL representations)
- evolve nested types. Some destinations (BigQuery) allow adding new fields to nested fields; most (probably) don't. Initially we may support only the latter and always create a "variant" column when types differ (which we already support)
- extend the type mappers to generate SQL representations of `fields`. Here we may consider `sqlglot`, which works for at least some of our destinations (see the second sketch after this list). Initially we support: duckdb, bigquery and snowflake