Support different schemas when scanning multiple files
Description
Given a parquet file with column a, and another parquet with column b, I want to be able to query both in the same parquet_scan call and have their schemas be unioned.
A use case here might be schema evolution, e.g. a new column gets added to parquet files that are generated at a later point in time.
And the result could be:
---------------
| a | b |
---------------
| 1 | NULL |
| NULL | 2 |
---------------
I think we'd want to have this be a mode or flag that we set on a table. The term of art for this might be "wide column store".
Obviously, a reasonable fit for document databases (and maybe we can handle this by default there). In general the engine could just say "if a field doesn't exist for a record, it's null" and proceed from there.
It implies a very different implementation, but this might effectively be something that roughly resembles a full outer join (maybe full outer self join?). This might work well as a property for some kind of synthetic table that combined a bunch of different data files or data sources (not just in the schema evolution sense.)
I think we'd want to have this be a mode or flag that we set on a table.
I was thinking the same thing. Idk if it should be a global setting, or a parameter on the table function
global setting:
set UNION_SCHEMAS=true; -- for the sake of bikeshedding, lets just call the flag: `UNION_SCHEMAS`
select * from parquet_scan(['file1.parquet', 'file2.parquet'])
per function:
select * from parquet_scan(
['file1.parquet', 'file2.parquet'],
union_schemas => true
)
I definitely like the per-function case.