glaredb icon indicating copy to clipboard operation
glaredb copied to clipboard

Support different schemas when scanning multiple files

Open scsmithr opened this issue 2 years ago • 3 comments

Description

Given a parquet file with column a, and another parquet with column b, I want to be able to query both in the same parquet_scan call and have their schemas be unioned.

A use case here might be schema evolution, e.g. a new column gets added to parquet files that are generated at a later point in time.

And the result could be:

---------------
|  a   |  b   |
---------------
|  1   | NULL |
| NULL |  2   |
---------------

scsmithr avatar Nov 07 '23 18:11 scsmithr

I think we'd want to have this be a mode or flag that we set on a table. The term of art for this might be "wide column store".

Obviously, a reasonable fit for document databases (and maybe we can handle this by default there). In general the engine could just say "if a field doesn't exist for a record, it's null" and proceed from there.

It implies a very different implementation, but this might effectively be something that roughly resembles a full outer join (maybe full outer self join?). This might work well as a property for some kind of synthetic table that combined a bunch of different data files or data sources (not just in the schema evolution sense.)

tychoish avatar Nov 07 '23 18:11 tychoish

I think we'd want to have this be a mode or flag that we set on a table.

I was thinking the same thing. Idk if it should be a global setting, or a parameter on the table function

global setting:

set UNION_SCHEMAS=true; -- for the sake of bikeshedding, lets just call the flag: `UNION_SCHEMAS`
select * from parquet_scan(['file1.parquet', 'file2.parquet'])

per function:

select * from parquet_scan(
  ['file1.parquet', 'file2.parquet'],
  union_schemas => true
)

universalmind303 avatar Nov 09 '23 17:11 universalmind303

I definitely like the per-function case.

tychoish avatar Nov 09 '23 20:11 tychoish