Figure out what to do with `table_column` catalog table and bulk schema loading in general
Currently we're not really using our `Schema` for anything but the `to_column_names_types` call when persisting the columns to the `table_column` metadata table. So it's possible to remove that `Schema` altogether and just use the underlying `arrow_schema` call (though the conversion could be extracted to a separate function).
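As a rough illustration, the extracted helper could derive the persisted (name, type) pairs straight from the Arrow schema. This is a minimal sketch only; the function name mirrors the existing method, but the `Debug`-based type string is a placeholder for whatever field serialization format we actually use:

```rust
use datafusion::arrow::datatypes::Schema as ArrowSchema;

/// Hypothetical free-function replacement for `Schema::to_column_names_types`:
/// derive the (column name, column type) pairs to persist into the
/// `table_column` metadata table directly from an Arrow schema.
fn to_column_names_types(schema: &ArrowSchema) -> Vec<(String, String)> {
    schema
        .fields()
        .iter()
        // The Debug rendering stands in for the real field serialization.
        .map(|field| (field.name().clone(), format!("{:?}", field.data_type())))
        .collect()
}
```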
On a more general level, we also currently don't use anything from our `table_column` catalog table. When fetching a schema for a given table, such as in `information_schema.columns` or when calling `TableProvider::schema` somewhere in the code (which is what DF uses internally for `information_schema.columns` queries as well), we always rely on the Delta table's schema, which is ultimately reconstructed from the logs. `information_schema.columns` in particular will pose a problem at some point; see here: https://github.com/splitgraph/seafowl/blob/40b1158a90121422e66acbc66e4d536f6081b6d7/src/catalog.rs#L285-L293
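Whichever route we take, what `information_schema.columns` ultimately needs is a map from table name to Arrow schema built without replaying each table's Delta log. As a hypothetical illustration (the row shape below is assumed for the example, not Seafowl's actual catalog layout), bulk loading boils down to grouping one flat result set into per-table schemas:

```rust
use std::collections::HashMap;
use std::sync::Arc;

use datafusion::arrow::datatypes::{DataType, Field, Schema, SchemaRef};

/// Assumed shape of a row fetched by a single bulk query against the
/// catalog (illustrative only, not Seafowl's actual `table_column` DDL).
struct ColumnRow {
    table_name: String,
    column_name: String,
    data_type: DataType,
}

/// Group a flat result set into per-table Arrow schemas, so that an
/// `information_schema.columns` query never has to touch the Delta logs.
fn schemas_by_table(rows: Vec<ColumnRow>) -> HashMap<String, SchemaRef> {
    let mut fields: HashMap<String, Vec<Field>> = HashMap::new();
    for row in rows {
        fields
            .entry(row.table_name)
            .or_default()
            .push(Field::new(&row.column_name, row.data_type, true));
    }
    fields
        .into_iter()
        .map(|(table, fields)| (table, Arc::new(Schema::new(fields))))
        .collect()
}
```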
The solution I outlined in that comment essentially amounts to adding the ability to bulk-load Delta table schemas (which would involve changes in delta-rs and probably DataFusion). A potentially better solution is for us to thinly wrap the Delta table inside our own table, use our own (bulk-loaded) catalog info in `TableProvider::schema`, and only resolve `TableProvider::scan`s using the wrapped Delta table, as sketched below. The main drawback there is the potential mismatch and double-tracking of schemas (in our catalog and in the Delta logs), which might not be that bad.
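A minimal sketch of that wrapper, assuming the `TableProvider` trait shape in the DataFusion version we pin (signatures vary across versions) and delta-rs's `datafusion` feature; the `CatalogBackedDeltaTable` name and field layout are made up for illustration:

```rust
use std::any::Any;
use std::sync::Arc;

use async_trait::async_trait;
use datafusion::arrow::datatypes::SchemaRef;
use datafusion::datasource::{TableProvider, TableType};
use datafusion::error::Result;
use datafusion::execution::context::SessionState;
use datafusion::logical_expr::Expr;
use datafusion::physical_plan::ExecutionPlan;
use deltalake::DeltaTable;

/// Hypothetical thin wrapper around a Delta table: `schema()` answers
/// from our bulk-loaded catalog copy, `scan()` delegates to delta-rs.
struct CatalogBackedDeltaTable {
    /// Schema bulk-loaded from the `table_column` catalog table.
    catalog_schema: SchemaRef,
    /// The wrapped Delta table, only touched for actual scans.
    inner: Arc<DeltaTable>,
}

#[async_trait]
impl TableProvider for CatalogBackedDeltaTable {
    fn as_any(&self) -> &dyn Any {
        self
    }

    fn schema(&self) -> SchemaRef {
        // Served from our catalog, so e.g. information_schema.columns
        // never has to reconstruct the schema from the Delta logs.
        self.catalog_schema.clone()
    }

    fn table_type(&self) -> TableType {
        TableType::Base
    }

    async fn scan(
        &self,
        state: &SessionState,
        projection: Option<&Vec<usize>>,
        filters: &[Expr],
        limit: Option<usize>,
    ) -> Result<Arc<dyn ExecutionPlan>> {
        // Only here do we resolve the Delta table itself (delta-rs
        // implements TableProvider for DeltaTable).
        self.inner.scan(state, projection, filters, limit).await
    }
}
```

With this shape, DataFusion gets the catalog-backed schema for planning and `information_schema` queries, while only an actual read pays the cost of loading the Delta log.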
There's also the minor matter of format: currently we store the fields using the unofficial Arrow JSON representation, while our storage layer has its own schema/field types. There's also a possibility we'll want to introduce our own field format (to facilitate better compatibility with Postgres?), in which case wrapping the Delta table would make even more sense.