materialize
materialize copied to clipboard
Implement `avro_decode`
Feature request
We want a function avro_decode(data, reader_schema, writer_schema)
that allows the user to decode arbitrary avro values in the data.
This is weird/hard for a few reasons:
-
The
reader_schema
determines the shape of the output. One possible solution is to just always emit a record type which has the proper fields (determined by the reader schema). But then the user is duplicating the schema information (it exists once in thereader_schema
field, and once in their downstream view that extracts the record fields and does something useful with them). Another option is to makereader_schema
not be data-derived, and instead have a family of functions parameterized on (constant) reader schema, likeavro_decode[reader_schema](data, writer_schema)
which feels like it doesn't match SQL semantics well (@benesch probably has thoughts here). Maybe the functions can be user-created and live in the catalog?CREATE AVRO DECODER my_decode_fn WITH READER SCHEMA ...
-
We also need a CSR source for folks to get their writer schemas from. This will probably involve some hack like polling the CSR every few seconds since AFAIK it doesn't offer a push-based API.
-
The function, though logically a pure function, in practice needs to involve a stateful operator that caches resolved schemas, for performance reasons. This caching should be done at the level of schema IDs, not schemas themselves, to avoid having to check equality of entire schemas on every message.
-
The optimizer needs to be able to turn
SELECT avro_decode[reader_schema](a.data[5:], csr.schema) FROM a JOIN csr ON a.data[1:4] = csr.id
into something that invokes that ID-based stateful caching operator and then passes its result toavro_decode_core(a.data[5:], resolved_schema)
.