materialize icon indicating copy to clipboard operation
materialize copied to clipboard

Implement `avro_decode`

Open umanwizard opened this issue 2 years ago • 0 comments

Feature request

We want a function avro_decode(data, reader_schema, writer_schema) that allows the user to decode arbitrary avro values in the data.

This is weird/hard for a few reasons:

  • The reader_schema determines the shape of the output. One possible solution is to just always emit a record type which has the proper fields (determined by the reader schema). But then the user is duplicating the schema information (it exists once in the reader_schema field, and once in their downstream view that extracts the record fields and does something useful with them). Another option is to make reader_schema not be data-derived, and instead have a family of functions parameterized on (constant) reader schema, like avro_decode[reader_schema](data, writer_schema) which feels like it doesn't match SQL semantics well (@benesch probably has thoughts here). Maybe the functions can be user-created and live in the catalog? CREATE AVRO DECODER my_decode_fn WITH READER SCHEMA ...

  • We also need a CSR source for folks to get their writer schemas from. This will probably involve some hack like polling the CSR every few seconds since AFAIK it doesn't offer a push-based API.

  • The function, though logically a pure function, in practice needs to involve a stateful operator that caches resolved schemas, for performance reasons. This caching should be done at the level of schema IDs, not schemas themselves, to avoid having to check equality of entire schemas on every message.

  • The optimizer needs to be able to turn SELECT avro_decode[reader_schema](a.data[5:], csr.schema) FROM a JOIN csr ON a.data[1:4] = csr.id into something that invokes that ID-based stateful caching operator and then passes its result to avro_decode_core(a.data[5:], resolved_schema).

umanwizard avatar Aug 08 '22 22:08 umanwizard