Allow Avro decoding to (optionally) continue on uninterpretable data.
Feature request
Imagine a writer schema that contains a union of several types. It's reasonable to want to create a view that selects elements where a particular one of the types is inhabited; the natural way to express this is with a reader schema that replaces the union with the concrete type in question.
The Avro spec states that this should succeed in schema resolution, but signal an error at decode time when a record that doesn't match the concrete type is detected. That's exactly what we do, but since our mechanism for "signaling an error" is to permanently wedge the source, it's not very useful in practice. Our current behavior is a sane default, but we should slightly bend the spec by introducing an option that tells the decoder to simply ignore non-matching records, making it useful for this kind of "demuxing a union into several views" use case.
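For concreteness, here's a minimal sketch of such a schema pair (the record and field names are hypothetical). A writer schema whose value field is a union of two branches:

{
  "type": "record",
  "name": "Envelope",
  "fields": [
    {"name": "value", "type": [
      {"type": "record", "name": "A", "fields": [{"name": "a", "type": "long"}]},
      {"type": "record", "name": "B", "fields": [{"name": "b", "type": "string"}]}
    ]}
  ]
}

and a reader schema that narrows the union to the A branch:

{
  "type": "record",
  "name": "Envelope",
  "fields": [
    {"name": "value", "type": {"type": "record", "name": "A", "fields": [{"name": "a", "type": "long"}]}}
  ]
}

Resolving the reader against the writer succeeds, but decoding any record whose union is inhabited by B signals an error; under the proposed option, such records would simply be skipped instead.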
Would we get this for free with a combination of #14133 and #6367?
Yes, but this is about 45000x easier than #14133, and independently useful, so we might want to support it directly rather than waiting for that.
Makes sense. If we do that, let's make sure we put the option next to FORMAT AVRO:
CREATE SOURCE
  FROM KAFKA CONNECTION ...
  FORMAT AVRO (OMIT INVALID UNIONS)
  WITH (SIZE 'small')
and not
CREATE SOURCE
  FROM KAFKA CONNECTION ...
  WITH (SIZE 'small', OMIT INVALID AVRO UNIONS)
Did we get rid of the WITH options in different places, and now we only use the WITH keyword on the options that apply to the entire DDL statement?
The code is still a bit of a mess, but any option that is in the wrong place is gated behind unsafe mode.
Just had a painful debugging experience with a prospect that would have been massively easier with a feature like this. The situation was: a Kafka Debezium source with a JSON field that got TOASTed, resulting in a JSON value of __debezium_unavailable_value rather than valid JSON. The Materialize Avro decoder choked on this. Would have been really nice to just import those invalid JSON values as text instead of JSON.
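To illustrate the shape of the problem (the field names here are hypothetical; the placeholder string is Debezium's default TOAST marker), the decoded change event carries the literal placeholder where a JSON document is expected:

{"after": {"id": 42, "metadata": "__debezium_unavailable_value"}}

so any attempt to parse metadata as JSON fails.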
I think that’s a slightly different feature than this issue is describing — you’re asking to attempt to decode logical types as the underlying physical type, rather than skipping records for which decoding fails.
Would your debugging session have been easier if we printed out the contents of the field in the error message when JSON decoding fails, or printed out the Kafka offset at which decoding fails?
I guess! Agreed that the existing proposed option is OMIT INVALID AVRO UNIONS, but both TREAT JSON AS STRING and OMIT RECORDS WITH BAD JSON seem like options we could consider adding.
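For concreteness, a hypothetical sketch of how one of those might look, following the FORMAT AVRO option placement agreed on above (none of this syntax exists today):

CREATE SOURCE
  FROM KAFKA CONNECTION ...
  FORMAT AVRO (TREAT JSON AS STRING)
  WITH (SIZE 'small')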
Yes and yes. Though the offset wasn't crucial, because we had the key that failed. Would also have helped if the error indicated the name of the field within the record that failed to decode.
I thought at some point @pH14 did that as one of his first or second tasks. Maybe I’m misremembering the details, or maybe it has bit rotted since then.
This came up again recently with a customer: https://materializeinc.slack.com/archives/CU7ELJ6E9/p1688829511597169?thread_ts=1688680390.639199&cid=CU7ELJ6E9
We're not sure of the exact cause of the uninterpretable data, but we suspect a Debezium JSON field, as we've seen before.