Allow Avro decoding to (optionally) continue on uninterpretable data.

Feature request

Imagine a writer schema that contains a union of several types. It's reasonable to want to create a view that selects only those elements where a particular one of the types is inhabited; this would be expressed by using a reader schema that replaces the union with the concrete type in question.
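
For concreteness, a minimal sketch of the schemas in question (the record and field names here are invented for illustration). A writer schema with a union-typed field:

{
  "type": "record",
  "name": "Event",
  "fields": [
    {"name": "payload", "type": ["string", "long"]}
  ]
}

and a reader schema that narrows the union to a single branch:

{
  "type": "record",
  "name": "Event",
  "fields": [
    {"name": "payload", "type": "string"}
  ]
}

Schema resolution between these two succeeds, but decoding a record whose payload was written as a long fails.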

The Avro spec states that this should succeed in schema resolution, but signal an error at decode time when a record that doesn't match the concrete type is encountered. That's exactly what we do, but since our mechanism for "signaling an error" is to permanently wedge the source, it's not very useful in practice. Our current behavior is a sane default, but we should slightly bend the spec by introducing an option that tells the decoder to simply ignore non-matching records, making it useful for this kind of "demuxing a union into several views" use case.

umanwizard avatar Aug 08 '22 22:08 umanwizard

Would we get this for free with a combination of #14133 and #6367?

benesch avatar Aug 09 '22 16:08 benesch

Yes, but this is about 45000x easier than #14133, and independently useful, so we might want to support it directly rather than waiting for that.

umanwizard avatar Aug 09 '22 16:08 umanwizard

Makes sense. If we do that, let's make sure we put the option next to FORMAT AVRO

CREATE SOURCE
FROM KAFKA CONNECTION ...
FORMAT AVRO (OMIT INVALID UNIONS)
WITH (SIZE 'small')

and not

CREATE SOURCE
FROM KAFKA CONNECTION ...
WITH (SIZE 'small', OMIT INVALID AVRO UNIONS)

benesch avatar Aug 10 '22 04:08 benesch

Did we get rid of the WITH options in different places, so that the WITH keyword is now only used for options that apply to the entire DDL statement?

umanwizard avatar Aug 10 '22 19:08 umanwizard

Did we get rid of the WITH options in different places, so that the WITH keyword is now only used for options that apply to the entire DDL statement?

The code is still a bit of a mess, but any option that is in the wrong place is gated behind unsafe mode.

benesch avatar Aug 11 '22 04:08 benesch

Just had a painful debugging experience with a prospect that would have been massively easier with a feature like this. The situation was: a Kafka Debezium source with a JSON field that got TOASTed, resulting in the literal string __debezium_unavailable_value rather than valid JSON. The Materialize Avro decoder choked on this. Would have been really nice to just import those invalid JSON values as text instead of JSON.
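
For illustration, a made-up sketch of the kind of payload involved (not the prospect's actual data; the field names are invented): a decoded record along the lines of

{"id": 42, "properties": "__debezium_unavailable_value"}

where properties is declared as JSON in the schema, so the placeholder string that Debezium substitutes for unchanged TOASTed columns fails JSON parsing.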

benesch avatar Mar 29 '23 01:03 benesch

I think that’s a slightly different feature than this issue is describing — you’re asking to attempt to decode logical types as the underlying physical type, rather than skipping records for which decoding fails.

Would your debugging session have been easier if we printed out the contents of the field in the error message when JSON decoding fails, or printed out the Kafka offset at which decoding fails?

umanwizard avatar Mar 29 '23 06:03 umanwizard

I think that’s a slightly different feature than this issue is describing — you’re asking to attempt to decode logical types as the underlying physical type, rather than skipping records for which decoding fails.

I guess! Agreed that the existing proposed option is OMIT INVALID AVRO UNIONS, but both TREAT JSON AS STRING and OMIT RECORDS WITH BAD JSON seem like options we could consider adding.
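
To make the placement concrete, a purely hypothetical sketch (none of these options exist today; the names are just the ones proposed above), following the earlier convention of keeping format-specific options next to FORMAT AVRO:

CREATE SOURCE
FROM KAFKA CONNECTION ...
FORMAT AVRO (OMIT INVALID UNIONS, TREAT JSON AS STRING)
WITH (SIZE 'small')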

Would your debugging session have been easier if we printed out the contents of the field in the error message when JSON decoding fails, or printed out the Kafka offset at which decoding fails?

Yes and yes. Though the offset wasn't crucial, because we had the key that failed. Would also have helped if the error indicated the name of the field within the record that failed to decode.

benesch avatar Mar 29 '23 06:03 benesch

Would also have helped if the error indicated the name of the field within the record that failed to decode.

I thought at some point @pH14 did that as one of his first or second tasks. Maybe I’m misremembering the details, or maybe it has bit rotted since then.

umanwizard avatar Mar 29 '23 06:03 umanwizard

This came up again recently with a customer: https://materializeinc.slack.com/archives/CU7ELJ6E9/p1688829511597169?thread_ts=1688680390.639199&cid=CU7ELJ6E9

We're not sure of the exact cause of the uninterpretable data, but we suspect a Debezium JSON field, as we've seen before.

benesch avatar Jul 10 '23 00:07 benesch