snowplow-rdb-loader
RDB Shredder: consider disabling validation against JSON Schema
From my experience, enriched data assumes that raw data was not just enriched but also validated: we never add invalid contexts/unstruct events to the final enriched event.
Yet validation is a fairly compute-heavy process (not compared to distributed IO, but still), so in the end we're wasting resources by double-validating data that is already valid.
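To make the cost argument concrete, here is a minimal Scala sketch, not the actual RDB Shredder code. It assumes the transformer only needs the Iglu schema key and the payload to partition a self-describing JSON; `SelfDescribing`, `extractOnly` and `extractAndValidate` are illustrative names, and the validation step is represented by a caller-supplied function rather than a real Iglu client call.

```scala
import io.circe.Json
import io.circe.parser.parse

final case class SelfDescribing(schemaKey: String, data: Json)

// Cheap path: pull out "schema" and "data" without consulting any JSON Schema.
def extractOnly(raw: String): Either[String, SelfDescribing] =
  for {
    json   <- parse(raw).left.map(_.message)
    schema <- json.hcursor.get[String]("schema").left.map(_.message)
    data   <- json.hcursor.downField("data").focus.toRight("missing data field")
  } yield SelfDescribing(schema, data)

// Expensive path: the same extraction followed by a schema resolution + validation,
// here modelled as a caller-supplied function (e.g. backed by an Iglu resolver).
def extractAndValidate(
  raw: String,
  validate: SelfDescribing => Either[String, Unit]
): Either[String, SelfDescribing] =
  extractOnly(raw).flatMap(sd => validate(sd).map(_ => sd))
```

The proposal in this issue is essentially to run only the cheap path during shredding, on the grounds that the expensive path already ran during enrichment.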
A question for someone who is familiar with bad rows: what is the most common type of error in the shredded/bad bucket?
E.g. the Snowflake Transformer does not do any validation and seems quite happy with that.
> we never add invalid contexts/unstruct events to the final enriched event.
I'm not sure that's true. @BenFradet can confirm whether the validation of the overall event, including all contexts, is the last thing that happens before writing it out.
Custom contexts and the unstruct event are validated during enrichment.
Derived contexts are indeed not validated. That said, it sounds very reasonable to me to validate derived contexts as well, so that the shred job always has a guarantee that it receives valid enriched data.
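For illustration, a hedged sketch of where such a derived-context check could sit at the end of enrichment; `Context`, `EnrichedEvent`, `finalizeEvent` and the `validate` parameter are placeholders, not the real snowplow-common-enrich types.

```scala
import io.circe.Json

final case class Context(schemaKey: String, data: Json)
final case class EnrichedEvent(contexts: List[Context], derivedContexts: List[Context])

def finalizeEvent(
  event: EnrichedEvent,
  validate: Context => Either[String, Unit] // e.g. backed by an Iglu resolver
): Either[List[String], EnrichedEvent] = {
  // Custom contexts are already validated during enrichment; here we also check
  // the derived contexts before the event is written out as "enriched".
  val failures = event.derivedContexts.flatMap(ctx => validate(ctx).left.toOption)
  if (failures.isEmpty) Right(event) else Left(failures)
}
```

With a check like this in place, everything downstream of enrichment could rely on all attached entities being schema-valid.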
Isn't removing validation introducing coupling between the two?
Also, somewhat unfortunately, I don't think bypassing validation will save us a lot of time / resources.
> Isn't removing validation introducing coupling between the two?
To be honest, this kind of coupling is one of the goals I had in mind. I would like to add more meaning to the "enriched" state of an event. E.g. "enriched" means the event is in a canonical state, ready for loading/processing, and hence it is fully valid and there is no way for a validation-related error to appear during shredding/transformation.
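As a rough illustration of that guarantee (the names below are hypothetical, not an existing Snowplow API), the invariant could even be encoded in a type whose only constructor runs validation, so downstream jobs cannot receive unvalidated data by construction:

```scala
final class ValidEnriched private (val tsv: String)

object ValidEnriched {
  // The only way to obtain a ValidEnriched value is to pass validation first.
  def fromEnrichedTsv(
    line: String,
    validateAll: String => Either[String, Unit] // e.g. schema checks for all attached entities
  ): Either[String, ValidEnriched] =
    validateAll(line).map(_ => new ValidEnriched(line))
}
```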
> I would like to add more meaning to the "enriched" state of an event
Yes, I understand the intent; it makes sense to me. I'm not sure how it maps onto this ticket in the short- to mid-term, though.