snowplow-rdb-loader
RDB Shredder: validate that JSONs conform to Redshift limits
Given that only RDB Loader has knowledge of the targeted database, it makes sense that it enforces the database limits (e.g. 4 MB for JSONs in Redshift).
Hm, not sure that's possible in RDB Loader. The Loader only has access to the folder structure.
Can't we try recovering from this issue? It fails the whole load no matter what.
As far as I know, Redshift just silently truncates columns that exceed the limit: https://github.com/snowplow/snowplow-rdb-loader/blob/master/src/main/scala/com/snowplowanalytics/snowplow/rdbloader/loaders/RedshiftLoadStatements.scala#L180
Those truncations are not counted towards the max errors setting (MAXERROR).
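For context, here is a hedged sketch of what such a COPY statement looks like; all identifiers below are placeholders and the option set is only an approximation of what RedshiftLoadStatements builds. TRUNCATECOLUMNS is what makes Redshift cut over-long values silently as a warning, which is why those cases never count towards MAXERROR.

```scala
// Rough illustration only, not the Loader's actual code: the shape of a COPY
// statement for a shredded type. Table, S3 path, jsonpaths file and role are
// placeholders. TRUNCATECOLUMNS is the option that silently truncates over-long
// values instead of failing the row.
object CopyStatementSketch {
  def shreddedCopy(table: String, s3Path: String, jsonPaths: String, roleArn: String, maxError: Int): String =
    s"""COPY $table FROM '$s3Path'
       |CREDENTIALS 'aws_iam_role=$roleArn'
       |JSON AS '$jsonPaths'
       |REGION AS 'us-east-1'
       |MAXERROR $maxError
       |TRUNCATECOLUMNS
       |TIMEFORMAT 'auto'
       |ACCEPTINVCHARS""".stripMargin
}
```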
Sorry, I forgot the background on that issue. Is it a whole NDJSON line (entity in shredded type) that breaks a load? Or just a single column/property in a table/schema?
I think RDB Loader shouldn't be in charge of swallowing/recovering from Redshift exceptions; instead we should send the event into a bad row during shredding.
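A minimal sketch of that idea, assuming a made-up OversizedEntity bad-row shape and the 4 MB figure from this thread (this is not the real snowplow bad-row schema):

```scala
// Illustrative sketch: during shredding, route entities that exceed a size limit
// into a bad row instead of letting them break the Redshift COPY later.
// `OversizedEntity` is invented for this sketch, not the actual bad-row schema.
final case class OversizedEntity(actualBytes: Int, limitBytes: Int, payloadHead: String)

object ShredGuard {
  val RedshiftJsonLimit: Int = 4 * 1024 * 1024 // 4 MB figure quoted in this thread

  def guard(json: String): Either[OversizedEntity, String] = {
    val bytes = json.getBytes("UTF-8").length
    if (bytes <= RedshiftJsonLimit) Right(json)
    else Left(OversizedEntity(bytes, RedshiftJsonLimit, json.take(256)))
  }
}
```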
Yes, but different storage backends will have different properties; that's what I meant in the original post.
In Redshift a JSON field can't be more than 4 MB; what about Postgres, BigQuery, etc.? And the Shredder doesn't have any coupling to a particular database.
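For illustration only, a per-target limit table could look like the sketch below; the Redshift figure is the 4 MB limit discussed here, while the Postgres and BigQuery numbers are placeholders, not verified limits.

```scala
// Sketch of target-specific size limits; only the Redshift figure comes from this
// thread, the other numbers are placeholders for illustration.
object TargetLimits {
  sealed trait StorageTarget { def maxJsonBytes: Long }
  case object Redshift extends StorageTarget { val maxJsonBytes: Long = 4L * 1024 * 1024 }
  case object Postgres extends StorageTarget { val maxJsonBytes: Long = 1024L * 1024 * 1024 } // placeholder
  case object BigQuery extends StorageTarget { val maxJsonBytes: Long = 100L * 1024 * 1024 }  // placeholder

  def fits(json: String, target: StorageTarget): Boolean =
    json.getBytes("UTF-8").length <= target.maxJsonBytes
}
```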
> What about Postgres, BigQuery, etc.? And the Shredder doesn't have any coupling to a particular database.
The lack of coupling in Spark Shred sounds good in theory but I doubt it's practical (this thread is a good reason why).
@alexanderdean I think we should prioritize this; it's a huge drag for support at the moment when they have to recover.
Agreed, added to the next+1 milestone.
Turns out we already constrain the output: https://github.com/snowplow/snowplow-rdb-loader/issues/103