snowplow-rdb-loader
RDB Shredder: validate that JSONs conform to Redshift limits
Given that only RDB Loader has knowledge of the targeted database, it makes sense that it enforces the database limits (e.g. 4 MB for JSONs in Redshift).
Hm, not sure that's possible in RDB Loader. The Loader only has access to the folder structure.
Can't we try recovering from this issue? It fails the whole load no matter what.
As far as I know, Redshift just silently truncates columns that exceed the limit: https://github.com/snowplow/snowplow-rdb-loader/blob/master/src/main/scala/com/snowplowanalytics/snowplow/rdbloader/loaders/RedshiftLoadStatements.scala#L180
Those truncations are not counted towards the max errors setting (MAXERROR).
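For context, here is a hedged sketch of what such a COPY statement looks like; all identifiers below are placeholders and the option set is only an approximation of what RedshiftLoadStatements builds. TRUNCATECOLUMNS is what makes Redshift cut over-long values silently as a warning, which is why those cases never count towards MAXERROR.

```scala
// Rough illustration only, not the Loader's actual code: the shape of a COPY
// statement for a shredded type. Table, S3 path, jsonpaths file and role are
// placeholders. TRUNCATECOLUMNS is the option that silently truncates over-long
// values instead of failing the row.
object CopyStatementSketch {
  def shreddedCopy(table: String, s3Path: String, jsonPaths: String, roleArn: String, maxError: Int): String =
    s"""COPY $table FROM '$s3Path'
       |CREDENTIALS 'aws_iam_role=$roleArn'
       |JSON AS '$jsonPaths'
       |REGION AS 'us-east-1'
       |MAXERROR $maxError
       |TRUNCATECOLUMNS
       |TIMEFORMAT 'auto'
       |ACCEPTINVCHARS""".stripMargin
}
```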
Sorry, I forgot the background on that issue. Is it a whole NDJSON line (entity in shredded type) that breaks a load? Or just a single column/property in a table/schema?
I think RDB Loader shouldn't be in charge of swallowing/recovering from Redshift exceptions; instead we should send the event into a bad row during shredding.
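A minimal sketch of that idea, assuming a made-up OversizedEntity bad-row shape and the 4 MB figure from this thread (this is not the real snowplow bad-row schema):

```scala
// Illustrative sketch: during shredding, route entities that exceed a size limit
// into a bad row instead of letting them break the Redshift COPY later.
// `OversizedEntity` is invented for this sketch, not the actual bad-row schema.
final case class OversizedEntity(actualBytes: Int, limitBytes: Int, payloadHead: String)

object ShredGuard {
  val RedshiftJsonLimit: Int = 4 * 1024 * 1024 // 4 MB figure quoted in this thread

  def guard(json: String): Either[OversizedEntity, String] = {
    val bytes = json.getBytes("UTF-8").length
    if (bytes <= RedshiftJsonLimit) Right(json)
    else Left(OversizedEntity(bytes, RedshiftJsonLimit, json.take(256)))
  }
}
```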
Yes, but different storage backends will have different properties; that's what I meant in the original post.
In Redshift a JSON field can't be more than 4 MB; what about Postgres, BigQuery, etc.? And the Shredder doesn't have any coupling to a particular database.
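For illustration only, a per-target limit table could look like the sketch below; the Redshift figure is the 4 MB limit discussed here, while the Postgres and BigQuery numbers are placeholders, not verified limits.

```scala
// Sketch of target-specific size limits; only the Redshift figure comes from this
// thread, the other numbers are placeholders for illustration.
object TargetLimits {
  sealed trait StorageTarget { def maxJsonBytes: Long }
  case object Redshift extends StorageTarget { val maxJsonBytes: Long = 4L * 1024 * 1024 }
  case object Postgres extends StorageTarget { val maxJsonBytes: Long = 1024L * 1024 * 1024 } // placeholder
  case object BigQuery extends StorageTarget { val maxJsonBytes: Long = 100L * 1024 * 1024 }  // placeholder

  def fits(json: String, target: StorageTarget): Boolean =
    json.getBytes("UTF-8").length <= target.maxJsonBytes
}
```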
> What about Postgres, BigQuery, etc.? And the Shredder doesn't have any coupling to a particular database.
The lack of coupling in Spark Shred sounds good in theory but I doubt it's practical (this thread is a good reason why).
@alexanderdean I think we should prioritize this; it's a huge drag for support at the moment when they have to recover.
Agreed, added to the next+1 milestone.
Turns out we already constrain the output: https://github.com/snowplow/snowplow-rdb-loader/issues/103