snowplow-rdb-loader

RDB Shredder: validate that JSONs are conforming to Redshift limits

Open BenFradet opened this issue 7 years ago • 10 comments

Given that only RDB Loader has knowledge of the targeted database, it makes sense that it enforces the database limits (e.g. 4 MB for JSONs in Redshift).

BenFradet avatar Dec 11 '17 14:12 BenFradet

Hm, I'm not sure that's possible in RDB Loader. The Loader only has access to the folder structure.

chuwy avatar Dec 11 '17 14:12 chuwy

Can't we try recovering from this issue? As it stands, it fails the whole load no matter what.

BenFradet avatar Dec 11 '17 14:12 BenFradet

As far as I know, Redshift just silently truncates columns that exceed a limit: https://github.com/snowplow/snowplow-rdb-loader/blob/master/src/main/scala/com/snowplowanalytics/snowplow/rdbloader/loaders/RedshiftLoadStatements.scala#L180

chuwy avatar Dec 11 '17 14:12 chuwy

Those errors are not taken into account in max errors.

BenFradet avatar Dec 11 '17 14:12 BenFradet

Sorry, I forgot the background on that issue. Is it a whole NDJSON line (entity in shredded type) that breaks a load? Or just a single column/property in a table/schema?

I think that RDB Loader shouldn't be in charge of swallowing/recovering from Redshift exceptions, but instead we should send this event into a bad row during shredding.
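A minimal sketch of that idea, assuming the check is a plain byte-size comparison on each shredded NDJSON line. The names `JsonSizeCheck`, `OversizedJson`, `validate` and `split` are hypothetical and not part of the Shredder, and the 4 MB figure is simply the Redshift limit quoted earlier in this thread. The limit is passed in as a parameter, so the shredding code itself does not need to know which database it is targeting.

```scala
import java.nio.charset.StandardCharsets.UTF_8

object JsonSizeCheck {
  // Assumption: 4 MB, the Redshift JSON limit mentioned in this thread
  val RedshiftMaxJsonBytes: Int = 4 * 1024 * 1024

  // Hypothetical record for a rejected line; NOT the real Snowplow bad row format
  final case class OversizedJson(line: String, sizeBytes: Int, limitBytes: Int)

  // Left if the UTF-8 encoded line is larger than the limit, Right otherwise
  def validate(line: String, limitBytes: Int): Either[OversizedJson, String] = {
    val size = line.getBytes(UTF_8).length
    if (size > limitBytes) Left(OversizedJson(line, size, limitBytes))
    else Right(line)
  }

  // Partition a batch into (rejected, loadable) lines, preserving order
  def split(lines: Seq[String], limitBytes: Int): (Seq[OversizedJson], Seq[String]) = {
    val results = lines.map(validate(_, limitBytes))
    (results.collect { case Left(bad) => bad }, results.collect { case Right(good) => good })
  }
}
```

With something like `JsonSizeCheck.split(lines, JsonSizeCheck.RedshiftMaxJsonBytes)`, oversized lines would be routed to bad rows instead of failing the load later.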

chuwy avatar Dec 11 '17 19:12 chuwy

Yes, but different storage backends will have different properties; that's what I meant in the original post.

So in Redshift, a JSON field can't be more than 4 MB, but what about Postgres, BigQuery, etc.? And the shredder doesn't have any coupling to the database.

BenFradet avatar Dec 11 '17 20:12 BenFradet

> what about postgres, bigquery, etc. and the shredder doesn't have any coupling regarding database

The lack of coupling in Spark Shred sounds good in theory but I doubt it's practical (this thread is a good reason why).

alexanderdean avatar Dec 11 '17 23:12 alexanderdean

@alexanderdean I think we should prioritize this, it's a huge drag for support atm when they have to recover.

BenFradet avatar Feb 23 '18 09:02 BenFradet

Agree, added to the next+1 milestone

alexanderdean avatar Feb 24 '18 00:02 alexanderdean

Turns out we already constrain the output: https://github.com/snowplow/snowplow-rdb-loader/issues/103

chuwy avatar May 29 '18 12:05 chuwy