snowplow-rdb-loader
RDB Loader: automatic maxError correction
We have a very common kind of error caused by a schema mistake that results in a std load error. What we usually do is either:
- Notify the owner, asking them to fix the underlying issue, or
- Increase `maxError` so that the pipeline can proceed

Very often it's the 2nd option, if we cannot do anything else or the pipeline has to keep running.
What I propose is to introduce two kinds of `maxError` setting:

- `maxErrorAlert` - the lower bound (e.g. `1`); if it has been reached the batch isn't dropped, but instead we generate a warning and try to increase the Redshift `MAXERROR` up to `maxErrorStop`
- `maxErrorStop` - if this upper bound has been reached we actually want to hard-stop the loader until the underlying problem is fixed

To imitate the current behavior these settings can be set to the same value.
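To make the shape of the proposal concrete, here is a minimal sketch of how the two settings could be modelled; the names follow this issue, but the case class and its fields are purely hypothetical and not part of the actual loader config:

```scala
// Hypothetical sketch of the proposed settings (not the real RDB Loader config schema)
final case class MaxErrorConfig(
  maxErrorAlert: Int, // lower bound: past this many bad rows we warn but keep loading
  maxErrorStop: Int   // upper bound: past this many bad rows we hard-stop the loader
) {
  require(maxErrorAlert <= maxErrorStop, "maxErrorAlert must not exceed maxErrorStop")

  // Setting both bounds to the same value imitates the current single-maxError behavior
  def imitatesCurrentBehavior: Boolean = maxErrorAlert == maxErrorStop
}
```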
Is `maxError` for one SQS message? Is the number of errors stored in the loader's memory?
No, this is passed directly to Redshift (https://docs.aws.amazon.com/redshift/latest/dg/copy-parameters-data-load.html#copy-maxerror):
```sql
COPY table_name FROM 's3://data/' MAXERROR 10;
```
Now, if we ran into 11 corrupted lines (events), the loading fails and Support has to decide on their action. With this ticket implemented, we'd be running it with `MAXERROR $maxErrorAlert` every time, but then if we reached `maxErrorAlert + 1`, instead of crashing the app and abandoning the folder, we'd keep retrying this single folder until we reach `maxErrorStop`.
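For illustration only, here is a rough sketch of that retry flow, assuming a hypothetical `copyWithMaxError` function that runs the COPY with a given `MAXERROR` and reports whether it succeeded, plus the `MaxErrorConfig` sketched earlier in the thread:

```scala
// Hypothetical escalation flow: load with MAXERROR maxErrorAlert, and on failure
// warn and retry the same folder with MAXERROR maxErrorStop before hard-stopping.
sealed trait LoadOutcome
case object Loaded extends LoadOutcome   // folder loaded (possibly with a warning raised)
case object MustStop extends LoadOutcome // upper bound exceeded: stop the loader

def loadFolder(
  folder: String,
  cfg: MaxErrorConfig,
  copyWithMaxError: (String, Int) => Boolean, // true if COPY succeeded within MAXERROR
  warn: String => Unit
): LoadOutcome =
  if (copyWithMaxError(folder, cfg.maxErrorAlert))
    Loaded
  else {
    warn(s"$folder exceeded MAXERROR ${cfg.maxErrorAlert}, retrying with MAXERROR ${cfg.maxErrorStop}")
    if (copyWithMaxError(folder, cfg.maxErrorStop)) Loaded
    else MustStop
  }
```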
I see, thanks for the explanation!