snowplow-rdb-loader

RDB Loader: automatic maxError correction

Open chuwy opened this issue 3 years ago • 3 comments

We have a very common kind of error, caused by a schema mistake, that results in a load error. What we usually do is either:

  1. Notify the owner and ask them to fix the underlying issue, or
  2. Increase maxError so that the pipeline can proceed

Very often it's the second option, either because we cannot do anything about the data or because the pipeline has to keep running.

What I propose is to introduce two kinds of maxError settings:

  1. maxErrorAlert - the lower bound (e.g. 1). If it is reached, the batch isn't dropped; instead we generate a warning and try to increase the Redshift MAXERROR up to maxErrorStop
  2. maxErrorStop - if this upper bound is reached, we want to hard-stop the loader until the underlying problem is fixed

To imitate the current behavior, both settings can be set to the same value.
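For illustration, here is a minimal Scala sketch of how the two settings could be modelled. The names and structure are hypothetical, not the actual snowplow-rdb-loader configuration schema:

```scala
// Hypothetical model of the two proposed settings; not the actual
// snowplow-rdb-loader configuration schema.
final case class MaxErrorConfig(
  maxErrorAlert: Int, // lower bound: warn but keep loading the batch
  maxErrorStop: Int   // upper bound: hard-stop the loader
) {
  require(maxErrorAlert <= maxErrorStop, "maxErrorAlert must not exceed maxErrorStop")

  /** Setting both thresholds to the same value reproduces today's single maxError behavior. */
  def imitatesCurrentBehavior: Boolean = maxErrorAlert == maxErrorStop
}
```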

chuwy · May 17 '21 11:05

Is maxError per SQS message? Is the number of errors stored in the loader's memory?

benjben · May 17 '21 11:05

No, this is passed directly to Redshift (https://docs.aws.amazon.com/redshift/latest/dg/copy-parameters-data-load.html#copy-maxerror):

COPY table_name FROM 's3://data/' MAXERROR 10;

Now, if we run into 11 corrupted lines (events), the loading fails and Support has to decide what to do. With this ticket implemented, we'd run the COPY with MAXERROR $maxErrorAlert every time; then, if we reached maxErrorAlert + 1 errors, instead of crashing the app and abandoning the folder we would keep retrying this single folder until we reach maxErrorStop.
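To make that flow concrete, here is a rough Scala sketch of the described alert/stop decision logic. `runCopy` and `countLoadErrors` are hypothetical placeholders standing in for the loader's real JDBC calls, and the whole thing is only an illustration of the proposal, not the loader's actual internals:

```scala
// Sketch of the proposed alert/stop flow. runCopy and countLoadErrors
// are placeholders, not real RDB Loader functions.
object MaxErrorFlow {

  /** Issue COPY ... MAXERROR n for the folder; returns true if Redshift accepted the batch. */
  def runCopy(folder: String, maxError: Int): Boolean = ??? // placeholder for the real JDBC call

  /** Count bad rows for the folder, e.g. by inspecting stl_load_errors. */
  def countLoadErrors(folder: String): Int = ??? // placeholder

  def loadFolder(folder: String, maxErrorAlert: Int, maxErrorStop: Int): Unit =
    if (!runCopy(folder, maxError = maxErrorAlert)) { // lower threshold exceeded
      val badRows = countLoadErrors(folder)
      if (badRows <= maxErrorStop) {
        println(s"WARN: $folder has $badRows bad rows, retrying with MAXERROR $maxErrorStop")
        runCopy(folder, maxError = maxErrorStop) // retry the same folder instead of abandoning it
      } else
        sys.error(s"$folder exceeds maxErrorStop ($maxErrorStop), stopping the loader")
    }
}
```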

chuwy · May 17 '21 11:05

I see, thanks for the explanation!

benjben · May 17 '21 11:05