snowplow-rdb-loader
Redshift Loader: Handle missing columns in CSV
This issue is about schema evolutions that add new columns. A problem arises when the data is transformed using the older schema, but the load is attempted using the newer schema.
There are a few situations where this problem could arise:
- When loading historic data. It is a principle of RDB Loader that the transformed archive can be re-used at a future date to populate a new warehouse.
- When re-processing a failed load. Imagine a batch fails to load on 10th January because of a connection failure, the schema is evolved on 11th January, and on 12th January you try re-loading the failed batch.
- A simple race condition between the transformer and loader. Imagine the schema is evolved at approximately the same time that the transformer finishes a batch and writes the SQS message. (It's actually more subtle than this, because of how iglu clients cache schemas, but I think this situation is not uncommon).
In all of these cases, the CSV file might have fewer columns than the loader/warehouse expects.
When this happens, the loader logs contain this message:
```
Not adding into retry queue. [Amazon](500310) Invalid operation: Load into table 'com_acme_myschema_1' failed. Check 'stl_load_errors' system table for details.;
```
If you look up the error in the `stl_load_errors` table, the error message is:

```
raw_field_value = \\N
error_code = 1214
error_reason = Delimiter not found
```
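The "Delimiter not found" error arises because Redshift expects one tab-separated field per target column, so a row transformed under the older schema runs out of delimiters before the last column is filled. A minimal sketch of that mismatch (column names are hypothetical):

```python
# Columns in the evolved table; suppose schema 1-0-1 added column "c".
# These names are hypothetical, for illustration only.
table_columns = ["a", "b", "c"]

# A TSV row written by the transformer under the older schema (1-0-0),
# which only knew about "a" and "b".
old_row = "foo\tbar"

fields = old_row.split("\t")

# The row has fewer fields than the table has columns, so a COPY that
# implicitly targets all table columns fails with "Delimiter not found".
assert len(fields) < len(table_columns)
```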
The solution is to always specify the column list in the COPY statement. Currently we load a shredded table like this:
```sql
COPY com_acme_myschema_1
FROM 's3://mybucket/run=123'
DELIMITER '\t'
```

whereas we need it to look like this:

```sql
COPY com_acme_myschema_1 (col1, col2, col3)
FROM 's3://mybucket/run=123'
DELIMITER '\t'
```
...where the column list refers to the columns that we know are present in the CSV file, not the columns that are present in the table.
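Rendering such a statement is straightforward once the loader knows which columns the batch actually contains. A minimal sketch (the function name is hypothetical; the real loader would derive the column list from the schema version the batch was transformed with):

```python
def render_copy(table: str, columns: list[str], s3_path: str) -> str:
    """Render a Redshift COPY statement that names only the columns
    known to be present in the transformed output."""
    column_list = ", ".join(columns)
    return (
        f"COPY {table} ({column_list})\n"
        f"FROM '{s3_path}'\n"
        "DELIMITER '\\t'"
    )

# A batch transformed before the schema evolution only carries the old columns.
sql = render_copy("com_acme_myschema_1", ["col1", "col2"],
                  "s3://mybucket/run=123")
```

Because the statement names only `col1` and `col2`, Redshift fills the missing `col3` with its default (NULL) instead of failing the load.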