snowplow-rdb-loader
Redshift Loader: Handle missing columns in CSV
This issue is about schema evolutions that add new columns. A problem arises when the data is transformed using the older schema, but the load is attempted using the newer schema.
There are a few situations where this problem could arise:
- When loading historic data. It is a principle of RDB Loader that the transformed archive can be re-used at a future date to populate a new warehouse.
- When re-processing a failed load. Imagine a batch fails to load on 10th January because of a connection failure, the schema is evolved on 11th January, and on 12th January you try re-loading the failed batch.
- A simple race condition between the transformer and loader. Imagine the schema is evolved at approximately the same time that the transformer finishes a batch and writes the SQS message. (It's actually more subtle than this, because of how iglu clients cache schemas, but I think this situation is not uncommon).
In all of these cases, the CSV file might have fewer columns than the loader/warehouse expects.
When this happens, the loader logs contain this message:
```
Not adding into retry queue. [Amazon](500310) Invalid operation: Load into table 'com_acme_myschema_1' failed. Check 'stl_load_errors' system table for details.;
```
If you look up the error in the `stl_load_errors` table, the error message is:

```
raw_field_value = \\N
error_code = 1214
error_reason = Delimiter not found
```
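The "Delimiter not found" error arises because Redshift expects one tab-separated field per target column, so a row transformed under the older schema runs out of delimiters before the last column is filled. A minimal sketch of that mismatch (column names are hypothetical):

```python
# Columns in the evolved table; suppose schema 1-0-1 added column "c".
# These names are hypothetical, for illustration only.
table_columns = ["a", "b", "c"]

# A TSV row written by the transformer under the older schema (1-0-0),
# which only knew about "a" and "b".
old_row = "foo\tbar"

fields = old_row.split("\t")

# The row has fewer fields than the table has columns, so a COPY that
# implicitly targets all table columns fails with "Delimiter not found".
assert len(fields) < len(table_columns)
```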
The solution is to always specify the column list in the COPY statement. Currently we load a shredded table like this:
```sql
COPY com_acme_myschema_1
FROM 's3://mybucket/run=123'
DELIMITER '\t'
```

whereas we need it to look like this:

```sql
COPY com_acme_myschema_1 (col1, col2, col3)
FROM 's3://mybucket/run=123'
DELIMITER '\t'
```
...where the column list refers to the columns that we know are present in the CSV file, not the columns that are present in the table.
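Rendering such a statement is straightforward once the loader knows which columns the batch actually contains. A minimal sketch (the function name is hypothetical; the real loader would derive the column list from the schema version the batch was transformed with):

```python
def render_copy(table: str, columns: list[str], s3_path: str) -> str:
    """Render a Redshift COPY statement that names only the columns
    known to be present in the transformed output."""
    column_list = ", ".join(columns)
    return (
        f"COPY {table} ({column_list})\n"
        f"FROM '{s3_path}'\n"
        "DELIMITER '\\t'"
    )

# A batch transformed before the schema evolution only carries the old columns.
sql = render_copy("com_acme_myschema_1", ["col1", "col2"],
                  "s3://mybucket/run=123")
```

Because the statement names only `col1` and `col2`, Redshift fills the missing `col3` with its default (NULL) instead of failing the load.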