Redshift Loader: Handle missing columns in CSV
This issue is about schema evolutions which add new columns. There is a problem that arises when the data is transformed using the older schema, but attempted to load using the newer schema.
There are a few situations where this problem could arise:
- When loading historic data. It is a principle of RDB Loader that the transformed archive can be re-used at a future date to populate a new warehouse.
- When re-processing a failed load. Imagine a batch fails to load on 10th January because of a connection failure. The schema is evolved on 11th January. And on 12th January you try re-loading the failed batch.
- A simple race condition between the transformer and loader. Imagine the schema is evolved at approximately the same time that the transformer finishes a batch and writes the SQS message. (It's actually more subtle than this, because of how iglu clients cache schemas, but I think this situation is not uncommon).
In all those cases, the CSV file might have fewer columns than what is expected by the loader/warehouse.
When this happens, the loader logs contain this message:
Not adding into retry queue. [Amazon](500310) Invalid operation: Load into table 'com_acme_myschema_1' failed. Check 'stl_load_errors' system table for details.;
If you look up the error in the stl_load_errors table, then the error message is:
raw_field_value = \\N
error_code = 1214
error_reason = Delimiter not found
The solution is to always specify the columns in the COPY INTO statement. Currently we load a shredded table like this:
COPY com_acme_myschema_1
FROM `s3://mybucket/run=123`
DELIMITER '\t'
whereas we need it look like this:
COPY com_acme_myschema_1 (col1, col2, col3)
FROM `s3://mybucket/run=123`
DELIMITER '\t'
...where the column list refer to the columns that we know are present in the CSV file; not the columns that are present in the table.
We've run into this issue a few times when upgrading schemas. A minor version bump, adding new fields to a schema, results in this error from the RDB loader:
ERROR Loader: Loading of s3://path/ has failed. Not adding into retry queue. [Amazon](500310) Invalid operation: Load into table 'com_snowplowanalytics_snowplow_mobile_context_1' failed. Check 'stl_load_errors' system table for details.
And shows this error in stl_load_errors because the new fields do not exist in the CSV file.
Delimiter not found
From what I can tell, this is because the events were processed by the RDB transformer prior to the update, and the RDB loader is trying to process them against the new schema.
Currently running both the transformer and loader as docker images on EC2 instances.
snowplow/transformer-kinesis:5.0.0 and snowplow/rdb-loader-redshift:5.0.0
We also have this problem.
CSV file might have fewer columns than what is expected by the loader/warehouse.
[2023-05-17 09:31:04.276+0000] [ERROR] Loader: Loading of s3://my-bucket/transformed/run=2023-05-09-05-34-31/ has failed. Not adding into retry queue. [Amazon](500310) Invalid operation: Load into table 'mobile_idea_open_1' failed. Check 'stl_load_errors' system table for details.; - SqlState: XX000