snowplow-rdb-loader
Transformer: fix the "skip" feature for skipping schemas
The transformer config file has a "skip" option, which is documented as:
# Schemas that won't be loaded
# Optional, default value []
"skip": [
"iglu:com.acme/skip-event/jsonschema/1-*-*"
]
If you add a schema to this array in the config file, the transformer transforms it to JSON instead of TSV, which is not what I expected! The resulting SQS message lists the schema in the "types" array:
"types": [
{
"schemaKey": "iglu:com.acme/skip-event/jsonschema/1-0-0",
"format": "JSON",
"snowplowEntity": "SELF_DESCRIBING_EVENT"
},
// etc
The loader sees the type in the SQS message and tries to load the JSON file. But if there is no valid JSONPaths file for this schema, then loading fails.
In other words, this "skip" feature is completely broken.
Fix or remove the feature?
On the one hand, I think this feature has been broken for a long time, so nobody can be using it. We could just remove the feature and it wouldn't upset anyone.
On the other hand, I guess the "skip" feature exists as a protection against un-loadable schemas. We know this can be a problem for Redshift and Databricks loading if the schema has broken the rules of non-breaking versioning. We could fix it so that the transformer ignores schemas in the array instead of transforming them to JSON.
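To make the "fix" option concrete, here is a minimal sketch of the intended behaviour: schemas matching an entry in "skip" are dropped entirely rather than being routed to JSON. It is written in Python purely for illustration and is not the transformer's actual code. The function names (parse_iglu_uri, matches, drop_skipped) are hypothetical, and the matching rule (vendor, name and format must be equal; "*" in the version matches anything) is my assumption about how a criterion like 1-*-* is meant to behave.

```python
import re
from typing import List, Tuple

# Hypothetical parser for Iglu-style URIs such as
# "iglu:com.acme/skip-event/jsonschema/1-*-*" (criterion) or
# "iglu:com.acme/skip-event/jsonschema/1-0-0" (concrete schema key).
_IGLU_URI = re.compile(
    r"^iglu:([^/]+)/([^/]+)/([^/]+)/(\d+|\*)-(\d+|\*)-(\d+|\*)$"
)


def parse_iglu_uri(uri: str) -> Tuple[str, ...]:
    """Split a URI into (vendor, name, format, model, revision, addition)."""
    m = _IGLU_URI.match(uri)
    if m is None:
        raise ValueError(f"Not a valid Iglu URI: {uri}")
    return m.groups()


def matches(criterion: str, schema_key: str) -> bool:
    """Assumed rule: exact match on vendor/name/format, '*' is a version wildcard."""
    c = parse_iglu_uri(criterion)
    k = parse_iglu_uri(schema_key)
    if c[:3] != k[:3]:
        return False
    return all(cv == "*" or cv == kv for cv, kv in zip(c[3:], k[3:]))


def drop_skipped(schema_keys: List[str], skip: List[str]) -> List[str]:
    """Keep only the schema keys that match no criterion in the skip list."""
    return [k for k in schema_keys if not any(matches(c, k) for c in skip)]
```

With skip set to the documented example, 1-0-0 of com.acme/skip-event would be dropped while 2-0-0 and any other event would be kept. For the fix to work end to end, I assume the skipped schemas would have to be excluded both from the transformed output and from the "types" array in the SQS message, so that the loader never sees them.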