snowplow-rdb-loader

Transformer: fix the "skip" feature for skipping schemas

Open istreeter opened this issue 2 years ago • 0 comments

The transformer config file has a skip option which is documented as:

    # Schemas that won't be loaded
    # Optional, default value []
    "skip": [
      "iglu:com.acme/skip-event/jsonschema/1-*-*"
    ]

If you add a schema to this array in the config file, the transformer does not skip it. Instead it transforms the entity to JSON rather than TSV, which is not what I expected! The resulting SQS message still lists the schema in the types array:

      "types": [
        {
          "schemaKey": "iglu:com.acme/skip-event/jsonschema/1-0-0",
          "format": "JSON",
          "snowplowEntity": "SELF_DESCRIBING_EVENT"
        },
// etc

The loader sees the type in the SQS message and tries to load the JSON file. But if there is no valid jsonpath for this schema, then loading fails.

In other words, this skip feature is completely broken.

Fix or remove the feature?

On the one hand, nobody can be using this feature because I think it's been broken for a long time. So we could just remove the feature and it won't upset anyone.

On the other hand, I guess the skip feature exists as a protection against un-loadable schemas. We know this can be a problem for Redshift and Databricks loading if a schema has broken the rules of non-breaking versioning. We could fix it so that the transformer ignores schemas in the array entirely, instead of transforming them to JSON.
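To make the "fix" option concrete, here is a minimal sketch of the filtering logic in Python. It is not the transformer's actual code (which is Scala); the function names `parse_iglu_uri`, `matches_criterion` and `drop_skipped` are hypothetical. It assumes each entry in `skip` is an Iglu URI whose model, revision and addition may each be `*`, as in the documented config example, and that skipped entities should be left out of the `types` list altogether.

```python
import re

# Hypothetical helper: splits an Iglu URI such as
# "iglu:com.acme/skip-event/jsonschema/1-*-*" into its parts.
# Each version component is either a number or the wildcard "*".
IGLU_URI = re.compile(
    r"^iglu:(?P<vendor>[^/]+)/(?P<name>[^/]+)/(?P<format>[^/]+)/"
    r"(?P<model>\d+|\*)-(?P<revision>\d+|\*)-(?P<addition>\d+|\*)$"
)


def parse_iglu_uri(uri):
    match = IGLU_URI.match(uri)
    if match is None:
        raise ValueError(f"Not a valid Iglu URI: {uri}")
    return match.groupdict()


def matches_criterion(schema_key, criterion):
    """Return True if a concrete schema key is covered by a skip criterion."""
    key = parse_iglu_uri(schema_key)
    crit = parse_iglu_uri(criterion)
    # Vendor, name and format must match exactly.
    if any(key[part] != crit[part] for part in ("vendor", "name", "format")):
        return False
    # Each version component matches if the criterion has "*" or the same number.
    return all(
        crit[part] == "*" or crit[part] == key[part]
        for part in ("model", "revision", "addition")
    )


def drop_skipped(types, skip):
    """Remove every type whose schemaKey matches any criterion in `skip`."""
    return [
        t for t in types
        if not any(matches_criterion(t["schemaKey"], c) for c in skip)
    ]
```

With this behaviour, a skipped schema never appears in the SQS message, so the loader would not attempt to load it at all. Whether the corresponding data should also be dropped from the transformed output is a separate decision that this sketch does not cover.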

istreeter, Jul 29 '22 16:07