snowplow-rdb-loader Common: simplify configuration

Migrated from https://github.com/snowplow/snowplow/issues/3279

Aug 17 '17 13:08 chuwy

This needs further thought about where we go from here.

Sep 20 '17 09:09 alexanderdean

Example JSON with all required properties:

{
  "schema": "iglu:com.snowplowanalytics.rdbloader/config/jsonschema/1-0-0",
  "data": {
    "buckets": {
      "shreddedGood": "s3://snowplow-acme-processing/shredded/good/",
      "log": "s3://snowplow-acme-processing/logs/",
      "jsonpathAssets": "s3://snowplow-acme-assets/optional/"
    },
    "outputCompression": "GZIP",
    "rdbShredder": "0.13.0",
    "tracking": {
      "method": "POST",
      "appId": "acme-loading",
      "collector": "collector.acme.com:8080"
    }
  }
}

buckets.jsonpathAssets and tracking are optional.

It uses its own vendor, following the Dataflow Runner and Factotum examples.
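As a rough illustration of the required/optional split described above, here is a minimal Python sketch that reports which required properties are missing from a config of this shape. The function name `missing_properties` is hypothetical and the check is hand-rolled for illustration only; it is not how RDB Loader itself validates the config, which would presumably be done against the Iglu schema.

```python
import json

# Required properties as described in this comment: everything in the example
# except buckets.jsonpathAssets and tracking, which are optional.
REQUIRED_TOP_LEVEL = ("buckets", "outputCompression", "rdbShredder")
REQUIRED_BUCKETS = ("shreddedGood", "log")


def missing_properties(config_json: str) -> list:
    """Return dotted paths of required properties absent from the config (hypothetical helper)."""
    data = json.loads(config_json).get("data", {})
    missing = [key for key in REQUIRED_TOP_LEVEL if key not in data]
    buckets = data.get("buckets", {})
    missing += [f"buckets.{key}" for key in REQUIRED_BUCKETS if key not in buckets]
    return missing


# A config that omits both optional properties still passes the check.
minimal = json.dumps({
    "schema": "iglu:com.snowplowanalytics.rdbloader/config/jsonschema/1-0-0",
    "data": {
        "buckets": {
            "shreddedGood": "s3://snowplow-acme-processing/shredded/good/",
            "log": "s3://snowplow-acme-processing/logs/",
        },
        "outputCompression": "GZIP",
        "rdbShredder": "0.13.0",
    },
})
print(missing_properties(minimal))
```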

Oct 19 '17 11:10 chuwy

I think this is a good start! Comments:

Doesn't outputCompression relate to the contents of shreddedGood?
Shouldn't some of this be nested into a Redshift-specific sub-object?

Oct 19 '17 18:10 alexanderdean

Doesn't outputCompression relate to the contents of shreddedGood?

Yep, probably. But just as much as rdbShredder does.

Shouldn't some of this be nested into a Redshift-specific sub-object?

Just jsonpathAssets, I believe? But for me, JSONPaths a) although used only in Redshift, can be considered a general tool; b) may be removed altogether in the future. I wouldn't consider them important enough to restructure a property that can be a simple string into a complex object.

Another example could look like:

{
  "schema": "iglu:com.snowplowanalytics.rdbloader/config/jsonschema/1-0-0",
  "data": {
    "shredded": {
      "bucket": "s3://snowplow-acme-processing/shredded/good/",
      "compression": "GZIP",
      "shredder": "0.13.0"
    },
    "jsonpaths": "s3://snowplow-acme-assets/optional/",
    "logs": "s3://snowplow-acme-processing/logs/",
    "tracking": {
      "method": "POST",
      "appId": "acme-loading",
      "collector": "collector.acme.com:8080"
    }
  }
}
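To make the difference between the two layouts concrete, the regrouping can be written as a small mapping from the first example's `data` object to the alternative one. This is only a Python sketch for comparing the two proposals; the function name `regroup` is hypothetical and nothing like it is proposed for the Loader itself.

```python
def regroup(data: dict) -> dict:
    """Map the first proposed layout onto the alternative one (hypothetical helper)."""
    buckets = data["buckets"]
    regrouped = {
        # Everything describing the shredded output is grouped together.
        "shredded": {
            "bucket": buckets["shreddedGood"],
            "compression": data["outputCompression"],
            "shredder": data["rdbShredder"],
        },
        "logs": buckets["log"],
    }
    # Optional properties are carried over only when present.
    if "jsonpathAssets" in buckets:
        regrouped["jsonpaths"] = buckets["jsonpathAssets"]
    if "tracking" in data:
        regrouped["tracking"] = data["tracking"]
    return regrouped


first_layout = {
    "buckets": {
        "shreddedGood": "s3://snowplow-acme-processing/shredded/good/",
        "log": "s3://snowplow-acme-processing/logs/",
        "jsonpathAssets": "s3://snowplow-acme-assets/optional/",
    },
    "outputCompression": "GZIP",
    "rdbShredder": "0.13.0",
}
print(regroup(first_layout))
```

The JSONPaths location stays a plain string in both layouts, which matches the point above about not turning it into a complex object.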

Oct 22 '17 20:10 chuwy

Thanks @chuwy - I like the suggested alternative, it feels like a more coherent grouping...

Oct 22 '17 20:10 alexanderdean

This should be a common Shredder/Loader configuration. Unlike our target configuration, which is a low-level connection description, this is an application configuration.

The same should be done for strawberry: right now a single config (which is supposed to describe the connection) includes unnecessary details about the manifest and the enriched archive.

Dec 21 '17 05:12 chuwy

Totally agree. It's worth looking at the "other" RT apps, like the Scala Stream Collector and Stream Enrich. That has to be the direction of travel...

Dec 22 '17 13:12 alexanderdean

Pushing this back as it'll require an associated Snowplow/EmrEtlRunner release.

Jan 08 '18 14:01 chuwy

Though we can implement it in a backward-compatible way, e.g. the new Loader can understand the following options:

--target-config $TARGET_JSON and --loader-config $LOADER_JSON for upcoming EER
--config $CONFIG_YML with deprecation warning for older/current EER

EmrEtlRunner also has to decide what format to use.
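The backward-compatible option handling could look roughly like the sketch below. It is written in Python with `argparse` purely for illustration. Only the three option names come from the list above. The function name `parse_args`, the "split"/"legacy" labels, the error messages and the decision to reject mixed usage are all assumptions, and the sketch does not address what the option values contain.

```python
import argparse
import warnings


def parse_args(argv):
    """Accept either the new pair of options or the legacy single option (hypothetical helper)."""
    parser = argparse.ArgumentParser(prog="loader-sketch")
    parser.add_argument("--target-config")  # new style, for upcoming EER
    parser.add_argument("--loader-config")  # new style, for upcoming EER
    parser.add_argument("--config")         # legacy style, for older/current EER
    args = parser.parse_args(argv)

    uses_new = args.target_config is not None or args.loader_config is not None
    if args.config is not None:
        if uses_new:
            # Assumption: mixing the two styles is treated as an error.
            parser.error("--config cannot be combined with --target-config/--loader-config")
        warnings.warn(
            "--config is deprecated; use --target-config and --loader-config",
            DeprecationWarning,
            stacklevel=2,
        )
        return "legacy", args
    if args.target_config is None or args.loader_config is None:
        parser.error("expected both --target-config and --loader-config, or legacy --config")
    return "split", args


mode, args = parse_args(["--target-config", "target.json", "--loader-config", "loader.json"])
print(mode, args.target_config, args.loader_config)
```

With this shape, an older EmrEtlRunner keeps working and only sees a deprecation warning, while a newer one passes the two configurations separately.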

Jan 08 '18 14:01 chuwy

Yes, I like that approach. Good to push back though.

Jan 08 '18 14:01 alexanderdean
