snowplow-rdb-loader

Common: simplify configuration

Open chuwy opened this issue 7 years ago • 10 comments

Migrated from https://github.com/snowplow/snowplow/issues/3279

chuwy avatar Aug 17 '17 13:08 chuwy

This needs further thought on where we go from here.

alexanderdean avatar Sep 20 '17 09:09 alexanderdean

Example JSON with all required properties:

{
  "schema": "iglu:com.snowplowanalytics.rdbloader/config/jsonschema/1-0-0",
  "data": {
    "buckets": {
      "shreddedGood": "s3://snowplow-acme-processing/shredded/good/",
      "log": "s3://snowplow-acme-processing/logs/",
      "jsonpathAssets": "s3://snowplow-acme-assets/optional/"
    },
    "outputCompression": "GZIP",
    "rdbShredder": "0.13.0",
    "tracking": {
      "method": "POST",
      "appId": "acme-loading",
      "collector": "collector.acme.com:8080"
    }
  }
}

buckets.jsonpathAssets and tracking are optional.

The schema uses its own vendor (com.snowplowanalytics.rdbloader), following the Dataflow Runner and Factotum examples.
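As a rough illustration of the required/optional split described above, here is a minimal Python sketch that reports which required properties are missing from a config. The required set is inferred only from this comment, the function name `missing_properties` is hypothetical, and this is not the Loader's actual validation, which would presumably be done against the Iglu schema named in `schema`.

```python
import json

# Required keys as implied by the comment above (everything except
# buckets.jsonpathAssets and tracking). This list is an assumption.
REQUIRED = ("buckets", "outputCompression", "rdbShredder")
REQUIRED_BUCKETS = ("shreddedGood", "log")


def missing_properties(config_json):
    """Return dotted paths of required properties absent from the config."""
    data = json.loads(config_json).get("data", {})
    missing = [key for key in REQUIRED if key not in data]
    # Only inspect bucket sub-keys when the buckets object itself is present,
    # so a missing "buckets" is reported once rather than three times.
    if "buckets" in data:
        buckets = data["buckets"]
        missing += ["buckets." + key for key in REQUIRED_BUCKETS if key not in buckets]
    return missing
```

An empty list means every required property is present; optional properties are never reported.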

chuwy avatar Oct 19 '17 11:10 chuwy

I think this is a good start! Comments:

  • Doesn't outputCompression relate to the contents of shreddedGood?
  • Shouldn't some of this be nested into a Redshift-specific sub-object?

alexanderdean avatar Oct 19 '17 18:10 alexanderdean

Doesn't outputCompression relate to the contents of shreddedGood?

Yep, probably. But rdbShredder relates to it just as much.

Shouldn't some of this be nested into a Redshift-specific sub-object?

Just jsonpathAssets, I believe? But to me, JSONPaths a) although used only for Redshift, can be considered a general-purpose tool; and b) may be removed altogether in the future. I wouldn't consider them important enough to restructure a property that can be a simple string into a complex object.

Another example could look like:

{
  "schema": "iglu:com.snowplowanalytics.rdbloader/config/jsonschema/1-0-0",
  "data": {
    "shredded": {
      "bucket": "s3://snowplow-acme-processing/shredded/good/",
      "compression": "GZIP",
      "shredder": "0.13.0"
    },
    "jsonpaths": "s3://snowplow-acme-assets/optional/",
    "logs": "s3://snowplow-acme-processing/logs/",
    "tracking": {
      "method": "POST",
      "appId": "acme-loading",
      "collector": "collector.acme.com:8080"
    }
  }
}
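To make the regrouping concrete, here is a small Python sketch that maps the `data` object of the first proposal onto this alternative layout. The function name `regroup` is hypothetical and the mapping is read off the two examples in this thread; it is an illustration, not part of the Loader.

```python
def regroup(flat):
    """Map the first proposal's `data` object onto the alternative layout."""
    buckets = flat["buckets"]
    nested = {
        # Everything describing the shredded output now lives together.
        "shredded": {
            "bucket": buckets["shreddedGood"],
            "compression": flat["outputCompression"],
            "shredder": flat["rdbShredder"],
        },
        "logs": buckets["log"],
    }
    # jsonpathAssets and tracking are optional, so copy them only when present.
    if "jsonpathAssets" in buckets:
        nested["jsonpaths"] = buckets["jsonpathAssets"]
    if "tracking" in flat:
        nested["tracking"] = flat["tracking"]
    return nested
```

The point of the sketch is that no information is lost: the alternative only moves `outputCompression` and `rdbShredder` next to the bucket they describe and flattens the remaining paths into plain strings.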

chuwy avatar Oct 22 '17 20:10 chuwy

Thanks @chuwy - I like the suggested alternative, it feels like a more coherent grouping...

alexanderdean avatar Oct 22 '17 20:10 alexanderdean

This should be a configuration common to Shredder and Loader. Unlike our target configuration, which is a low-level connection description, this is an application configuration.

The same should be done for strawberry: right now a single config (which is supposed to describe the connection) includes unnecessary details about the manifest and the enriched archive.

chuwy avatar Dec 21 '17 05:12 chuwy

Totally agree. It's worth looking at the "other" RT apps, like the Scala Stream Collector and Stream Enrich. That has to be the direction of travel...

alexanderdean avatar Dec 22 '17 13:12 alexanderdean

Pushing this back as it'll require an associated Snowplow/EmrEtlRunner release.

chuwy avatar Jan 08 '18 14:01 chuwy

That said, we can implement it in a backward-compatible way, e.g. the new Loader could understand the following options:

  • --target-config $TARGET_JSON and --loader-config $LOADER_JSON for the upcoming EER (EmrEtlRunner)
  • --config $CONFIG_YML with a deprecation warning for the older/current EER

EmrEtlRunner also has to decide what format to use.
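The backward-compatible option handling could look roughly like the Python sketch below. Only the three flag names come from this comment; the `parse_args` function, the "split"/"legacy" mode labels, and the error messages are assumptions made for illustration, and this is not the Loader's actual CLI code.

```python
import argparse
import warnings


def parse_args(argv):
    """Return (mode, args), where mode is "split" or "legacy" (hypothetical labels)."""
    parser = argparse.ArgumentParser(prog="rdb-loader-sketch")
    parser.add_argument("--target-config")  # new style: connection description
    parser.add_argument("--loader-config")  # new style: application configuration
    parser.add_argument("--config")         # legacy: single config.yml
    args = parser.parse_args(argv)

    has_new = args.target_config is not None or args.loader_config is not None
    if args.config is not None:
        if has_new:
            parser.error("--config cannot be combined with the new options")
        # Older/current EmrEtlRunner path: still accepted, but flagged.
        warnings.warn(
            "--config is deprecated; use --target-config and --loader-config",
            DeprecationWarning,
        )
        return "legacy", args
    if args.target_config is not None and args.loader_config is not None:
        return "split", args
    parser.error("expected --target-config with --loader-config, or --config")
```

For example, `parse_args(["--config", "config.yml"])` takes the legacy path and emits a `DeprecationWarning`, while passing both new flags selects the split configuration.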

chuwy avatar Jan 08 '18 14:01 chuwy

Yes, I like that approach. Good to push back though.

alexanderdean avatar Jan 08 '18 14:01 alexanderdean