snowplow-rdb-loader
snowplow-rdb-loader copied to clipboard
Common: simplify configuration
Migrated from https://github.com/snowplow/snowplow/issues/3279
Needs further thought where we go here.
Example JSON with all required properties:
{
"schema": "iglu:com.snowplowanalytics.rdbloader/config/jsonschema/1-0-0",
"data": {
"buckets": {
"shreddedGood": "s3://snowplow-acme-processing/shredded/good/",
"log": "s3://snowplow-acme-processing/logs/",
"jsonpathAssets": "s3://snowplow-acme-assets/optional/"
},
"outputCompression": "GZIP",
"rdbShredder": "0.13.0",
"tracking": {
"method": "POST",
"appId": "acme-loading",
"collector": "collector.acme.com:8080"
}
}
}
buckets.jsonpathAssets
and tracking
are optional.
Uses own vendor following dataflow runner and factotum examples.
I think this is a good start! Comments:
- Doesn't
outputCompression
relate to the contents ofshreddedGood
? - Shouldn't some of this be nested into a Redshift-specific sub-object?
Doesn't outputCompression relate to the contents of shreddedGood?
Yep, probably. But as much as rdbShredder
.
Shouldn't some of this be nested into a Redshift-specific sub-object?
Just jsonpathAssets
, I believe? But for me JSONPaths a) although used only in Redshift - can be considered general as general tool; b) can be removed at all in future - I wouldn't consider them important enough to re-structure property that can be simple string - into complex object.
Another example could look like:
{
"schema": "iglu:com.snowplowanalytics.rdbloader/config/jsonschema/1-0-0",
"data": {
"shredded": {
"bucket": "s3://snowplow-acme-processing/shredded/good/",
"compression": "GZIP",
"shredder": "0.13.0"
},
"jsonpaths": "s3://snowplow-acme-assets/optional/",
"logs": "s3://snowplow-acme-processing/logs/",
"tracking": {
"method": "POST",
"appId": "acme-loading",
"collector": "collector.acme.com:8080"
}
}
}
Thanks @chuwy - I like the suggested alternative, it feels like a more coherent grouping...
This should be a Shredder/Loader common configuration. As opposed to our target configuration, which is a low-level connection description - this is an application configuration.
Same should be done for strawberry: right now single config (that supposed to describe connection) includes unnecessary details about manifest and enriched archive.
Totally agree. It's worth looking at the "other" RT apps, like the Scala Stream Collector and Stream Enrich. That has to be the direction of travel...
Pushing this back as it'll require associated Snowplow/EmrEtlRunner release.
Though, we can implement it in backward-compatible way, e.g. new Loader can understand following options:
-
--target-config $TARGET_JSON
and--loader-config $LOADER_JSON
for upcoming EER -
--config $CONFIG_YML
with deprecation warning for older/current EER
EmrEtlRunner also has to decide what format to use.
Yes I like that approach. Good to push back though.