gcp-ingestion
Parse channel from activity-stream pings
Impression-stats and other docTypes have a release top-level field that should be used in the pipeline as input to normalized_channel. Currently, their normalized_channel is null.
We should probably encode this in the JSON schemas under mozPipelineMetadata as a new normalized_channel_source field and have the pipeline use that to decide where to look.
We also have cases where we want to use a static value for channel. For Fenix, we codify the value for app_channel in https://github.com/mozilla/probe-scraper/blob/main/repositories.yaml
We could represent that in the generated JSON schemas as a static value.
So perhaps we should have mozPipelineMetadata like the following:
"static_fields": {
  "attribute": "normalized_channel",
  "static_value": "release"
},
"fields_from_payload": {
  "attribute": "normalized_channel",
  "source_path": "#/channel"
}
That would make this more generally applicable than supporting just normalized_channel. We'd have to think carefully about the interface and what to call the fields.
Thinking more about the interface, this could be cast as attribute_mappings, similar to the existing jwe_mappings. Each mapping would have a required attribute field and then either a static_value or a source_path field.
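To make the proposed interface concrete, here is a minimal Python sketch of how a consumer might apply a list of attribute_mappings to a parsed payload. This is an illustration only, not the pipeline's actual code: the function names resolve_pointer and apply_attribute_mappings are hypothetical, the path resolver handles only simple "#/key/subkey" object paths, and skipping a mapping whose path is absent is an assumed behavior.

```python
from typing import Any, Dict, List, Optional


def resolve_pointer(payload: Any, pointer: str) -> Optional[Any]:
    """Resolve a simple path such as "#/channel" against a parsed payload.

    Hypothetical helper: handles object keys only (no array indices or
    escape sequences) and returns None if any segment is missing.
    """
    if not pointer.startswith("#/"):
        raise ValueError(f"unsupported source_path: {pointer!r}")
    node = payload
    for segment in pointer[2:].split("/"):
        if not isinstance(node, dict) or segment not in node:
            return None
        node = node[segment]
    return node


def apply_attribute_mappings(
    mappings: List[Dict[str, Any]],
    payload: Dict[str, Any],
    attributes: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Return a copy of attributes with each mapping applied in order."""
    result = dict(attributes or {})
    for mapping in mappings:
        attribute = mapping["attribute"]  # required on every mapping
        has_static = "static_value" in mapping
        has_path = "source_path" in mapping
        # Exactly one of static_value / source_path must be present.
        if has_static == has_path:
            raise ValueError(
                f"mapping for {attribute!r} needs exactly one of "
                "static_value or source_path"
            )
        if has_static:
            value = mapping["static_value"]
        else:
            value = resolve_pointer(payload, mapping["source_path"])
        # Assumption: a missing payload field leaves the attribute unset.
        if value is not None:
            result[attribute] = value
    return result
```

For example, a mapping of {"attribute": "app_update_channel", "source_path": "#/channel"} applied to a payload of {"channel": "beta"} would set app_update_channel to "beta", while a static_value mapping ignores the payload entirely.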
For a value like normalized_channel, though, this isn't quite powerful enough. The source_path would generally point to a "raw" channel identifier; the value still needs to go through the NormalizeAttributes#channel logic. So I suppose we'd be populating the app_update_channel attribute via this metadata, and we'd still rely on the pipeline knowing to use that attribute as the source for normalization.
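The two-step flow described here (metadata populates the raw attribute, then the pipeline normalizes it) might look roughly like the following Python sketch. The normalize_channel function is a simplified stand-in, not the real NormalizeAttributes#channel logic, and the set of known channels and the "Other" fallback are assumptions made for illustration.

```python
from typing import Dict, Optional

# Assumed set of recognized channels, for illustration only.
KNOWN_CHANNELS = {"release", "esr", "beta", "aurora", "nightly"}


def normalize_channel(raw: Optional[str]) -> Optional[str]:
    """Simplified stand-in for the pipeline's channel normalization."""
    if raw is None:
        return None
    return raw if raw in KNOWN_CHANNELS else "Other"


def normalize_attributes(attributes: Dict[str, str]) -> Dict[str, str]:
    """Derive normalized_channel from app_update_channel when it is set.

    The raw app_update_channel value is expected to have been populated
    earlier, e.g. from schema metadata; this step only normalizes it.
    """
    out = dict(attributes)
    raw = out.get("app_update_channel")
    if raw is not None:
        out["normalized_channel"] = normalize_channel(raw)
    return out
```

The point of the split is that the schema metadata only says where the raw value lives; the decision about which attribute feeds normalization stays in pipeline code.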