enrich
Common: add an enrichment extracting canonical properties into dedicated contexts
In order to refactor atomic events, we need to extract all non-generic information from the fat table into dedicated contexts and preserve only the common properties. As a first step, we can keep those properties both in the atomic event (as we do now, so as not to break existing data models) and in their dedicated tables/columns (so that new data models can start being written).
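A minimal sketch of that first step, assuming the event is a flat dictionary and `derived_contexts` is modelled as a plain list of self-describing JSON objects (a simplification). The schema URI and the `extract_context` helper are hypothetical and only illustrate the idea; this is not the enrichment's actual implementation.

```python
import copy

# Hypothetical schema URI, for illustration only.
MARKETING_SCHEMA = "iglu:com.example/marketing_campaign/jsonschema/1-0-0"

# Marketing campaign fields as listed in this issue.
MARKETING_FIELDS = [
    "mkt_medium", "mkt_source", "mkt_term", "mkt_content",
    "mkt_campaign", "mkt_clickid", "mkt_network",
]


def extract_context(event, fields, schema):
    """Return a copy of `event` with `fields` duplicated into a derived context.

    The original columns stay in place, so existing data models keep working.
    """
    data = {f: event[f] for f in fields if event.get(f) is not None}
    enriched = copy.deepcopy(event)
    if data:
        contexts = list(enriched.get("derived_contexts") or [])
        contexts.append({"schema": schema, "data": data})
        enriched["derived_contexts"] = contexts
    return enriched


event = {"event_id": "e1", "mkt_medium": "cpc", "mkt_source": "newsletter", "mkt_term": None}
enriched = extract_context(event, MARKETING_FIELDS, MARKETING_SCHEMA)
```

In this sketch `enriched` still carries `mkt_medium` and `mkt_source` as top-level properties, and additionally carries them inside one derived context; empty fields are not copied.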
I tried to summarize what contexts and event-specific properties can be extracted out of Event:
- app_id
- platform
- etl_tstamp
- collector_tstamp
- dvce_created_tstamp
- event
- event_id
- txn_id
- name_tracker
- v_tracker
- v_collector
- v_etl
- user_id
- user_ipaddress
- user_fingerprint
- domain_userid
- domain_sessionidx
- network_userid
- geo_country - MaxMind context
- geo_region - MaxMind context
- geo_city - MaxMind context
- geo_zipcode - MaxMind context
- geo_latitude - MaxMind context
- geo_longitude - MaxMind context
- geo_region_name - MaxMind context
- ip_isp - MaxMind context
- ip_organization - MaxMind context
- ip_domain - MaxMind context
- ip_netspeed - MaxMind context
- page_url - Web page context (source of truth)
- page_title - Web page context (source of truth)
- page_referrer - Referrer context (source of truth)
- page_urlscheme - Web page context
- page_urlhost - Web page context
- page_urlport - Web page context
- page_urlpath - Web page context
- page_urlquery - Web page context
- page_urlfragment - Web page context
- refr_urlscheme - Referrer context
- refr_urlhost - Referrer context
- refr_urlport - Referrer context
- refr_urlpath - Referrer context
- refr_urlquery - Referrer context
- refr_urlfragment - Referrer context
- refr_medium - Referrer context
- refr_source - Referrer context
- refr_term - Referrer context
- mkt_medium - Marketing campaign context
- mkt_source - Marketing campaign context
- mkt_term - Marketing campaign context
- mkt_content - Marketing campaign context
- mkt_campaign - Marketing campaign context
- contexts
- se_category - Struct event self-describing event
- se_action - Struct event self-describing event
- se_label - Struct event self-describing event
- se_property - Struct event self-describing event
- se_value - Struct event self-describing event
- unstruct_event
- tr_orderid - Ecommerce transaction self-describing event
- tr_affiliation - Ecommerce transaction self-describing event
- tr_total - Ecommerce transaction self-describing event
- tr_tax - Ecommerce transaction self-describing event
- tr_shipping - Ecommerce transaction self-describing event
- tr_city - Ecommerce transaction self-describing event
- tr_state - Ecommerce transaction self-describing event
- tr_country - Ecommerce transaction self-describing event
- ti_orderid - Ecommerce transaction item context
- ti_sku - Ecommerce transaction item context
- ti_name - Ecommerce transaction item context
- ti_category - Ecommerce transaction item context
- ti_price - Ecommerce transaction item context
- ti_quantity - Ecommerce transaction item context
- pp_xoffset_min - Page ping self-describing event
- pp_xoffset_max - Page ping self-describing event
- pp_yoffset_min - Page ping self-describing event
- pp_yoffset_max - Page ping self-describing event
- useragent - Browser context (but populated from different places)
- br_name - Browser context (but populated from different places) (ua-utils)
- br_family - Browser context (but populated from different places) (ua-utils)
- br_version - Browser context (but populated from different places) (ua-utils)
- br_type - Browser context (but populated from different places) (ua-utils)
- br_renderengine - Browser context (but populated from different places) (ua-utils)
- br_lang - Browser context (but populated from different places)
- br_features_pdf - Browser context (but populated from different places)
- br_features_flash - Browser context (but populated from different places)
- br_features_java - Browser context (but populated from different places)
- br_features_director - Browser context (but populated from different places)
- br_features_quicktime - Browser context (but populated from different places)
- br_features_realplayer - Browser context (but populated from different places)
- br_features_windowsmedia - Browser context (but populated from different places)
- br_features_gears - Browser context (but populated from different places)
- br_features_silverlight - Browser context (but populated from different places)
- br_cookies - Browser context (but populated from different places)
- br_colordepth - Browser context (but populated from different places)
- br_viewwidth - Browser context (but populated from different places)
- br_viewheight - Browser context (but populated from different places)
- os_name - Browser context (but populated from different places) (ua-utils)
- os_family - Browser context (but populated from different places) (ua-utils)
- os_manufacturer - Browser context (but populated from different places)
- os_timezone - Browser context (but populated from different places)
- dvce_type - Browser context (but populated from different places) (ua-utils)
- dvce_ismobile - Browser context (but populated from different places) (ua-utils)
- dvce_screenwidth - Browser context (but populated from different places)
- dvce_screenheight - Browser context (but populated from different places)
- doc_charset - Web page (or document) context
- doc_width - Web page (or document) context
- doc_height - Web page (or document) context
- tr_currency - Ecommerce transaction self-describing event
- tr_total_base - Ecommerce transaction self-describing event
- tr_tax_base - Ecommerce transaction self-describing event
- tr_shipping_base - Ecommerce transaction self-describing event
- ti_currency - Ecommerce transaction item context
- ti_price_base - Ecommerce transaction item context
- base_currency - Ecommerce transaction self-describing event
- geo_timezone - MaxMind context
- mkt_clickid - Marketing campaign context
- mkt_network - Marketing campaign context
- etl_tags
- dvce_sent_tstamp
- refr_domain_userid - Referrer context
- refr_dvce_tstamp - Referrer context
- derived_contexts
- domain_sessionid
- derived_tstamp
- event_vendor
- event_name
- event_format
- event_version
- event_fingerprint - This should remain in canonical event
- true_tstamp
This grouping is not very semantic; it is based mostly on the source of the information. For example, although browser and device info is semantically the same kind of information, some of the properties are passed through the tracker protocol and some are derived through user-agent enrichment.
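To make the "group by source" idea concrete, here is a small sketch that splits the browser/device fields of a flat event into two groups: those marked `(ua-utils)` in the list above and the remaining browser-context fields. The `split_browser_fields` helper and the sample values are hypothetical; the field names come from the list.

```python
# Fields marked "(ua-utils)" in the list above: derived by user-agent enrichment.
UA_DERIVED_FIELDS = frozenset({
    "br_name", "br_family", "br_version", "br_type", "br_renderengine",
    "os_name", "os_family", "dvce_type", "dvce_ismobile",
})

# Remaining browser-context fields from the list, populated from other places.
OTHER_BROWSER_FIELDS = frozenset({
    "useragent", "br_lang", "br_cookies", "br_colordepth",
    "br_viewwidth", "br_viewheight", "os_manufacturer", "os_timezone",
    "dvce_screenwidth", "dvce_screenheight",
}) | frozenset(
    "br_features_" + name
    for name in ("pdf", "flash", "java", "director", "quicktime",
                 "realplayer", "windowsmedia", "gears", "silverlight")
)


def split_browser_fields(event):
    """Split browser/device fields into (other_sources, ua_derived) dicts."""
    other = {k: v for k, v in event.items()
             if k in OTHER_BROWSER_FIELDS and v is not None}
    ua_derived = {k: v for k, v in event.items()
                  if k in UA_DERIVED_FIELDS and v is not None}
    return other, ua_derived


event = {
    "useragent": "Mozilla/5.0",
    "br_lang": "en-GB",
    "br_family": "Chrome",
    "dvce_ismobile": False,
    "dvce_created_tstamp": "2020-01-01 00:00:00",  # a timestamp, not browser info
}
other, ua_derived = split_browser_fields(event)
```

Explicit field sets are used rather than prefix matching, because prefixes such as `dvce_` also cover timestamps that do not belong to a browser context. Each of the two groups could then back its own context.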
Contexts
- MaxMind context
- Web page context
- Referrer context
- Marketing campaign context
- Ecommerce transaction item
- Browser/device context (potentially multiple of them)
Self-describing events
- Struct event
- Ecommerce transaction
- Page ping
Common properties
This leaves us with 31 core properties that can be set for almost all events/pipelines. Some of them (user/device identification) could, and perhaps should, be moved into dedicated contexts.
- event_id - event identification
- app_id - event identification
- event - eventually will be discarded in favor of vendor/name/version
- txn_id - event identification
- event_vendor - event identification
- event_name - event identification
- event_format - event identification
- event_version - event identification
- event_fingerprint - event identification
- platform - probably should be moved as well
- dvce_created_tstamp - timestamps
- dvce_sent_tstamp - timestamps
- collector_tstamp - timestamps
- etl_tstamp - timestamps
- derived_tstamp - timestamps
- true_tstamp - timestamps
- user_id - user/device identification
- user_ipaddress - user/device identification
- user_fingerprint - user/device identification
- domain_userid - user/device identification
- domain_sessionidx - user/device identification
- domain_sessionid - user/device identification
- network_userid - user/device identification
- name_tracker - pipeline/aux
- v_tracker - pipeline/aux
- v_collector - pipeline/aux
- v_etl - pipeline/aux
- etl_tags - pipeline/aux
- unstruct_event - payload
- contexts - payload
- derived_contexts - payload
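The grouping of common properties can be written down as a lookup, which also makes the count of 31 easy to check. This is only a sketch: the `under review` group name and the `split_core` helper are made up for illustration, and the event is assumed to be a flat dictionary.

```python
# Common properties grouped as in the list above.
CORE_GROUPS = {
    "event identification": [
        "event_id", "app_id", "txn_id", "event_vendor", "event_name",
        "event_format", "event_version", "event_fingerprint",
    ],
    # "event" may be discarded and "platform" may be moved (hypothetical group name).
    "under review": ["event", "platform"],
    "timestamps": [
        "dvce_created_tstamp", "dvce_sent_tstamp", "collector_tstamp",
        "etl_tstamp", "derived_tstamp", "true_tstamp",
    ],
    "user/device identification": [
        "user_id", "user_ipaddress", "user_fingerprint", "domain_userid",
        "domain_sessionidx", "domain_sessionid", "network_userid",
    ],
    "pipeline/aux": ["name_tracker", "v_tracker", "v_collector", "v_etl", "etl_tags"],
    "payload": ["unstruct_event", "contexts", "derived_contexts"],
}

CORE_PROPERTIES = frozenset(
    name for names in CORE_GROUPS.values() for name in names
)


def split_core(event):
    """Split a flat event into (core properties, everything else)."""
    core = {k: v for k, v in event.items() if k in CORE_PROPERTIES}
    rest = {k: v for k, v in event.items() if k not in CORE_PROPERTIES}
    return core, rest


core, rest = split_core({"event_id": "e1", "app_id": "shop", "geo_country": "DE"})
```

Here `rest` holds the fields that would move into dedicated contexts or self-describing events, such as `geo_country` in this example.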
Migrated from https://github.com/snowplow/snowplow/issues/4244 (comments are auto-generated)
I've created a spreadsheet, proposing what new contexts and events should look like: https://docs.google.com/spreadsheets/d/1UaXrH92IvRWyXNU8wUQ-oxvEI9kJxoxbIcbRjna7RAI/edit#gid=0
@chuwy do you have enrichments config for full atomic schema?
Hi @BioQwer, which config are you referring to? FYI, this issue is still on our roadmap but has not been prioritized yet.
I work with the Open Source version. I have many empty values in the atomic columns.