
Develop and test an enableDarkSocial function

Open yalisassoon opened this issue 8 years ago • 8 comments

The original issue was part of the main Snowplow project, i.e. it was raised long before the JS tracker became a standalone project.

The idea is to enable Snowplow users to track social shares the same way that sites like Buzzfeed do them, by:

  1. Adding an element to the URL, e.g. a URL fragment or querystring parameter
  2. Tracking the element that is added to the URL
  3. When a user shares the URL (by any channel, including social / email / IM / bookmark etc.) and someone follows it, extracting that element into its own derived context
  4. At the data modeling step build out the graph of referrals described in the Buzzfeed post

The Snowplow pipeline has evolved significantly since the original issue was raised. My initial suggestion (but only a suggestion - let's iterate the approach in this ticket):

  1. Use the existing page view ID (we already have a unique ID generated with every page view) as a shareToUrlId
  2. Have a JS process that checks if the current page is a shared page (i.e. there is already a suitably formatted shareToUrlId in the URL as a fragment or name/value pair on the querystring)
  3. If there is not, then add one to the page URL using the HTML5 pushState API and track an addShareIdToUrl event with the page view ID
  4. If there is one, then fire a foundShareIdOnUrl event with the relevant ID captured from the URL
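The check-and-tag flow in steps 2-4 could be sketched as follows. This is a rough illustration only: the querystring parameter name spShareId and the decideShareEvent helper are assumptions, not part of the tracker.

```javascript
// Hypothetical querystring parameter carrying the share ID
var SHARE_PARAM = 'spShareId';

// Decide which event to fire for the current page URL (steps 2-4).
// Returns { event: 'foundShareIdOnUrl', id: <id from URL> } when the page
// was reached via a shared URL, or { event: 'addShareIdToUrl', id, newUrl }
// when the URL should be tagged with this page view's ID.
function decideShareEvent(pageUrl, pageViewId) {
  var url = new URL(pageUrl);
  var existing = url.searchParams.get(SHARE_PARAM);
  if (existing !== null) {
    // Step 4: a suitably formatted ID is already on the querystring
    return { event: 'foundShareIdOnUrl', id: existing };
  }
  // Step 3: tag the URL with this page view's ID
  url.searchParams.set(SHARE_PARAM, pageViewId);
  return { event: 'addShareIdToUrl', id: pageViewId, newUrl: url.toString() };
}
```

In the browser, the caller would then rewrite the address bar without a reload, e.g. `history.pushState(null, '', result.newUrl)`, before tracking the returned event.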

In addition we would then have a separate enrichment process that fetches the ID from the URL and loads it into a derived context.

Finally we'd have a step that ran as part of the data modeling that built the sharing graph.

Questions / issues:

  1. How should we modify the URL: querystring or fragment? Why has Buzzfeed gone with a fragment?
  2. Should we be tracking addShareIdToUrl and foundShareIdOnUrl as discrete events? I think this is a good idea, because the alternative (pulling out the ID as an enrichment and then inferring sharing at the data modeling step) is more fragile: you have to assume that a URL whose ID matches the page view ID had the ID appended on that view, and that any other URL is a shared one. But maybe that's OK?

cc @fblundun @alexanderdean @richardfergie @kingo55 @msmallcombe

yalisassoon avatar Jan 11 '16 15:01 yalisassoon

Adding it to the fragment is problematic since it may interfere with an existing fragment. For example, in the URL

https://github.com/snowplow/snowplow/wiki/Configuring%20the%20Clojure%20collector#enable-connection-draining-for-your-elastic-beanstalk-instance

any change to the fragment will prevent the browser from jumping to the enable-connection-draining-for-your-elastic-beanstalk-instance element.

Using the querystring is less likely to cause this sort of problem.

fblundun avatar Jan 11 '16 15:01 fblundun

It's great to have this ticket back in the frame! A few observations on how Buzzfeed does it, as the articles on Pound are somewhat vague:

  1. Buzzfeed uses a short hash to identify each sharing node in the tree. This is deliberately a very short string (much shorter than a UUID) to reduce the chances of it being clipped/truncated during dark social sharing
  2. When you click on this URI: http://www.buzzfeed.com/catesish/help-am-i-going-insane-its-definitely-blue, on landing on the page you have the URI rewritten to include the hash, e.g. http://www.buzzfeed.com/catesish/help-am-i-going-insane-its-definitely-blue#.vaO5pjMGwM
  3. When you then share your URI (including hash) on a social site and somebody clicks on it, on landing on the page they in turn will have the URI rewritten to include a new hash, to indicate a new node in the sharing tree, e.g. http://www.buzzfeed.com/catesish/help-am-i-going-insane-its-definitely-blue#.topV28ADPg

Coming up with a hash which is densely packed enough (or can be associated with some other metadata e.g. IP address) to minimize collisions but brief enough to avoid truncation is an interesting challenge...

alexanderdean avatar Jan 11 '16 15:01 alexanderdean

It looks like Buzzfeed are using 9 random characters, each of which can be a digit or an uppercase or lowercase letter. That gives 62 ** 9, or about 1.4 * 10 ^ 16, possible strings, so by the birthday approximation you would expect about 1 random collision after generating roughly 165,000,000 (i.e. sqrt(2 * 62 ** 9)) such strings.
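For illustration, here is a sketch of generating such a 9-character base-62 ID and of the birthday-bound estimate above. This is not Buzzfeed's actual algorithm, just the scheme described in this thread.

```javascript
var ALPHABET =
  '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'; // 62 chars

// Generate a random ID of the given length from the base-62 alphabet
function randomShareId(length) {
  var id = '';
  for (var i = 0; i < length; i++) {
    id += ALPHABET.charAt(Math.floor(Math.random() * ALPHABET.length));
  }
  return id;
}

// Birthday approximation: with N possible IDs, the expected number of
// collisions among k random IDs is roughly k^2 / (2N), so ~1 collision
// is expected once k reaches sqrt(2N).
function idsUntilFirstExpectedCollision(length) {
  var n = Math.pow(ALPHABET.length, length); // 62^9 ≈ 1.35e16 for length 9
  return Math.sqrt(2 * n);                   // ≈ 1.65e8 for length 9
}
```

(For production use, a cryptographically secure source such as crypto.getRandomValues would be preferable to Math.random.)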

fblundun avatar Jan 11 '16 16:01 fblundun

See Modifying a querystring without reloading the page.

Unfortunately this technique looks like it doesn't work for older browsers.
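A minimal sketch of that technique with feature detection, falling back to the fragment when history.pushState is unavailable. The parameter name spShareId and the fallback behaviour are assumptions for illustration, not tracker behaviour.

```javascript
// Rewrite the current URL to carry the share ID without reloading the page
function rewriteUrlWithShareId(shareId) {
  if (window.history && typeof window.history.pushState === 'function') {
    // Modern browsers: update the querystring without a reload
    var url = new URL(window.location.href);
    url.searchParams.set('spShareId', shareId); // hypothetical parameter name
    window.history.pushState(null, '', url.toString());
  } else {
    // Older browsers: setting the fragment never triggers a reload, but as
    // noted earlier in the thread it can clobber an existing anchor
    window.location.hash = '.' + shareId;
  }
}
```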

fblundun avatar Jan 11 '16 16:01 fblundun

Of course - you can also do the join using the full URI, not just the hash, so that is plenty of entropy...

alexanderdean avatar Jan 11 '16 16:01 alexanderdean

Here are a couple of articles from Buzzfeed on their solution and what they are able to do with the data collected: http://www.buzzfeed.com/daozers/introducing-pound-process-for-optimizing-and-understanding-n#.arK1yq2by http://www.slideshare.net/g33ktalk/dataengconf-the-science-of-virality-at-buzzfeed

msmallcombe avatar Jan 11 '16 16:01 msmallcombe

I love the tracking hash in the first link!

alexanderdean avatar Jan 11 '16 16:01 alexanderdean

It is worth noting that Buzzfeed used to have their tracking hash on all URLs, including their home page. Recently they made a change to only have it on their content pages.

msmallcombe avatar Jan 11 '16 17:01 msmallcombe