
More tokens to strip

Open wumpus opened this issue 6 years ago • 2 comments

Hi. I'm a search engine guy, and I'm very interested in a well-tested list of strippable CGI args to reduce the work my crawler has to do. I tried to build a list algorithmically: I took the top 1,000 websites from an old Alexa list, plus a few hosts I care about, sampled their URLs as crawled by CommonCrawl, and counted which CGI args appeared on many of the hosts.

The biggest was &utm_source, appearing on 474 of the 1,000 hosts. I dropped everything that appeared on fewer than 5 hosts. So, in theory, this is a somewhat representative sample of the most popular ones... although CommonCrawl isn't totally representative of the web, of course.
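For anyone who wants to reproduce the counting, here is a rough Python sketch of the idea (not the exact script used). The function name `count_param_hosts` and the `min_hosts` parameter are made up for illustration, and it assumes the sampled URLs are already available as a plain list of strings.

```python
from collections import defaultdict
from urllib.parse import urlsplit, parse_qsl


def count_param_hosts(urls, min_hosts=5):
    """Return {param_name: number_of_distinct_hosts} for params seen on
    at least `min_hosts` distinct hosts. Hypothetical helper, not part of
    url-tracking-stripper."""
    hosts_by_param = defaultdict(set)
    for url in urls:
        parts = urlsplit(url)
        host = parts.hostname
        if not host:
            continue  # skip relative or malformed URLs
        # keep_blank_values so that "?foo=" still counts as a sighting of foo
        for name, _value in parse_qsl(parts.query, keep_blank_values=True):
            hosts_by_param[name].add(host)
    return {
        name: len(hosts)
        for name, hosts in hosts_by_param.items()
        if len(hosts) >= min_hosts
    }
```

Counting distinct hosts rather than total instances keeps one chatty site from dominating the list, which is why both numbers are reported separately in the list further down.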

Here is a list with examples of the ones that aren't currently in your configuration:

# more utm_ -- I think people use utm_ as a prefix for their own purposes and/or Google doesn't document all of them

# https://www.mozilla.org/en-US/firefox/new/?f=30&ref=producthunt&utm_expid=71153379-28.SNKFJ4VqRziIW1TLqjhpAw.1&utm_referrer=https%3A%2F%2Fwww.google.com%2F

utm_expid (15 hosts)
utm_referrer (12 hosts)

# https://www.etsy.com/?utm_source=google&utm_medium=cpc&utm_term=etsy&utm_campaign=search_fr_fr-fr-src-pure-brand-exact-st_exact_etsy&gclid=EAIaIQobChMIk6Duvp6n1QIVjantCh1f-whGEAAYASAAEgLsx_D_BwE&gclsrc=aw.ds

gclsrc 22 hosts

# https://www.google.fr/chrome/browser/features.html?brand=CHBD&gclid=CN6B2tjusdECFVAQ0wodfmcISw&dclid=CM6vjtnusdECFcSjUQodyg4B2Q

dclid 21 hosts {similar to gclid?}

normally cookies

# Adobe ColdFusion
# https://techcrunch.com/?CFID=8494701&CFTOKEN=56974155

&CFID= 25 hosts, 70 total instances
&CFTOKEN= 25 hosts, 70 total instances

# PHP
# http://instagram.com/p/BUPpEcIDFjT/?PHPSESSID=dbj4v5fl2c6sd8f8986aprqpf3

&PHPSESSID= 5 hosts, 89 total instances

and here are the popular ones that you don't have at all:

# Web Trends

# http://www.nature.com/collections/dtfkmdgglg?WT.mc_id=SFB_NA_1017_FattyLiverGraphic
# https://www.microsoft.com/en-us/store/b/accessories?tid=vpOCJmmq&cid=5250&pcrid=3050714533&pkw=makerbot%20replicator%202%20desktop%203d%20printer&pmt=e&WT.srch=1&WT.mc_id=pointitsem_Microsoft+US_bing_5+-+Accessories&WT.term=make
# https://www.chase.com/ccp/index.jsp?pg_name=ccpmapp/shared/assets/page/repayment_examples&WT.ac=st_ctr_student&jp_aid=st_ctr_student&WT.mc_id=st_ctr_student_repayment&jp_mep=st_ctr_student_repayment&WT.pn_sku=repayment_plans&memberid=studentcenter
# https://www.intuit.com/company/press-room/press-releases/2013/QuickenPullsBacktheCoversonLoveandMoney/?WT.qs_osrc=TST-164886110

&WT.mc_id= 24 hosts, 2530 total instances
&WT.srch= 14 hosts, 422 total instances
&WT.ac= 8 hosts, 4094 total instances
&WT.qs_osrc= 5 hosts, 20 total instances
&WT.pn_sku

# Oracle Eloqua

# http://www.cray.com/company/policies-and-practices/privacy-policy?elqTrackId=2e97d2d4f56e41eb9498379bab9753db&elqaid=584&elqat=2
# http://www.blackboard.com/Platforms/Collaborate/Resources/Webinars-and-Demos.aspx?elq=a318adfc3e7e40de83e0883a1d6760ba&elqCampaignId=329

&elqTrackId= 12 hosts, 191 total instances
&elqaid= 12 hosts, 189 total instances
&elqat= 12 hosts, 189 total instances
&elqCampaignId= 7 hosts, 138 total instances
&elq= 7 hosts, 111 total instances

# comScore Digital Analytix:

# http://www.dailymail.co.uk/sport/rugbyunion/article-5082539/France-23-28-New-Zealand-Blacks-French.html?ITO=1490&ns_mchannel=rss&ns_campaign=1490
# http://www.hotstar.com/tv/cineplay/13080?ns_mchannel=Article&ns_source=Scroll&ns_campaign=Cineplay&ns_linkname=CineplayShowPage&ns_fee=0

&ns_campaign= 6 hosts, 97 total instances
&ns_mchannel= 5 hosts, 92 total instances
&ns_source=
&ns_linkname=
&ns_fee=

# suspicious but probably too generic

# https://www.cray.com/?leadsource=website&srcdes=seagate&campaign=7010b0000018kLW
&campaign= 15 hosts, 9072 total instances

# https://wordpress.com/create/?utm_source=bing&utm_campaign=WordPress-Generic-Exact-US-GP&utm_medium=cpc&keyword=wordpress&creative=9925335912&campaignid=128065278&adgroupid=3099786316&matchtype=e&device=c&network=o
&campaignid= 6 hosts, 74 total instances
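
To make the intent concrete, here is a minimal Python sketch of stripping the tokens listed above from a URL. It is only an illustration, not how url-tracking-stripper is implemented: the names `STRIP_EXACT`, `STRIP_PREFIXES`, and `strip_tracking` are hypothetical, treating `utm_` as a prefix is my guess from the note above, and the "too generic" `campaign`/`campaignid` are deliberately left out. Note that `urlencode` re-encodes the surviving values, so the output may differ cosmetically from the input (for example `%20` becoming `+`).

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Exact parameter names taken from the list above (case-sensitive).
STRIP_EXACT = {
    "gclsrc", "dclid",
    "CFID", "CFTOKEN", "PHPSESSID",
    "WT.mc_id", "WT.srch", "WT.ac", "WT.qs_osrc", "WT.pn_sku",
    "elqTrackId", "elqaid", "elqat", "elqCampaignId", "elq",
    "ns_campaign", "ns_mchannel", "ns_source", "ns_linkname", "ns_fee",
}
# Assumption: anything starting with utm_ is tracking.
STRIP_PREFIXES = ("utm_",)


def strip_tracking(url):
    """Return `url` with the listed tracking parameters removed."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name not in STRIP_EXACT and not name.startswith(STRIP_PREFIXES)
    ]
    # An empty query makes urlunsplit drop the trailing "?" as well.
    return urlunsplit(parts._replace(query=urlencode(kept)))
```

For example, running it on the mozilla.org URL above should leave only `f=30&ref=producthunt` in the query string.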

wumpus avatar May 16 '18 20:05 wumpus

Hi @wumpus, and thanks for the issue and the excellent supporting data! Some of these look like no-brainers to add to the core set of trackers to block, while others look a little more dangerous.

I'm in the midst of working on a system to allow users to add/remove their own trackers, in which case I'd be far more willing to put many of these into the defaults. If I get stalled out on that update, I'll probably just add them to a minor update when I get an hour or so to play with and test them.

If you don't see any motion on this in a week or so, please prod me. Thanks again!

newhouse avatar May 16 '18 20:05 newhouse

Just noticed this one, a little googling says it's been around for a while, and that it's common enough that some reddit subs have banned using it:

https://www.youtube.com/attribution_link?a=dRBqlLWtf5U&u=%2Fwatch%3Fv%3Dpogq2tZFKKo%26feature%3Dshare

It's not just a token to strip, though. Normally only Amazon designs URLs this poorly!
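
Handling this one means unwrapping the `u` parameter rather than deleting a token. A rough Python sketch, assuming the format always matches the example above (I have not checked other variants); `unwrap_attribution_link` is a made-up name:

```python
from urllib.parse import urlsplit, parse_qs, urljoin


def unwrap_attribution_link(url):
    """If `url` is a YouTube attribution_link, return the URL it wraps;
    otherwise return `url` unchanged. Hypothetical helper."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    is_youtube = host == "youtube.com" or host.endswith(".youtube.com")
    if not is_youtube or parts.path != "/attribution_link":
        return url
    # parse_qs percent-decodes the value, giving e.g. "/watch?v=...&feature=share"
    target = parse_qs(parts.query).get("u")
    if not target:
        return url
    # Resolve the (usually relative) target against the original URL.
    return urljoin(url, target[0])
```

The unwrapped result still carries `feature=share`, so it would need a second pass through whatever strips ordinary tokens.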

wumpus avatar May 21 '18 05:05 wumpus