Create new ignoreset for excluding any kind of non-visible tracking code or analytics code

Open Asparagirl opened this issue 9 years ago • 3 comments

Crawls of big websites can be slowed down by the sheer amount of tracking and analytics code crud on every page. It would be nice to have an optional ignoreset that skips all of it, since it's not usually visible anyway. This ignoreset would be kind of like Ghostery or uBlock Origin, but for ArchiveBot.

For example:

  • google-analytics.com
  • scorecardresearch.com
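
As a rough sketch of what such an ignoreset could look like, the snippet below turns a list of tracker domains into URL regexes and checks URLs against them. The set name `notracker`, the helper names, and the JSON layout (`name` / `type` / `patterns`) are assumptions for illustration, not ArchiveBot's confirmed format; check the existing ignore sets in the repository before copying it.

```python
import json
import re

# Example tracker domains; a real set would be much longer.
TRACKER_DOMAINS = ["google-analytics.com", "scorecardresearch.com"]


def domain_to_pattern(domain):
    """Build a regex matching the domain and any subdomain, on http or https."""
    return r"^https?://([^/]+\.)?" + re.escape(domain) + r"(:\d+)?(/|$)"


def build_ignore_set(name, domains):
    """Assemble a dict in an assumed (hypothetical) ignore-set layout."""
    return {
        "name": name,
        "type": "ignore_patterns",  # assumed field, not verified against ArchiveBot
        "patterns": [domain_to_pattern(d) for d in domains],
    }


def is_ignored(url, ignore_set):
    """Return True if any pattern in the set matches the URL."""
    return any(re.search(p, url) for p in ignore_set["patterns"])


if __name__ == "__main__":
    print(json.dumps(build_ignore_set("notracker", TRACKER_DOMAINS), indent=2))
```

Anchoring each pattern at the scheme and host keeps it from matching pages that merely mention a tracker domain in their path or query string.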

Place to look for more examples:

  • https://github.com/gorhill/uBlock/wiki/Filter-lists-from-around-the-web

Asparagirl, Dec 08 '15 05:12

Add in an EU cookie pop-up remover, too?

https://github.com/r4vi/block-the-eu-cookie-shit-list/blob/master/filterlist.txt

Asparagirl, Dec 08 '15 05:12

Oh, and the various AddThis and ShareThis buttons on websites. We're already blocking some of them, but new ones keep popping up that we don't screen for yet. Will update this when I find some more concrete examples.

Asparagirl, Dec 08 '15 05:12

I'm not sure about this.

  1. This sort of stuff is a lot of requests, and it slows down grabs. But that's a problem that can be fixed by speeding up ArchiveBot; see e.g. #182.
  2. We have been grabbing ads, tracking cookies, etc. because it's all part of the page. (It's also fun-depressing to watch a page bloat through the Wayback Machine, and there is something satisfying about click fraud. Assuming ad networks are silly enough to count ArchiveBot crawls as impressions.)
  3. #182 etc. aren't likely to happen for a while, and I guess there is not really much harm in having these sorts of things in a notracker ignore set (I would definitely not want them in global). I wouldn't want notracker to become a reflex, though.

hannahwhy, Dec 09 '15 04:12