ArchiveBot
Create new ignoreset for excluding any kind of non-visible tracking code or analytics code
Crawls of big websites can be slowed down by the sheer amount of tracking and analytics crud on every page. It would be nice to have an optional ignoreset that skips all of it, since it usually isn't visible anyway. The ignoreset would work a bit like Ghostery or uBlock Origin, but for ArchiveBot.
For example:
- google-analytics.com
- scorecardresearch.com
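
A rough sketch of how such an ignoreset could be built from a list of domains. This is only an illustration: the `notracker` name, the dict layout, the helper functions, and the `tracker.example` domain are all assumptions for the example, not ArchiveBot's actual ignore-set format or code.

```python
import re

# Example domains only; tracker.example is a made-up placeholder.
TRACKER_DOMAINS = ["google-analytics.com", "tracker.example"]


def domain_to_pattern(domain):
    """Return a regex matching http(s) URLs on the domain or any subdomain."""
    return r"^https?://([^/]+\.)?" + re.escape(domain) + r"(:\d+)?(/|$)"


def build_ignore_set(name, domains):
    """Bundle the patterns under a name (hypothetical layout)."""
    return {"name": name, "patterns": [domain_to_pattern(d) for d in domains]}


def is_ignored(url, ignore_set):
    """True if any pattern in the ignore set matches the URL."""
    return any(re.search(p, url) for p in ignore_set["patterns"])


notracker = build_ignore_set("notracker", TRACKER_DOMAINS)
print(is_ignored("https://www.google-analytics.com/analytics.js", notracker))
print(is_ignored("https://example.com/page?ref=google-analytics.com", notracker))
```

Anchoring each pattern at the start of the URL keeps it from firing on pages that merely mention a tracker domain in a query string.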
Place to look for more examples:
- https://github.com/gorhill/uBlock/wiki/Filter-lists-from-around-the-web
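
Many of those lists use Adblock-style syntax, where a plain domain block looks like `||example.com^`. Here is a minimal sketch of pulling just those domain rules out of a list, assuming that syntax. The function name and sample text are made up for illustration, and everything else (cosmetic filters, exceptions, path rules) is skipped rather than translated.

```python
import re

# Matches simple domain-anchored rules such as "||example.com^",
# optionally followed by the "$third-party" option.
_DOMAIN_RULE = re.compile(r"^\|\|([a-z0-9.-]+)\^(\$third-party)?$", re.IGNORECASE)


def domains_from_filter_list(text):
    """Return the domains of simple '||domain^' rules, in order of appearance."""
    domains = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, "!" comments, and the "[Adblock Plus ...]" header.
        if not line or line.startswith("!") or line.startswith("["):
            continue
        match = _DOMAIN_RULE.match(line)
        if match:
            domains.append(match.group(1).lower())
    return domains


sample = """[Adblock Plus 2.0]
! made-up sample list
||google-analytics.com^
||tracker.example^$third-party
example.com##.cookie-banner
@@||allowed.example^
"""
print(domains_from_filter_list(sample))
```

The resulting domains would still need a human look before going into an ignoreset, since these lists are written for browsers and not for crawlers.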
Add in an EU cookie pop-up remover, too?
https://github.com/r4vi/block-the-eu-cookie-shit-list/blob/master/filterlist.txt
Oh, and the various AddThis and ShareThis buttons on websites. We're already blocking some of them, but there are new ones popping up we don't screen for yet. Will update this when I find some more concrete examples.
I'm not sure about this.
- This sort of stuff is a lot of requests, and it slows down grabs. But that's a problem that can be fixed by speeding up ArchiveBot; see e.g. #182.
- We have been grabbing ads, tracking cookies, etc. because it's all part of the page. (It's also fun-depressing to watch a page bloat through the Wayback Machine, and there is something satisfying about click fraud. Assuming ad networks are silly enough to count ArchiveBot crawls as impressions.)
- #182 etc. aren't likely to happen for a while, and I guess there is not really much harm in having these sorts of things in a `notracker` ignore set (I would definitely not want them in `global`). I wouldn't want `notracker` to become a reflex, though.