heritrix3

Bad requests with GTM

Open damien-git opened this issue 6 years ago • 4 comments

I have noticed a lot of bad requests from archive.org's crawler on our sites using Google Tag Manager. For instance:

/in.tag/11/10/2024/gtm.start/
/mouseup.dismiss/11/10/2024/gtm.start
/mousedown.dismiss/
/gtm.load/gtm.start/11/10/2024
/json/11/10/2024/gtm.start
/11/10/2024/gtm.js/

These are starting to add noticeable load on the server (which serves many sites).

I understand Heritrix is speculatively trying URLs based on the JavaScript code, which is known to sometimes result in 404s. But GTM is used on many websites, so these issues are bad for everybody. Could this speculation be improved to take Google's code into account? Alternatively, is there a way to disable that speculation with robots.txt?

damien-git avatar Feb 18 '19 17:02 damien-git

Hi @damien-git, you should probably drop the Internet Archive a note (mailto:[email protected]), as they may be able to tune the behaviour of their crawler.

In general, I personally do not recommend Heritrix users use the speculative JavaScript extractor at all. It seems to cause more trouble than it's worth.

I quite like the idea of tuning the crawl via robots.txt, but we should probably look at deprecating or improving the ExtractorJS or KnowledgableExtractorJS processors first.

If we can find out which extractor they are using, that might help.

anjackson avatar Feb 19 '19 11:02 anjackson

I've run a variant of ExtractorJS for years that lets me filter out the links it discovers using a set of regular expressions. These are applied before the links are turned into full URLs, making it a bit easier to target common false positives in JS libraries than it would be if we were doing the filtering in the scope. You also don't risk catching any URLs extracted via other (more reliable) means.

Looking at the above, I should probably filter out any links extracted via ExtractorJS containing "gtm."
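
To illustrate the filtering idea described above, here is a minimal Python sketch. It is not the actual ExtractorJS variant (which is not shown in this thread); the function names, the sample candidates, and the `example.org` base URL are all hypothetical. The point it shows is that the reject patterns run against the raw extracted strings, before they are resolved into full URLs.

```python
import re
from urllib.parse import urljoin

# Hypothetical reject list, matched against the raw string pulled out of the
# JavaScript source (not against the resolved URL).
REJECT_PATTERNS = [re.compile(r"gtm\.")]


def filter_candidates(candidates, reject_patterns=REJECT_PATTERNS):
    """Drop raw candidate links that match any reject pattern."""
    return [
        c for c in candidates
        if not any(p.search(c) for p in reject_patterns)
    ]


def resolve(candidates, page_url):
    """Filter first, then turn the surviving candidates into full URLs."""
    return [urljoin(page_url, c) for c in filter_candidates(candidates)]


# Assumed sample of strings a speculative extractor might have found.
raw = ["gtm.start", "gtm.js", "js/app.js"]
kept = resolve(raw, "http://example.org/11/10/2024/")
print(kept)
```

Because the filter only sees strings produced by the JS extractor, a legitimate link that happens to contain "gtm." but was found by another extractor would not be affected.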

kris-sigur avatar Feb 20 '19 07:02 kris-sigur

We (Akamai) are seeing a similar issue with sites that have our mPulse product enabled, which includes JavaScript in the page's HTML that looks like this:

var a=["ak.bpcip","ak.cport","..."];

This results in our customers' websites being crawled by numerous crawlers, on each page, for each of the 20+ elements of that array, e.g.:

http://website/foo/bar/ak.bpcip
http://website/foo/bar/ak.cport
... etc
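
A rough Python sketch of how this kind of false positive can arise is shown here. The regular expression is a deliberately naive assumption made for illustration and is not the pattern Heritrix or any other crawler actually uses; `speculative_links` is a hypothetical name. It treats any quoted, dot-separated token as a possible relative link and resolves it against the page URL.

```python
import re
from urllib.parse import urljoin

# Assumed heuristic: a quoted string made of word characters separated by
# dots "looks like" a file name or path, so it is treated as a link.
STRING_LITERAL = re.compile(r"""["']([\w-]+(?:\.[\w-]+)+)["']""")


def speculative_links(js_source, page_url):
    """Resolve every link-like string literal against the page URL."""
    candidates = STRING_LITERAL.findall(js_source)
    return [urljoin(page_url, c) for c in candidates]


js = 'var a=["ak.bpcip","ak.cport","..."];'
links = speculative_links(js, "http://website/foo/bar/")
print(links)
```

Under these assumptions, the two array entries that look like file names become requests under the current page's path, which matches the pattern of URLs reported above.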

nicjansma avatar Apr 18 '19 19:04 nicjansma

+1

I am using Google Tag Manager and the crawler is making many requests containing "/gtm.js".

poolerMF avatar Dec 31 '21 18:12 poolerMF