Avoid speculative links extraction for meta fields known not to contain links
Following this report of a URL being constructed from <meta> elements:
I'm using Heritrix 3.3.0-SNAPSHOT and see some strange behavior in the link extraction. This is one example in crawl.log:
2018-12-21T04:07:03.874Z 404 7161 https://stitch-maps.com/news/2018/10/twofer/Stitch-Maps.com RLX https://stitch-maps.com/news/2018/10/twofer/ text/html #116 20181219040702090+1782 sha1:K7HLTQ7SFI4KAQN3NVAO4OJ4UBYT3FGE - -

There isn't any link to the crawled URL on the given source page, so it seems like the Facebook tags on the source page have something to do with it:
<meta property="og:url" content="http://stitch-maps.com/news/2018/10/twofer/"/> <meta property="og:site_name" content="Stitch-Maps.com"/>Isn't it a bug, that heritrix combined these two urls to https://stitch-maps.com/news/2018/10/twofer/Stitch-Maps.com?
However, looking at the code in question, it appears that ExtractorHTML extracts candidate links from the content="..." attribute of any <meta> element except name="robots" and http-equiv="refresh":
https://github.com/internetarchive/heritrix3/blob/a83167619604926b1c8aebfef5e21271ad64eeaa/modules/src/main/java/org/archive/modules/extractor/ExtractorHTML.java#L990-L996
I think in general this won't happen with textual content attributes, but in this case the domain-name form appears to cause the value to be judged isVeryLikelyUri(...) == true.
https://github.com/internetarchive/heritrix3/blob/05811705ed996122bea1f4e034c1ed5f7a07240f/commons/src/main/java/org/archive/util/UriUtils.java#L394-L469
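To illustrate the heuristic (untested; the expected results are inferred from the report above rather than from running a particular Heritrix build, and heritrix-commons must be on the classpath):

```java
import org.archive.util.UriUtils;

public class HeuristicDemo {
    public static void main(String[] args) {
        // Presumably judged a likely URI because of its dotted,
        // domain-name-like shape (this is what the report suggests):
        System.out.println(UriUtils.isVeryLikelyUri("Stitch-Maps.com"));

        // Ordinary textual meta content should presumably not pass:
        System.out.println(UriUtils.isVeryLikelyUri("A knitting news site"));
    }
}
```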
Hence, I'm not sure how often this problem will really turn up - it may not be worth worrying about.
However, for common properties that are known not to be used for absolute or relative URLs of any sort, the ExtractorHTML class could be modified to skip this speculative link extraction.
Apparently this happens a lot with Facebook's og:* meta properties.
Perhaps, given the change in usage of these fields in recent years, it's time to change the default behaviour to avoid this speculative link extraction?
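One possible shape for such a change, as a minimal sketch only: the skip set below is illustrative rather than an agreed-upon list, and shouldSpeculativelyExtract is a hypothetical helper, not an existing ExtractorHTML method.

```java
import java.util.Set;

public class MetaSkipSketch {
    // Hypothetical skip list: meta names/properties known to carry plain
    // text or identifiers, not absolute or relative URLs of any sort.
    private static final Set<String> NON_LINK_META = Set.of(
            "og:site_name", "og:title", "og:description",
            "twitter:domain", "twitter:title",
            "publisher", "author", "keywords", "description");

    static boolean shouldSpeculativelyExtract(String nameOrProperty, String content) {
        if (nameOrProperty == null || content == null) {
            return false;
        }
        // Skip speculative extraction for fields known not to hold links;
        // everything else would still go through isVeryLikelyUri(...) as today.
        return !NON_LINK_META.contains(nameOrProperty.toLowerCase());
    }

    public static void main(String[] args) {
        // The case reported above would no longer be extracted:
        System.out.println(shouldSpeculativelyExtract("og:site_name", "Stitch-Maps.com")); // false
        // A field that genuinely holds a URL would still be considered:
        System.out.println(shouldSpeculativelyExtract("og:url", "http://stitch-maps.com/")); // true
    }
}
```

Note that og:url is deliberately not in the skip set, since that property really does hold a URL.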
This really happens very often, and fixing it would save a lot of bandwidth and trouble. E.g., when crawling www.klausenstein.at, an automatic abuse report is generated by this host because of this line in the page source:
<meta name="publisher" content="iNetWorker.at"/>
This causes Heritrix to request http://www.klausenstein.at/iNetWorker.at, which is interpreted as a crawler trap and results in an abuse report. We have faced lots of similar situations with something like
<meta name="publisher" content="domain.com"/> ...
Unfortunately, the problems keep increasing; this tag also causes trouble:
<meta name="twitter:domain" content="Drivingthenation.com" />
It is placed on every page of the domain and generates an additional invalid request (404) of the form "current URL + Drivingthenation.com" for every single page fetched, which leads to thousands of extra requests with a 404 return code. For instance, www.drivingthenation.com/category/automobilesandenergy/ "links" to www.drivingthenation.com/category/automobilesandenergy/Drivingthenation.com, and so on. None of these "linked" pages exist.
It would be very helpful if a solution could be found for this problem in the near future; these incorrectly extracted URLs cause great frustration for webmasters. It is always a content="domain.com" attribute, and such a value is almost never a link.
In my opinion, this URL-guessing approach of parsing JavaScript content must die completely. It easily causes hundreds of not-found errors per minute, which often triggers alerts. Whoever thought this was a good approach has probably never hosted or monitored anything.