Avoid speculative links extraction for meta fields known not to contain links
Following this report of a URL being constructed from <meta> elements:
I'm using Heritrix 3.3.0-SNAPSHOT and see some strange behavior in the link extraction. This is one example in crawl.log:
2018-12-21T04:07:03.874Z 404 7161 https://stitch-maps.com/news/2018/10/twofer/Stitch-Maps.com RLX https://stitch-maps.com/news/2018/10/twofer/ text/html #116 20181219040702090+1782 sha1:K7HLTQ7SFI4KAQN3NVAO4OJ4UBYT3FGE - -

There isn't any link to the crawled URL on the given source page, so it seems like the Facebook tags on the source page have something to do with it:
<meta property="og:url" content="http://stitch-maps.com/news/2018/10/twofer/"/> <meta property="og:site_name" content="Stitch-Maps.com"/>Isn't it a bug, that heritrix combined these two urls to https://stitch-maps.com/news/2018/10/twofer/Stitch-Maps.com?
However, looking at the code in question, it appears that ExtractorHTML extracts candidate links from the content="..." attribute of any <meta> element except name="robots" and http-equiv="refresh":
https://github.com/internetarchive/heritrix3/blob/a83167619604926b1c8aebfef5e21271ad64eeaa/modules/src/main/java/org/archive/modules/extractor/ExtractorHTML.java#L990-L996
I think in general this won't happen with textual content attributes, but in this case the domain-name form appears to cause the value to be judged isVeryLikelyUri(...) == true.
https://github.com/internetarchive/heritrix3/blob/05811705ed996122bea1f4e034c1ed5f7a07240f/commons/src/main/java/org/archive/util/UriUtils.java#L394-L469
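To illustrate the heuristic (untested; the expected results are inferred from the report above rather than from running a particular Heritrix build, and heritrix-commons must be on the classpath):

```java
import org.archive.util.UriUtils;

public class HeuristicDemo {
    public static void main(String[] args) {
        // Presumably judged a likely URI because of its dotted,
        // domain-name-like shape (this is what the report suggests):
        System.out.println(UriUtils.isVeryLikelyUri("Stitch-Maps.com"));

        // Ordinary textual meta content should presumably not pass:
        System.out.println(UriUtils.isVeryLikelyUri("A knitting news site"));
    }
}
```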
Hence, I'm not sure how often this problem will really turn up - it may not be worth worrying about.
However, for common properties that are known not to be used for absolute or relative URLs of any sort, the ExtractorHTML class could be modified to skip this speculative link extraction.
Apparently this happens a lot with Facebook's og:* meta properties.
Perhaps, given the change in usage of these fields in recent years, it's time to change the default behaviour to avoid this speculative link extraction?
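One possible shape for such a change, as a minimal sketch only: the skip set below is illustrative rather than an agreed-upon list, and shouldSpeculativelyExtract is a hypothetical helper, not an existing ExtractorHTML method.

```java
import java.util.Set;

public class MetaSkipSketch {
    // Hypothetical skip list: meta names/properties known to carry plain
    // text or identifiers, not absolute or relative URLs of any sort.
    private static final Set<String> NON_LINK_META = Set.of(
            "og:site_name", "og:title", "og:description",
            "twitter:domain", "twitter:title",
            "publisher", "author", "keywords", "description");

    static boolean shouldSpeculativelyExtract(String nameOrProperty, String content) {
        if (nameOrProperty == null || content == null) {
            return false;
        }
        // Skip speculative extraction for fields known not to hold links;
        // everything else would still go through isVeryLikelyUri(...) as today.
        return !NON_LINK_META.contains(nameOrProperty.toLowerCase());
    }

    public static void main(String[] args) {
        // The case reported above would no longer be extracted:
        System.out.println(shouldSpeculativelyExtract("og:site_name", "Stitch-Maps.com")); // false
        // A field that genuinely holds a URL would still be considered:
        System.out.println(shouldSpeculativelyExtract("og:url", "http://stitch-maps.com/")); // true
    }
}
```

Note that og:url is deliberately not in the skip set, since that property really does hold a URL.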
This really happens very often, and fixing it would save a lot of bandwidth and trouble. E.g., when crawling www.klausenstein.at, an automatic abuse report is generated by this host because of this line in the page source:
<meta name="publisher" content="iNetWorker.at"/>
This causes Heritrix to request http://www.klausenstein.at/iNetWorker.at, which is interpreted as a crawler trap and results in an abuse report. We have faced lots of similar situations with something like
<meta name="publisher" content="domain.com"/> ...
Unfortunately, the problems keep increasing; this tag also causes trouble:
<meta name="twitter:domain" content="Drivingthenation.com" />
It is placed on every page of the domain and generates an additional invalid request (404) of the form "current URL + Drivingthenation.com" for every single page fetched, which leads to thousands of extra requests with a 404 return code. For instance, www.drivingthenation.com/category/automobilesandenergy/ "links" to www.drivingthenation.com/category/automobilesandenergy/Drivingthenation.com, and so on. None of these "linked" pages exist.
It would be very helpful if a solution could be found for this problem in the near future; these incorrectly extracted URLs cause great frustration for webmasters. It is always a content="domain.com" attribute, and such a value is almost never a link.
In my opinion, this URL-guessing approach of parsing JavaScript content must die completely. It easily causes hundreds of not-found errors per minute, which often triggers alerts. Whoever thought this was a good approach has probably never hosted or monitored anything.