Alex Osborne
Another idea: perhaps the full scope should be re-evaluated after the response header is received. That would mean putting a content-type decide rule in the normal scope would...
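For reference, a rough sketch of the kind of rule I mean (untested; the class name is made up, and I'm assuming `PredicatedDecideRule` and `CrawlURI.getContentType()` behave the way I remember):

```java
import org.archive.modules.CrawlURI;
import org.archive.modules.deciderules.PredicatedDecideRule;

/**
 * Illustrative only: applies the configured decision to URIs whose
 * response Content-Type matches a regex. Only meaningful if evaluated
 * after the response header has been received, since the content type
 * isn't known before fetch.
 */
public class ContentTypeRegexDecideRule extends PredicatedDecideRule {
    private String regex = ".*";

    public void setRegex(String regex) {
        this.regex = regex;
    }

    @Override
    protected boolean evaluate(CrawlURI curi) {
        String contentType = curi.getContentType();
        return contentType != null && contentType.matches(regex);
    }
}
```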
Unfortunately builds.archive.org is not currently publicly available over HTTPS. I merged the first part of this though in bc6a15b41dd7d5b48f51b8ee5fb40884717df276.
Since we haven't heard back from IA on this, here's one idea for a short-term solution that doesn't require any new infrastructure: #433. In the long term, let's maybe...
This seems reasonable, but I'm not familiar enough with this module to review it confidently. I'm hoping someone more knowledgeable will comment; otherwise I'll merge it in a few...
I don't fully understand the question, but Heritrix doesn't currently have a fetch module for `data:` URIs, so I think adding them to scope would currently do nothing. I don't...
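For context, a fetch module for `data:` URIs wouldn't need to touch the network at all; it would basically just decode the URI itself. Something like this plain-Java sketch (not actual Heritrix code; the class and method names are hypothetical):

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

/** Hypothetical helper: decodes the body of a data: URI (RFC 2397). */
public class DataUriDecoder {
    public static byte[] decode(String dataUri) {
        if (!dataUri.startsWith("data:")) {
            throw new IllegalArgumentException("not a data: URI");
        }
        int comma = dataUri.indexOf(',');
        if (comma < 0) {
            throw new IllegalArgumentException("malformed data: URI");
        }
        String meta = dataUri.substring(5, comma); // e.g. "text/plain;base64"
        String payload = dataUri.substring(comma + 1);
        if (meta.endsWith(";base64")) {
            return Base64.getDecoder().decode(payload);
        }
        // Non-base64 payloads are percent-encoded text.
        return URLDecoder.decode(payload, StandardCharsets.UTF_8)
                .getBytes(StandardCharsets.UTF_8);
    }
}
```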
If the sites you are trying to crawl cannot be resolved through (local) DNS, then Heritrix is currently unable to archive them. See issue #211 for discussion of the reason for...
Actually re-reading this - the sites you're having problems with are public internet sites? Then the dns-over-https workaround might actually work for you.
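(For anyone wanting to sanity-check the workaround manually: a DNS-over-HTTPS lookup is just an HTTPS request to a public resolver's JSON endpoint. A rough Java 11+ sketch is below, using Google's resolver purely as an example; this is not what Heritrix itself does, and not necessarily what #211 proposes.)

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

/** Minimal DNS-over-HTTPS (JSON API) lookup, for illustration only. */
public class DohLookup {
    public static void main(String[] args) throws Exception {
        String name = args.length > 0 ? args[0] : "example.org";
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(
                URI.create("https://dns.google/resolve?name=" + name + "&type=A"))
                .header("Accept", "application/dns-json")
                .build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        // The response body is JSON with an "Answer" array of A records.
        System.out.println(response.body());
    }
}
```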
I've seen [some speculation](https://access.redhat.com/security/cve/CVE-2021-44228) that the log4j 1 JMS appender may also be vulnerable, but this would require a Heritrix user to have explicitly configured it. Note that software widely...
Adding the following to ExtractorHtmlTest:

```java
public void test() throws IOException {
    String url = "http://web.archive.org/web/20180830083248id_/http://www.haggmark.dk/solgt/oversigt";
    CrawlURI curi = new CrawlURI(UURIFactory.getInstance(url));
    String content = IOUtils.toString(new URL(url).openStream());
    getExtractor().extract(curi, content);
    CrawlURI[] links...
```
Yep. Do you mean there's already a pull request for this? I couldn't find it. Could you link it?