Alex Osborne
Another idea: perhaps the full scope should be re-evaluated after the response header is received. That would mean putting a content-type decide rule in the normal scope would...
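For reference, a rough sketch of the kind of rule I mean (untested; the class name is made up, and I'm assuming `PredicatedDecideRule` and `CrawlURI.getContentType()` behave the way I remember):

```java
import org.archive.modules.CrawlURI;
import org.archive.modules.deciderules.PredicatedDecideRule;

/**
 * Illustrative only: applies the configured decision to URIs whose
 * response Content-Type matches a regex. Only meaningful if evaluated
 * after the response header has been received, since the content type
 * isn't known before fetch.
 */
public class ContentTypeRegexDecideRule extends PredicatedDecideRule {
    private String regex = ".*";

    public void setRegex(String regex) {
        this.regex = regex;
    }

    @Override
    protected boolean evaluate(CrawlURI curi) {
        String contentType = curi.getContentType();
        return contentType != null && contentType.matches(regex);
    }
}
```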
Unfortunately builds.archive.org is not currently publicly available over HTTPS. I merged the first part of this though in bc6a15b41dd7d5b48f51b8ee5fb40884717df276.
Since we haven't heard back from IA on this, here's one idea for a short-term solution that doesn't require any new infrastructure: #433. In the long term, let's maybe...
This seems reasonable, but I'm not familiar enough with this module to review it confidently. I'm hoping someone more knowledgeable will comment; otherwise I'll merge it in a few...
I don't fully understand the question, but Heritrix doesn't currently have a fetch module for `data:` URIs, so I think adding them to scope would currently do nothing. I don't...
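For context, a fetch module for `data:` URIs wouldn't need to touch the network at all; it would basically just decode the URI itself. Something like this plain-Java sketch (not actual Heritrix code; the class and method names are hypothetical):

```java
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

/** Hypothetical helper: decodes the body of a data: URI (RFC 2397). */
public class DataUriDecoder {
    public static byte[] decode(String dataUri) {
        if (!dataUri.startsWith("data:")) {
            throw new IllegalArgumentException("not a data: URI");
        }
        int comma = dataUri.indexOf(',');
        if (comma < 0) {
            throw new IllegalArgumentException("malformed data: URI");
        }
        String meta = dataUri.substring(5, comma); // e.g. "text/plain;base64"
        String payload = dataUri.substring(comma + 1);
        if (meta.endsWith(";base64")) {
            return Base64.getDecoder().decode(payload);
        }
        // Non-base64 payloads are percent-encoded text.
        return URLDecoder.decode(payload, StandardCharsets.UTF_8)
                .getBytes(StandardCharsets.UTF_8);
    }
}
```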
If the sites you are trying to crawl cannot be resolved through (local) DNS, then Heritrix is currently unable to archive them. See issue #211 for discussion of the reason for...
Actually re-reading this - the sites you're having problems with are public internet sites? Then the dns-over-https workaround might actually work for you.
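(For anyone wanting to sanity-check the workaround manually: a DNS-over-HTTPS lookup is just an HTTPS request to a public resolver's JSON endpoint. A rough Java 11+ sketch is below, using Google's resolver purely as an example; this is not what Heritrix itself does, and not necessarily what #211 proposes.)

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

/** Minimal DNS-over-HTTPS (JSON API) lookup, for illustration only. */
public class DohLookup {
    public static void main(String[] args) throws Exception {
        String name = args.length > 0 ? args[0] : "example.org";
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(
                URI.create("https://dns.google/resolve?name=" + name + "&type=A"))
                .header("Accept", "application/dns-json")
                .build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        // The response body is JSON with an "Answer" array of A records.
        System.out.println(response.body());
    }
}
```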
I've seen [some speculation](https://access.redhat.com/security/cve/CVE-2021-44228) that the log4j 1 JMS appender may also be vulnerable, but this would require a Heritrix user to have explicitly configured it. Note that software widely...
Adding the following to ExtractorHtmlTest:

```java
public void test() throws IOException {
    String url = "http://web.archive.org/web/20180830083248id_/http://www.haggmark.dk/solgt/oversigt";
    CrawlURI curi = new CrawlURI(UURIFactory.getInstance(url));
    String content = IOUtils.toString(new URL(url).openStream());
    getExtractor().extract(curi, content);
    CrawlURI[] links...
```
Yep. Do you mean there's already a pull request for this? I couldn't find it. Could you link it?