Andy Jackson comments

Results: 167 comments of Andy Jackson

Refetching of failed URLs based on HTTP status codes/content

In case it helps, this could be done using an additional `Processor` designed to be placed near the end of the fetch chain. It could check the status and/or the...
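As a rough illustration of the kind of status check such a processor might perform, here is a minimal, self-contained sketch. It does not use the Heritrix API: the class `RefetchDecision`, the method `shouldRefetch`, the retry budget and the set of retryable status codes are all hypothetical, and content-based checks are left out.

```java
import java.util.Set;

// Hypothetical helper, not part of Heritrix: decides whether a fetched URI
// looks like a transient failure that is worth another attempt.
final class RefetchDecision {

    // Assumed set of "retryable" HTTP status codes; adjust for the crawl at hand.
    private static final Set<Integer> RETRYABLE = Set.of(429, 500, 502, 503, 504);

    private RefetchDecision() {
        // static helper only, no instances
    }

    // Returns true when the status is retryable and the retry budget is not spent.
    static boolean shouldRefetch(int httpStatus, int attemptsSoFar, int maxAttempts) {
        if (attemptsSoFar >= maxAttempts) {
            return false; // give up once the retry budget is used
        }
        return RETRYABLE.contains(httpStatus);
    }
}
```

A real processor would still need to read the status from the fetched URI and re-schedule it through the crawler's own mechanisms, which this sketch does not attempt to show.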

Refetching of failed URLs based on HTTP status codes/content

Seems like a reasonable request, so I've tagged this issue. But be aware this project is not very well resourced so I can't guarantee how quickly we'll review things.

Heritrix treats inline images as relative URLs

I've been attempting to create an `ExtractorHTML` test case for this, and although it does extract the data URI it doesn't seem to use it as a relative path and...

Heritrix treats inline images as relative URLs

Bump. @csrster, are any more details available?

Maven build fails due to HTTP only upstream servers

As a workaround, it's possible to tell Maven to use the HTTP repo. For GitHub Actions, we use this: https://github.com/internetarchive/heritrix3/blob/04f958e987e6c8a3849740cf5ee69fce0a6d1896/.github/workflows/m2-settings.xml We've been talking to IA about updating the Maven endpoint,...

Maven build fails due to HTTP only upstream servers

To remove the dependency on the IA build server, we need at least:

```
com.anotherbigidea:javaswf:jar:CVS-SNAPSHOT-1
com.esotericsoftware:kryo:jar:1.01
com.esotericsoftware:reflectasm:jar:0.8
com.esotericsoftware:minlog:jar:1.2
```

(this list comes from taking the repo out and building -...

Merge ait-qa branch to re-synchronised development

Disk usage is not within je.maxDisk or je.freeDisk limits and write operations are prohibited

Bad requests with GTM

Long-lived cookies might have unintended consequences on a crawling session