Andy Jackson
In case it helps, this could be done using an additional `Processor` designed to be placed near the end of the fetch chain. It could check the status and/or the...
Seems like a reasonable request, so I've tagged this issue. But be aware this project is not very well resourced so I can't guarantee how quickly we'll review things.
I've been attempting to create an `ExtractorHTML` test case for this, and although it does extract the data URI, it doesn't seem to use it as a relative path and...
Bump. @csrster, are any more details available?
As a workaround, it's possible to tell Maven to use the HTTP repo. For GitHub Actions, we use this: https://github.com/internetarchive/heritrix3/blob/04f958e987e6c8a3849740cf5ee69fce0a6d1896/.github/workflows/m2-settings.xml We've been talking to IA about updating the Maven endpoint,...
To remove the dependency on the IA build server, we need at least:

```
com.anotherbigidea:javaswf:jar:CVS-SNAPSHOT-1
com.esotericsoftware:kryo:jar:1.01
com.esotericsoftware:reflectasm:jar:0.8
com.esotericsoftware:minlog:jar:1.2
```

(this list comes from taking the repo out and building -...
Okay, I think that's synced up, but with HBase modules restored from the master branch.
Note that running out of space will corrupt any Berkeley DB instance, which is why the defaults assume there is at least 5GB of space available. I would _not_ recommend...
Hi @damien-git, you should probably [drop the Internet Archive a note](https://archive.org/about/contact.php), as they may be able to tune the behaviour of their crawler. In general, I personally do not...
Has there been a good analysis of the pros/cons of crawling with or without cookies being enabled? Maybe I should turn them off for most crawls? /cc @kris-sigur @nlevitt