Alex Osborne
This repository is for the Heritrix web crawler software project. To contact the Internet Archive about services on archive.org please refer to the "If you have questions" section on their...
There's a Java port of Google's parser too (https://github.com/google/robotstxt-java/), but unfortunately it doesn't seem to be in Maven Central.
OutbackCDX has a partial workaround for this. If you run it with the --omit-self-redirects command-line option (or pass omitSelfRedirects=true in the query string) it will try to use the CDX...
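As a rough sketch of the query-string form, the snippet below builds a lookup URL with `omitSelfRedirects=true`. The base URL `http://localhost:8080/myindex` and the helper name `cdx_query_url` are assumptions for illustration only; substitute wherever your OutbackCDX index actually lives.

```python
from urllib.parse import urlencode


def cdx_query_url(base, url, omit_self_redirects=True):
    """Build a CDX lookup URL (hypothetical helper, not part of OutbackCDX)."""
    params = {"url": url}
    if omit_self_redirects:
        # Parameter name taken from the comment above.
        params["omitSelfRedirects"] = "true"
    return base + "?" + urlencode(params)


# Assumed local index location; adjust host, port and index name as needed.
print(cdx_query_url("http://localhost:8080/myindex", "http://example.org/"))
```

Passing the parameter per request like this should be handy when only some clients want the behaviour, whereas the command-line option applies it to the whole server.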
I had an attempt at documenting it while writing a [Java implementation](https://github.com/iipc/jwarc/blob/40a033b99b40fe3394d400c935b57fbf4e531eb9/src/org/netpreserve/jwarc/cdx/CdxRequestEncoder.java): http://iipc.github.io/warc-specifications/guidelines/cdx-non-get-requests/ There are some interesting quirks to it. JSON null is encoded as the string "None". It can also...
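To illustrate just the null quirk, here is a deliberately simplified sketch that turns the top-level fields of a JSON request body into query parameters. It is not the full encoding from the guideline (nesting, repeated keys and other value types are ignored), and the function name `json_body_to_query` is made up for this example. The point is only that stringifying a parsed null the way Python does by default yields "None".

```python
import json
from urllib.parse import urlencode


def json_body_to_query(body: str) -> str:
    """Flatten top-level JSON fields into a query string (simplified sketch)."""
    obj = json.loads(body)
    pairs = []
    for key, value in obj.items():
        # JSON null parses to Python None, and str(None) is "None".
        pairs.append((key, str(value)))
    return urlencode(pairs)


print(json_body_to_query('{"q": "warc", "cursor": null}'))
# prints: q=warc&cursor=None
```

An implementation in another language would have to special-case null to emit the same "None" string for its CDX keys to match.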
Pywb needs this patch to pass them through: https://github.com/nla/pywb/commit/2bb97fc7081a3260d6fdf7f2d248e0dd51dd6129
We run index version 5 in production and haven't had any issues so far. The only reason it's not the default yet is that the upgrade process needs a...
I've written up some notes about upgrading here: https://github.com/nla/outbackcdx/issues/117 If you have any further OutbackCDX questions please post them there as we've drifted off the topic of the POST canonicalization...
While standardizing the hierarchy by itself may be interesting for other use cases, in order to achieve the two goals that motivated the creation of WACZ the details of the...
~Ah, I think I misunderstood you. You're just saying you'd like to see versioning and fixity as features and suggesting that BagIt or OCFL could be added as structural layers...
On further thought, it seems reasonable that one could perform a logical crawl or capture job involving multiple tools, so perhaps it'd be better to have the opposite order: `/artifacts/{job}/{tool}/{tool-specific...