Alex Osborne
Commit 9d73df3 added support for storing arbitrary extra CDXJ fields using a CBOR-based record encoding, which can be enabled with `--index-version 5`. This is still experimental and a little more...
Possibly. I'm going to leave this open as the current matchType=range implementation isn't quite ideal for this use case. I think we'd also want to: * specify an end key...
Oh! Weird. Hm I suppose a way to do that would be to record not just the last applied sequence number but also a count of the number of writebatches...
That's odd. It's supposed to read the input incrementally. That said I think most people have been using it in an incremental fashion with one POST per WARC processed. Therefore...
Also, maybe check curl's memory usage. I seem to recall curl buffering the input request in memory rather than streaming it, in which case we might also need...
Ah. Yeah, that doesn't look like a Java memory exhaustion (which would be some variant on "OutOfMemoryError") but rather a C++ allocation failing which definitely hints at RocksDB as the...
The first one. CRLF can appear only if immediately followed by SP or HT. This is called line folding. This definition was inherited from the HTTP/1.1 RFC 2616 so you...
Yes. I haven't seen it used in real WARC files in the wild, but a fully compliant parser should support it. From what I've seen, many (but not all) parsers...
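To make the line folding rule concrete, here is a minimal sketch of how a parser might unfold header lines before splitting them into fields. It assumes the whole header block is already in memory as bytes, and it replaces each fold with a single space, which RFC 2616 permits. The function name `unfold_headers` and the `X-Comment` field are made up for illustration and don't come from any particular WARC library.

```python
import re

# A CRLF that is immediately followed by one or more SP or HT characters
# is a fold (a continuation of the previous line), not a line terminator.
_FOLD = re.compile(rb"\r\n[ \t]+")


def unfold_headers(raw: bytes) -> list[bytes]:
    """Split a header block into logical lines, joining folded continuations.

    `raw` is assumed to contain only the header block (no record body).
    """
    # Collapse each fold into a single space.
    unfolded = _FOLD.sub(b" ", raw)
    # Any CRLF that remains is a real line terminator; drop empty trailing lines.
    return [line for line in unfolded.split(b"\r\n") if line]


# Hypothetical header block with one folded field.
example = (
    b"WARC-Type: response\r\n"
    b"X-Comment: first part\r\n"
    b"\tsecond part\r\n"
    b"Content-Length: 0\r\n"
    b"\r\n"
)
lines = unfold_headers(example)
```

A parser that simply splits on CRLF would instead see `\tsecond part` as a separate, malformed header line.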
I tested this on a site we've been having trouble with (https://www.aucyberexplorer.com.au/). Unfortunately this site uses JavaScript modules that don't start with `import` or `export` so the `is_module` check doesn't...
The regex used to strip www{number} is [surt.IAURLCanonicalizer._RE_WWWDIGITS](https://github.com/internetarchive/surt/blob/6934c321b3e2f66af9c001d882475949f00570c5/surt/IAURLCanonicalizer.py#L127C1-L127C14). As a workaround, it is possible to monkey patch it to `www\.`, e.g. for cdxj-indexer: ``` $ python3 -c 'import re,...
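
To show what changing the pattern does to a hostname, here is a standalone sketch using only the `re` module. The two patterns and the `strip_prefix` helper are assumptions for illustration: they are meant to resemble a "www plus optional digits" prefix match, not to reproduce surt's actual code, so check the linked source for the real definition.

```python
import re

# Assumed shape of the digit-stripping pattern (illustrative, not copied from surt).
RE_WWWDIGITS = re.compile(r"www\d*\.")
# Narrower replacement that only matches a plain "www." prefix.
RE_WWW_ONLY = re.compile(r"www\.")


def strip_prefix(host: str, pattern: re.Pattern) -> str:
    """Remove the prefix matched at the start of `host`, if there is one."""
    m = pattern.match(host)
    return host[m.end():] if m else host


with_digits = strip_prefix("www2.example.org", RE_WWWDIGITS)
www_only = strip_prefix("www2.example.org", RE_WWW_ONLY)
plain = strip_prefix("www.example.org", RE_WWW_ONLY)
```

With the narrower pattern, `www2.example.org` keeps its own key instead of being merged with `example.org`, while a plain `www.` prefix is still stripped.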