JustAnotherArchivist
Well, it is, since it got regenerated in the target item. But the metadata doesn't reflect that. That would be a different bug though, I'd think.
That item has been darked since, so it's now impossible to reproduce. Is there another known example?
Here's a list of things that I think warcio should validate before even emitting an `ArcWarcRecord`: - [ ] Does the stream begin with `WARC/`, a supported spec version, and...
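The first check in that list could be sketched roughly as follows. This is only an illustration, not warcio code: `check_version_line` and `SUPPORTED_VERSIONS` are hypothetical names, and the set of supported versions is an assumption.

```python
import io
import re

# Assumption: the versions treated as "supported" here are illustrative only.
SUPPORTED_VERSIONS = {"1.0", "1.1"}

# A WARC record starts with a version line such as b"WARC/1.0\r\n".
_VERSION_LINE = re.compile(rb"WARC/(\d+\.\d+)\r\n")


def check_version_line(stream):
    """Return True if the next line of `stream` is a supported WARC version line."""
    line = stream.readline()
    match = _VERSION_LINE.fullmatch(line)
    if match is None:
        # Does not begin with `WARC/`, or the line is otherwise malformed.
        return False
    return match.group(1).decode("ascii") in SUPPORTED_VERSIONS


if __name__ == "__main__":
    print(check_version_line(io.BytesIO(b"WARC/1.0\r\nWARC-Type: response\r\n")))
    print(check_version_line(io.BytesIO(b"HTTP/1.1 200 OK\r\n")))
```

A check like this would run before any record object is constructed, so a stream that is not a WARC at all is rejected up front.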
I took a quick look at how this could be implemented. warcio uses the same code for WARC and HTTP header parsing, `warcio.statusandheaders.StatusAndHeadersParser`. Unfortunately, HTTP servers are horrible at complying with...
Right. But *should* the iterator be resumable if the underlying stream is not a valid WARC file like in the examples above? For digest verification, it makes sense to log...
The `Content-Type` header is optional, so omitting it would be one option. `application/octet-stream` also seems sensible to me. WARC is a byte-oriented file format, so any payload must also be...
I was bitten by this recently as well. In my tool, I'm writing a log file (using `logging`). Before the process exits, I [copy this log file into the WARC](https://git.kiska.pw/JustAnotherArchivist/qwarc/src/tag/v0.2.5/qwarc/warc.py#L192-L203)...
@ikreymer Indeed, short of making a full copy (as `ensure_digest` does when the length is unknown) with its significant performance and disk space impact for large records, it's impossible to...
The whitespace on the line with the `field-name` has never been significant semantically as far as I know. Neither the whitespace after the colon nor the one at the end...
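To illustrate the point about insignificant whitespace, here is a minimal sketch of a header-line splitter that discards whitespace around the value. `split_header_line` is a hypothetical helper, not part of warcio, and it assumes the optional whitespace consists only of spaces and tabs.

```python
def split_header_line(line):
    """Split 'name: value' into (name, value), ignoring whitespace around the value."""
    name, sep, value = line.partition(":")
    if not sep:
        raise ValueError("not a header line: %r" % (line,))
    # Assumption: only spaces and tabs count as optional whitespace here.
    return name, value.strip(" \t")


if __name__ == "__main__":
    # All three spellings yield the same field name and value.
    print(split_header_line("Content-Length: 42"))
    print(split_header_line("Content-Length:42"))
    print(split_header_line("Content-Length:   42  "))
```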
Yeah, it definitely would. Possibly some comparisons to the tars on PyPI as well. I haven't done anything regarding this so far.