JustAnotherArchivist
> list of headers which are allowed to be repeated by the HTTP standard

Yeah, about that... Here's the relevant part from RFC 2616 (in [section 4.2](https://tools.ietf.org/html/rfc2616#section-4.2)):

> Multiple message-header...
From the [discussion I linked](https://github.com/iipc/warc-specifications/issues/22), it seems like the intention was as written, not as implemented. So in that sense, I'd say it's not a bug in the standard. I...
@wumpus Yes, I want to play around with this further. I need to set up a test HTTP server that can produce all the different relevant combinations (unchunked+uncompressed [which should...
@wumpus Sounds great! I remember your request on the wpull repository from a few months ago. Great to hear that this project of yours is coming along nicely. :-) I...
Thanks, will contact you shortly. 1-pywb.warc.gz is arguably the worst because it preserves the Transfer-Encoding header while decoding the body. That means the data consumer will have to detect and...
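To make the difference between the transferred body and the decoded body concrete, here is a minimal sketch. The helpers `encode_chunked` and `decode_chunked` and the sample body are made up for illustration and are not taken from any of the tools discussed; the decoder ignores trailers and is not a full HTTP implementation.

```python
import hashlib


def encode_chunked(body: bytes, chunk_size: int = 5) -> bytes:
    """Hypothetical helper: wrap a body in HTTP/1.1 chunked transfer encoding."""
    out = bytearray()
    for i in range(0, len(body), chunk_size):
        chunk = body[i:i + chunk_size]
        # Each chunk is: hex length, CRLF, data, CRLF.
        out += b"%x\r\n" % len(chunk) + chunk + b"\r\n"
    out += b"0\r\n\r\n"  # terminating zero-length chunk, no trailers
    return bytes(out)


def decode_chunked(data: bytes) -> bytes:
    """Hypothetical helper: minimal chunked decoder (no trailer handling)."""
    out = bytearray()
    pos = 0
    while True:
        eol = data.index(b"\r\n", pos)
        # Drop any chunk extension after ';' before parsing the hex size.
        size = int(data[pos:eol].split(b";")[0], 16)
        if size == 0:
            break
        start = eol + 2
        out += data[start:start + size]
        pos = start + size + 2  # skip the chunk data and its trailing CRLF
    return bytes(out)


body = b"hello world"
on_the_wire = encode_chunked(body)

# The two digests differ, so a reader has to know which form was hashed.
digest_wire = hashlib.sha1(on_the_wire).hexdigest()
digest_decoded = hashlib.sha1(decode_chunked(on_the_wire)).hexdigest()
```

If a record stores the decoded bytes but keeps the `Transfer-Encoding: chunked` header, a consumer that trusts the header and runs something like `decode_chunked` on the stored body would be decoding data that is no longer chunked.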
Looks like 3-warcprox.warc.gz has an incorrect hash for the request record though: it is equal to the block hash when it should be `sha1('')` since there is no request body.
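For reference, a small sketch of what the digest of an empty body looks like. The function name `warc_sha1` is made up for illustration, and the `sha1:` prefix with Base32 encoding is assumed here as the common convention in WARC files, not something any particular tool is confirmed to emit.

```python
import base64
import hashlib


def warc_sha1(payload: bytes) -> str:
    """Hypothetical helper: SHA-1 of the payload as 'sha1:' + Base32 (assumed convention)."""
    digest = hashlib.sha1(payload).digest()
    return "sha1:" + base64.b32encode(digest).decode("ascii")


# Digest of a zero-length payload, i.e. sha1('').
EMPTY_PAYLOAD_DIGEST = warc_sha1(b"")
EMPTY_PAYLOAD_HEX = hashlib.sha1(b"").hexdigest()
```

A request record with no body would be expected to carry `EMPTY_PAYLOAD_DIGEST` as its payload digest, which differs from the block digest whenever the block contains request headers.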
#97 brought this to my attention again. Sorry, I haven't really had time the past few months to investigate further. I'm not sure what the best route forward is. I...
Yeah, I agree that this isn't the place. This issue turned from "warcio is writing invalid digests" to "(nearly) everything is writing standard-violating digests" very quickly. Later I didn't want...
Yep, that's exactly what I meant by "data on how the most common tools behave" above. I did start working on that back in April but then got distracted by...
I just got bitten by this as well. Another related issue is that it isn't possible to read both the raw stream and the decoded/content stream for a record. ---...