JustAnotherArchivist
I'm currently uploading a large dataset (159 WARCs, just over 1 TiB) using `ia upload`. Because of the general stability of IA's S3 interface (#176 et al.), I was expecting...
Per WARC/1.0 spec section 5.9:
> The payload of an application/http block is its ‘entity-body’ (per [RFC2616]). The entity-body is the HTTP body *without transfer encoding* per [section 4.3 in...
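
To make "without transfer encoding" concrete, here is a minimal sketch of removing chunked transfer encoding from a body as it appeared on the wire. `decode_chunked` is a hypothetical helper written for illustration, not part of warcio or any WARC tool; it ignores chunk extensions and trailers.

```python
import io


def decode_chunked(raw: bytes) -> bytes:
    """Strip HTTP/1.1 chunked transfer encoding from a message body (sketch)."""
    stream = io.BytesIO(raw)
    body = bytearray()
    while True:
        size_line = stream.readline()
        if not size_line:
            raise ValueError("truncated chunked body")
        # The chunk size is hexadecimal; anything after ';' is a chunk extension.
        size = int(size_line.split(b";", 1)[0].strip(), 16)
        if size == 0:
            break  # last chunk; trailers (if any) are ignored in this sketch
        chunk = stream.read(size)
        if len(chunk) != size:
            raise ValueError("truncated chunk")
        body += chunk
        if stream.readline() != b"\r\n":
            raise ValueError("missing CRLF after chunk data")
    return bytes(body)


# Body as transferred (chunked) versus the body with the encoding removed.
wire_body = b"4\r\nWiki\r\n5\r\npedia\r\n0\r\n\r\n"
entity_body = decode_chunked(wire_body)
print(entity_body)
```

Read literally, the quoted sentence would make `entity_body` rather than `wire_body` the payload of such a record.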
I discovered today that warcio mangles the HTTP header data when it isn't pure ASCII. Specifically, I am dealing with a server that returns ISO-8859-1 headers. As far as I...
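
The general encoding problem can be shown with the standard library alone. This is a sketch with a made-up header line (`X-Title` is hypothetical) and does not reproduce warcio's actual code path: it only shows why ISO-8859-1 bytes survive a decode/encode round trip as ISO-8859-1 but not as UTF-8.

```python
# Hypothetical header line; 0xE9 is 'é' in ISO-8859-1.
raw_header = b"X-Title: caf\xe9\r\n"

# Strict UTF-8 decoding rejects the lone 0xE9 byte.
try:
    raw_header.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Lenient UTF-8 decoding succeeds, but the original byte is lost.
lossy = raw_header.decode("utf-8", errors="replace")
lossy_roundtrip = lossy.encode("utf-8")

# ISO-8859-1 maps every byte to exactly one code point, so it round-trips.
text = raw_header.decode("iso-8859-1")
roundtrip = text.encode("iso-8859-1")

print(utf8_ok, lossy_roundtrip == raw_header, roundtrip == raw_header)
```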
warcio fails to parse this valid WARC record correctly:
```python3
import gzip
import io
import warcio.archiveiterator

noise0 = b'WARC/1.1\r\nWARC-Record-ID: \r\nWARC-Type: warcinfo\r\nContent-Length: 0\r\nWARC-Date: 2021-07-04T17:52:55Z\r\n'
signal = b'WARC-Filename: "foo\\\nbar"\r\n'
noise1 = b'\r\n\r\n\r\n'
...
```

```python
import io
import warcio

output = io.BytesIO()
writer = warcio.warcwriter.WARCWriter(output, gzip = False)
payload = io.BytesIO()
payload.write(b'HTTP/1.1 200 OK\r\nDate: Thu, 27 May 2021 22:03:54 GMT\r\nContent-Length: 0\r\nX-custom: header with two...
```
`StatusAndHeaders` does not fare well when header fields are repeated. Here is a list of some problems I've found in such cases:

* `get_header` always returns the first value.
* ...
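
The first point can be illustrated without warcio, assuming headers are kept as an ordered list of `(name, value)` tuples. `get_first` and `get_all` are hypothetical helpers for this sketch, not warcio API: the first mimics a first-match lookup, the second returns every value of a repeated field.

```python
# Assumed representation: ordered (name, value) pairs, with one field repeated.
headers = [
    ("Set-Cookie", "a=1"),
    ("Content-Type", "text/html"),
    ("Set-Cookie", "b=2"),
]


def get_first(headers, name):
    """Return the first value for name (case-insensitive), or None."""
    name = name.lower()
    for key, value in headers:
        if key.lower() == name:
            return value
    return None


def get_all(headers, name):
    """Return all values for name (case-insensitive), in original order."""
    name = name.lower()
    return [value for key, value in headers if key.lower() == name]


print(get_first(headers, "set-cookie"))  # second cookie is silently dropped
print(get_all(headers, "set-cookie"))
```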
While trying to figure out which version of warcio first included some functionality (for specifying a minimum dependency version), I realised that this repository unfortunately has no version tags. It's...
warcio accepts various WARCs that are not actually valid. There is some validation on the beginning of the content, so it looks like the smallest possible content that passes is...
warcio uses a default `Content-Type` value for WARC records of `application/warc-record`. This MIME type is not documented or specified anywhere; the WARC spec only mentions `application/warc` as the MIME type...
As I understand it, irc-framework currently splits messages based on a fixed maximum length of the actual message. The default value for that [is only 350](https://github.com/kiwiirc/irc-framework/blob/dd9aa2edae8dacfee1f35013158ab7d3529215ef/src/client.js#L56), resulting in too frequent...
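
As a rough illustration in Python (not irc-framework's JavaScript), here is a sketch comparing a fixed 350-character split with a limit derived from the IRC protocol's 512-byte line limit. `split_fixed` and `privmsg_budget` are hypothetical helpers, the prefix and target are made up, and the budget calculation is my assumption about what a length-aware split could look like, not something irc-framework does.

```python
IRC_LINE_LIMIT = 512  # bytes per protocol line, including the trailing CRLF


def split_fixed(message, max_length=350):
    """Split message text into pieces of at most max_length characters."""
    return [message[i:i + max_length] for i in range(0, len(message), max_length)]


def privmsg_budget(prefix, target):
    """Bytes left for <text> in ':<prefix> PRIVMSG <target> :<text><CRLF>'."""
    overhead = len(f":{prefix} PRIVMSG {target} :".encode("utf-8")) + 2  # + CRLF
    return IRC_LINE_LIMIT - overhead


# ASCII-only text, so characters and bytes coincide in this sketch.
message = "x" * 800
fixed_parts = split_fixed(message)
budget = privmsg_budget("nick!user@example.host", "#channel")
budget_parts = split_fixed(message, budget)

print([len(p) for p in fixed_parts], budget, [len(p) for p in budget_parts])
```

With these made-up values, the same text needs three messages at the fixed limit but only two when the limit is computed from the actual line overhead.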