Invalid WARCs are silently accepted instead of raising an error
warcio accepts various WARCs that are not actually valid. There is some validation at the beginning of the content; the smallest content that passes appears to be just `WARC/1.0` (or another supported version), without even a line terminator. Truncations within headers or the payload are also silently ignored. As a result, such mutilated files also pass `warcio check`:
```shell
$ echo -n 'WARC/1.0' | gzip >test.warc.gz
$ zcat test.warc.gz | xxd
00000000: 5741 5243 2f31 2e30                      WARC/1.0
$ warcio check -v test.warc.gz
test.warc.gz
offset 0 WARC-Record-ID None None
  no digest to check
```
Here's a more thorough example. Every single one of these samples is invalid and should raise an exception on parsing (although arguably, for the ones where the truncation occurs at the beginning of or within the payload, it should only happen when reading the raw or decoded stream):
```python
import gzip
import io

import warcio.archiveiterator

samples = [
    b'WARC/1.0',
    b'WARC/1.0\r\n',
    b'WARC/1.0\r\nWARC-Record-ID: <urn:uuid:51',
    b'WARC/1.0\r\nWARC-Record-ID: <urn:uuid:51880269-0041-4f8f-ac34-57c75c151871>\r\n',
    b'WARC/1.0\r\nWARC-Record-ID: <urn:uuid:51880269-0041-4f8f-ac34-57c75c151871>\r\nContent-Length: 3\r\nWARC-Date: 2021-01-05T19:04:12Z\r\nWARC-Type: resource\r\n',
    b'WARC/1.0\r\nWARC-Record-ID: <urn:uuid:51880269-0041-4f8f-ac34-57c75c151871>\r\nContent-Length: 3\r\nWARC-Date: 2021-01-05T19:04:12Z\r\nWARC-Type: resource\r\n\r',
    b'WARC/1.0\r\nWARC-Record-ID: <urn:uuid:51880269-0041-4f8f-ac34-57c75c151871>\r\nContent-Length: 3\r\nWARC-Date: 2021-01-05T19:04:12Z\r\nWARC-Type: resource\r\n\r\n',
    b'WARC/1.0\r\nWARC-Record-ID: <urn:uuid:51880269-0041-4f8f-ac34-57c75c151871>\r\nContent-Length: 3\r\nWARC-Date: 2021-01-05T19:04:12Z\r\nWARC-Type: resource\r\n\r\nfo',
    b'WARC/1.0\r\nWARC-Record-ID: <urn:uuid:51880269-0041-4f8f-ac34-57c75c151871>\r\nContent-Length: 3\r\nWARC-Date: 2021-01-05T19:04:12Z\r\nWARC-Type: resource\r\n\r\nfoo',
    b'WARC/1.0\r\nWARC-Record-ID: <urn:uuid:51880269-0041-4f8f-ac34-57c75c151871>\r\nContent-Length: 3\r\nWARC-Date: 2021-01-05T19:04:12Z\r\nWARC-Type: resource\r\n\r\nfoo\r',
    b'WARC/1.0\r\nWARC-Record-ID: <urn:uuid:51880269-0041-4f8f-ac34-57c75c151871>\r\nContent-Length: 3\r\nWARC-Date: 2021-01-05T19:04:12Z\r\nWARC-Type: resource\r\n\r\nfoo\r\n',
    b'WARC/1.0\r\nWARC-Record-ID: <urn:uuid:51880269-0041-4f8f-ac34-57c75c151871>\r\nContent-Length: 3\r\nWARC-Date: 2021-01-05T19:04:12Z\r\nWARC-Type: resource\r\n\r\nfoo\r\n\r',
]

for sample in samples:
    f = io.BytesIO(gzip.compress(sample))
    it = iter(warcio.archiveiterator.WARCIterator(f))
    try:
        r = next(it)
    except Exception:
        # Something should be raised
        pass
    else:
        print(f'No error raised on {sample!r}')
```
Here's a list of things that I think warcio should validate before even emitting an `ArcWarcRecord`:

- [ ] Does the stream begin with `WARC/`, a supported spec version, and CRLF?
- [ ] Does every following line end with CRLF?
- [ ] Once an empty line has been read: is every mandatory header (`WARC-Record-ID`, `Content-Length`, `WARC-Date`, and `WARC-Type`) present with a valid value?
Then, upon reading from `raw_stream` (directly or via `content_stream`):

- [ ] Can enough bytes be read from the stream, in accordance with the `Content-Length` header?
- [ ] Is that data followed by CRLFCRLF?
If any of these points fail, an exception should be raised.
As far as I can tell, only the first point is partially implemented so far (and without the CRLF check).
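To make the checklist concrete, here's a rough sketch of what the pre-record checks could look like. This is hypothetical illustration code, not warcio's implementation; `validate_warc_header_block` and its behavior are my own invention, and it operates on a complete byte buffer rather than a stream for simplicity.

```python
import re

# Headers the WARC spec requires on every record.
MANDATORY = {'WARC-Record-ID', 'Content-Length', 'WARC-Date', 'WARC-Type'}
VERSION_RE = re.compile(rb'WARC/(1\.0|1\.1)')

def validate_warc_header_block(data: bytes) -> dict:
    """Validate the version line and header block; raise ValueError on any violation."""
    head, sep, _payload = data.partition(b'\r\n\r\n')
    if not sep:
        raise ValueError('header block not terminated by an empty line')
    lines = head.split(b'\r\n')
    if not VERSION_RE.fullmatch(lines[0]):
        raise ValueError('stream does not begin with a supported WARC version line')
    headers = {}
    for line in lines[1:]:
        # Splitting on CRLF leaves any bare CR or LF inside an element.
        if b'\r' in line or b'\n' in line:
            raise ValueError('bare CR or LF inside header block')
        if line[:1] in (b' ', b'\t'):
            continue  # continuation of the previous header's value
        name, colon, value = line.partition(b':')
        if not colon:
            raise ValueError(f'header line without a colon: {line!r}')
        # UnicodeDecodeError is a subclass of ValueError, so a non-UTF-8
        # header also surfaces as a validation error.
        headers[name.decode('utf-8').strip()] = value.decode('utf-8').strip()
    missing = MANDATORY - headers.keys()
    if missing:
        raise ValueError(f'missing mandatory headers: {sorted(missing)}')
    return headers
```

Run against the samples above, this rejects everything up to and including the record that lacks the empty line, and accepts the ones where only the payload or trailing CRLFCRLF is damaged (which the second pair of checks would then catch).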
I took a quick look at how this could be implemented. warcio uses the same code for WARC and HTTP header parsing, `warcio.statusandheaders.StatusAndHeadersParser`. Unfortunately, HTTP servers are horrible at complying with the specifications, but some clients (browsers) 'fix' those inconsistencies instead of throwing bricks at the developer, and so other clients are forced to be equally flexible when parsing HTTP (e.g. tolerating wrong line breaks). Such flexibility is undesirable for WARC parsing, in my opinion.
I'll take a shot at adding a `strict` kwarg to the parser which requires CRLF line endings, UTF-8 encoding without a fallback to ISO-8859-1, colons except in continuation lines, etc. This could then be enabled for WARC parsing but not for HTTP. I'm not sure about the name of the kwarg, since at least the encoding part does not apply to HTTP even when strictly following the spec, but we can sort that out later in the PR. :-)
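To sketch what such a switch might change, here's a toy single-line parser with a `strict` flag. This is not warcio's actual code; the function name and lenient-mode behavior (ISO-8859-1 fallback, tolerated line endings) are assumptions about how a shared WARC/HTTP parser could branch.

```python
def parse_header_line(raw: bytes, strict: bool = False):
    """Parse one b'Name: value' line; `strict` enforces WARC-grade rules."""
    if strict:
        if not raw.endswith(b'\r\n'):
            raise ValueError('strict mode requires CRLF line endings')
        text = raw[:-2].decode('utf-8')  # no ISO-8859-1 fallback in strict mode
    else:
        stripped = raw.rstrip(b'\r\n')  # tolerate LF-only and CR-only endings
        try:
            text = stripped.decode('utf-8')
        except UnicodeDecodeError:
            text = stripped.decode('iso-8859-1')  # HTTP-style lenient fallback
    name, colon, value = text.partition(':')
    if not colon and not text[:1].isspace():  # continuation lines need no colon
        if strict:
            raise ValueError(f'header line without a colon: {text!r}')
        return text.strip(), ''  # lenient: treat as a name with an empty value
    return name.strip(), value.strip()
```

The same input then parses or raises depending on the flag: `b'Foo: bar\n'` is accepted leniently but rejected in strict mode.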
One interface thing to keep in mind is that looping over an iterator cannot be continued if the iterator raises.
That's why warcio's digest verification has a complicated interface, with 4 options: don't check (default), record problems but carry on, record problems and print something to stderr but carry on, and finally raise on problems.
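To illustrate the looping constraint: once an iterator raises, a plain `for` loop is over, so a caller who wants "everything parsed so far plus the error" has to advance the iterator manually. A generic sketch (the `consume` helper is hypothetical; `ValueError` stands in for whatever exception type a strict parser would raise):

```python
def consume(records):
    """Drain an iterator that may raise; return (records_so_far, error_or_None)."""
    it = iter(records)
    out, error = [], None
    while True:
        try:
            out.append(next(it))
        except StopIteration:
            break  # clean end of the archive
        except ValueError as exc:  # parse error: the iterator cannot be resumed
            error = exc
            break
    return out, error
```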
Right. But should the iterator be resumable if the underlying stream is not a valid WARC file like in the examples above? For digest verification, it makes sense to log digest mismatches and check the rest of the file. Similarly, content type mismatches or invalid payload data (e.g. corrupted images) can and should be handled downstream. But that isn't really possible or sensible if the file isn't a parsable WARC in the first place.
Generic recovery from such a situation isn't possible either. I've had a case in the past where a record in the middle of a WARC was truncated for unknown reasons. Fortunately, the file used per-record compression, so some nasty processing allowed me to find the next record and then produce a new file without the offending record. But that's not possible in the general case because the file might be compressed as a whole rather than per-record or, even worse, gzip member boundaries might be offset entirely compared to the WARC record boundaries. You can't simply decompress everything and then search for `'\r\n\r\nWARC/1.0\r\n'` either, because that string could appear within a record, too.
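For the per-record-compressed case, the "nasty processing" can be approximated by walking gzip members one at a time and resynchronizing on the gzip magic after a bad stretch. This is a hedged sketch, assuming one record per gzip member; `gzip_members` is an illustrative helper, not a warcio API, and (as noted above) it cannot help when the whole file is a single gzip stream.

```python
import zlib

def gzip_members(data: bytes):
    """Yield (offset, payload) for each complete gzip member found in `data`."""
    pos = 0
    while pos < len(data):
        d = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)  # expect gzip framing
        try:
            out = d.decompress(data[pos:])
        except zlib.error:
            # No valid member at this offset: scan forward for the next
            # gzip magic bytes and retry there.
            nxt = data.find(b'\x1f\x8b', pos + 1)
            if nxt == -1:
                return
            pos = nxt
            continue
        if not d.eof:
            return  # truncated member at the end: nothing more to recover
        yield pos, out
        pos = len(data) - len(d.unused_data)  # jump past the consumed member
```

Note the caveat from the text still applies: a `\x1f\x8b` sequence can also occur inside compressed payload bytes, so false resynchronization points are possible; a real recovery tool would additionally validate that the decompressed member starts with a WARC version line.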
I suppose it makes sense to split the points mentioned above into two categories.
First, there are hard parsing errors. These are errors that are absolutely impossible to recover from. For example, if a file doesn't start with `WARC/1.1` + CRLF (or another valid known version), or there isn't an empty line marking the end of the headers, it simply cannot be parsed. A `Content-Length` header would also have to be present, and after reading that many bytes of the payload, there must be a double CRLF.
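The payload half of those hard checks is small enough to sketch directly: read exactly `Content-Length` bytes, then demand the CRLFCRLF record terminator. Again a hypothetical helper, not warcio code:

```python
import io

def read_payload(stream: io.BufferedIOBase, content_length: int) -> bytes:
    """Read a record payload strictly; raise ValueError on truncation."""
    payload = stream.read(content_length)
    if len(payload) != content_length:
        raise ValueError(
            f'payload truncated: got {len(payload)} of {content_length} bytes')
    trailer = stream.read(4)
    if trailer != b'\r\n\r\n':
        raise ValueError(f'record not terminated by CRLFCRLF: {trailer!r}')
    return payload
```

This would reject the last five samples from the list above (truncated payload, missing or partial trailer) that pass the header-block checks.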
Second, there are softer parsing errors. Examples of this include header names that aren't valid UTF-8, missing any of the other required header fields, header lines (that aren't continuation lines) missing a colon, or LF instead of CRLF as line endings. I'm not sure about these. Part of me wants to handle these exactly like hard errors. Accepting files that don't conform to the spec leads to exactly the same issue as we have with HTTP or IRC: over time, more and more content would exist that is not conformant, and so everyone has to adapt to also handle these non-conformant cases without any proper documentation of this stuff. I feel like this is even more problematic in the context of archival formats, which are meant to be preserved forever.
btw, I have a not-quite-finished `develop-warcio-test` branch in the repo that is capable of complaining about soft parsing errors. There are tons of WARCs out there with problems.