webarchive-commons icon indicating copy to clipboard operation
webarchive-commons copied to clipboard

ARCRecord entered inconsistent state for some ARC files

Open tokee opened this issue 10 years ago • 6 comments

If no status could be found, ARCReader#readHttpHeader read forward until next status in the originating ARC-Stream, thereby reading past the current record and into the next one. This was combined with a faulty exclusion of status lines containing colons.

This pull request contains a sample ARC file that triggered the two bugs as well as unit tests for the bugs.

tokee avatar Feb 18 '15 13:02 tokee

This 'off spec' hook looks like an ARC format variation edge case to me (see https://github.com/iipc/webarchive-commons/pull/40/files#diff-1a20f5668b210061c0267c7f079c8830L600), so I'd really like @nlevitt or @kngenie to take a look at this and see if they are happy with the change.

anjackson avatar Feb 18 '15 14:02 anjackson

With the check for empty line (i.e. no more HTTP-response lines), keeping the check for colon just means that individual records are wrongly processed instead of the rest of the ARC file. But I do find it curious to discard well-formed status lines. If the colon-check it kept, maybe a rationale could be given as a comment in the code?

tokee avatar Feb 18 '15 14:02 tokee

@tokee And, just to check, the removal of that colon-test is the only material change you're making right? All the other stuff is in the /test/ area, right?

anjackson avatar Feb 18 '15 14:02 anjackson

Removing of colon and adding of newline-check are the only real changes, yes. We are in the process of testing with 95 known problem-ARCs (~100MB each) locally and I am happy to say that the first 9 has passed without ARC-parse-showstoppers with the new ARCReader.

tokee avatar Feb 18 '15 15:02 tokee

Out test run finished with 94/95 previously partly-skipped ARC-files being fully processed. The last one seems to fail due to the HTTP-header-parser being too greedy (I'll update issue #41 with a description).

@thomasegense informs me that about 3% of our ARC files gets marked as problematic when indexing without the patch, so it seems that colon is not uncommon in HTTP-status lines.

tokee avatar Feb 18 '15 20:02 tokee

Seems to me that this change is all about supporting real arcs which actually conforms to the spec. But since it is possible for this change to let through records to OpenWayback which today would fail, I propose it to be included in a minor release and not the next bugfix release.

johnerikhalse avatar Apr 26 '16 12:04 johnerikhalse