
Not compatible with WARC files/records written by ArchiveSpark

Open parismic opened this issue 4 years ago • 2 comments

warcio raises warcio.exceptions.ArchiveLoadFailed: Invalid WARC record, first line: WARC-Type: response at the second WARC record in a WARC file written with ArchiveSpark. Both projects state that they follow the ISO standard: http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf

warcio also returns a warning before the error:

WARNING: Record not followed by newline, perhaps Content-Length is invalid
Offset: 433
Remainder: b'WARC/1.0\r\n'

Either ArchiveSpark should write an additional empty line between the records, or warcio is not in line with the ISO standard.
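For anyone who wants to check a file themselves: the WARC specification defines a record as the header, a CRLF, the block, and then two CRLFs, so one way to narrow this down is to count the CRLF pairs that actually follow each block. The sketch below uses only the standard library and is not warcio's parser. The function name count_record_trailers is hypothetical, and it assumes an uncompressed, seekable file with correct Content-Length headers.

```python
import io

CRLF = b"\r\n"


def count_record_trailers(stream):
    """Return [(record_offset, crlf_count)] for an uncompressed, seekable WARC stream.

    Hypothetical helper for illustration only; it trusts each Content-Length header.
    """
    results = []
    while True:
        offset = stream.tell()
        version_line = stream.readline()
        if not version_line:
            break  # end of file
        if not version_line.startswith(b"WARC/"):
            raise ValueError(
                f"offset {offset}: expected version line, found {version_line!r}"
            )

        # Read header lines up to the blank line, keeping only Content-Length.
        content_length = None
        while True:
            line = stream.readline()
            if line in (CRLF, b""):
                break
            name, _, value = line.partition(b":")
            if name.strip().lower() == b"content-length":
                content_length = int(value.strip())
        if content_length is None:
            raise ValueError(f"offset {offset}: record has no Content-Length header")

        stream.read(content_length)  # skip the record block

        # Count the CRLF pairs that follow the block; the spec calls for two.
        crlf_count = 0
        while True:
            position = stream.tell()
            if stream.read(2) != CRLF:
                stream.seek(position)  # not a CRLF, so leave it for the next record
                break
            crlf_count += 1
        results.append((offset, crlf_count))
    return results
```

Calling it on a file opened with open(path, "rb") returns one pair per record. A count below 2 for the first record would be consistent with the "Record not followed by newline" warning above and would point at the writer rather than at warcio.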

warcio.statusandheaders.StatusAndHeadersParserException: Expected Status Line starting with ['WARC/1.1', 'WARC/1.0', 'WARC/0.17', 'WARC/0.18'] - Found: WARC-Type: response

I'll post this issue on ArchiveSpark as well. Does anyone know more?

parismic, Jul 13 '21 08:07

It's a pretty straightforward thing to look at if you send us a link to an actual WARC that has this problem.

wumpus, Jul 13 '21 13:07

> It's a pretty straightforward thing to look at if you send us a link to an actual WARC that has this problem.

I tried to open a few WARCs using WebRecorder Player and got this exact error message. I don't know whether they were created with ArchiveSpark, but they may be useful for solving the problem. They can be found here.

slaimon, Apr 26 '24 21:04