WARC files written in ArchiveSpark incompatible with warcio

Open • parismic opened this issue 2 years ago • 0 comments

warcio raises warcio.exceptions.ArchiveLoadFailed: Invalid WARC record, first line: WARC-Type: response at the second WARC record (the one after the warcinfo record) in a WARC file written with ArchiveSpark. Both projects state that they follow the ISO draft: http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf

warcio works fine with WARC files written by Heritrix. I have posted an issue on warcio as well.

warcio also logs a warning before the error:

WARNING: Record not followed by newline, perhaps Content-Length is invalid
Offset: 433
Remainder: b'WARC/1.0\r\n'
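To make the warning easier to interpret, here is a minimal sketch of a separator check for uncompressed WARC data. It is not warcio's implementation; the function name check_record_separators is hypothetical, and it assumes every record carries a Content-Length field and that the file is not gzip-compressed. It reads each record's block and then looks for the two CRLFs that the WARC format places after it, reporting the offset and the next few bytes when they are missing.

```python
import io


def check_record_separators(data: bytes):
    """Return (offset, remainder) for each record not followed by CRLF CRLF.

    Hypothetical helper for illustration; expects uncompressed WARC bytes.
    """
    problems = []
    stream = io.BytesIO(data)
    while True:
        version = stream.readline()
        if not version:
            break  # end of input
        if not version.startswith(b"WARC/"):
            raise ValueError(f"expected WARC version line, found {version!r}")

        # Read the named fields up to the empty line that ends the header.
        content_length = None
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break
            name, _, value = line.partition(b":")
            if name.strip().lower() == b"content-length":
                content_length = int(value.strip())
        if content_length is None:
            raise ValueError("record has no Content-Length field")

        # Skip the record block, then count the CRLFs that follow it.
        stream.read(content_length)
        end_of_block = stream.tell()
        crlf_count = 0
        while stream.read(2) == b"\r\n":
            crlf_count += 1
        # Reposition at the first byte after the CRLFs actually found.
        stream.seek(end_of_block + 2 * crlf_count)

        if crlf_count < 2:
            problems.append((end_of_block, data[end_of_block:end_of_block + 10]))
    return problems
```

Running this on the raw bytes of an affected file would be expected to list the same kind of offset and remainder as the warning above, if missing separators are indeed the cause.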

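It could be that ArchiveSpark should write an additional empty line between the records. The WARC format ends every record with two CRLF sequences after the record block, and the Remainder above shows the next record's WARC/1.0 line at the position where warcio expected that separator.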
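For reference, this is a minimal sketch of what a record would look like when serialized with the trailing separator. It is not ArchiveSpark's code (ArchiveSpark is written in Scala); the function serialize_record is hypothetical, and mandatory fields such as WARC-Record-ID and WARC-Date are left to the caller.

```python
def serialize_record(headers, block: bytes, version: str = "WARC/1.0") -> bytes:
    """Serialize one WARC record as: header, CRLF, block, CRLF CRLF.

    Hypothetical helper for illustration. `headers` is a list of
    (name, value) pairs; Content-Length is computed from the block.
    """
    fields = list(headers) + [("Content-Length", str(len(block)))]
    head = version + "\r\n" + "".join(f"{name}: {value}\r\n" for name, value in fields)
    # An empty line ends the header; two CRLFs end the record.
    return head.encode("utf-8") + b"\r\n" + block + b"\r\n\r\n"
```

Concatenating the output of several such calls gives a file in which each record is followed by the empty lines a reader looks for before the next WARC/1.0 line.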

warcio.statusandheaders.StatusAndHeadersParserException: Expected Status Line starting with ['WARC/1.1', 'WARC/1.0', 'WARC/0.17', 'WARC/0.18'] - Found: WARC-Type: response

parismic, Jul 13 '21 09:07