ArchiveSpark
WARC files written by ArchiveSpark are incompatible with warcio
warcio raises the following exception at the second WARC record (the one after the warcinfo record) in a WARC file written with ArchiveSpark:
warcio.exceptions.ArchiveLoadFailed: Invalid WARC record, first line: WARC-Type: response
Both projects state that they follow the ISO 28500 WARC specification (draft: http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf).
warcio works fine with WARC files written by Heritrix.
I posted an issue on warcio as well.
warcio also returns a warning before the error:
WARNING: Record not followed by newline, perhaps Content-Length is invalid
Offset: 433
Remainder: b'WARC/1.0\r\n'
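The WARC specification terminates every record with two CRLF sequences after the content block. The warning suggests that warcio found the next record's version line (WARC/1.0) where it expected that separator, so it could be that ArchiveSpark needs to write the additional empty lines between records, or that the Content-Length it writes is off.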
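To narrow this down without warcio, the record separators can be checked by hand. The following is a minimal sketch, not warcio's or ArchiveSpark's own logic: `check_record_separators` is a hypothetical helper that assumes the WARC data is already uncompressed and held in memory, walks the records using each header's Content-Length, and reports whether the block is followed by the CRLF CRLF terminator.

```python
def check_record_separators(data: bytes):
    """Return one (offset, ok) pair per record in uncompressed WARC bytes.

    ok is True when the record's content block is followed by CRLF CRLF.
    Hypothetical helper for diagnosis; assumes Content-Length is correct.
    """
    results = []
    pos = 0
    while pos < len(data):
        # The header block ends at the first empty line.
        header_end = data.find(b"\r\n\r\n", pos)
        if header_end == -1:
            raise ValueError(f"offset {pos}: no end of header found")
        lines = data[pos:header_end].decode("utf-8", errors="replace").split("\r\n")
        if not lines[0].startswith("WARC/"):
            raise ValueError(f"offset {pos}: expected version line, found {lines[0]!r}")

        # Look up Content-Length among the named fields.
        length = None
        for line in lines[1:]:
            name, _, value = line.partition(":")
            if name.strip().lower() == "content-length":
                length = int(value.strip())
        if length is None:
            raise ValueError(f"offset {pos}: record has no Content-Length")

        # The content block starts right after the empty line.
        block_end = header_end + 4 + length
        ok = data[block_end:block_end + 4] == b"\r\n\r\n"
        results.append((pos, ok))

        # Skip whatever CRLFs are present so a missing separator
        # does not stop the scan at the first bad record.
        pos = block_end
        while data[pos:pos + 2] == b"\r\n":
            pos += 2
    return results
```

For a gzip-compressed file, the bytes would have to be decompressed first (for example with `gzip.decompress`) before passing them in. If the first entry in the result is `(0, False)`, the warcinfo record is not terminated as the specification requires, which would match the warning at offset 433 above.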
warcio.statusandheaders.StatusAndHeadersParserException: Expected Status Line starting with ['WARC/1.1', 'WARC/1.0', 'WARC/0.17', 'WARC/0.18'] - Found: WARC-Type: response