ClueWeb09 WARC files fail to parse

Open sebastian-nagel opened this issue 4 years ago • 0 comments

The ClueWeb09 dataset WARC files (see sample files) terminate each WARC header line with a single line feed (\n). The WarcParser expects \r\n, as the standard requires, and fails:

Exception in thread "main" org.netpreserve.jwarc.ParsingException: invalid WARC record at position 9: WARC/0.18<-- HERE -->\nWARC-Type: warcinfo\nWARC-Date: 2009-03-...
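To illustrate what tolerating this would involve, here is a minimal sketch of a header line reader that accepts either \r\n or a bare \n. `LenientHeaderLines` and `readLine` are hypothetical names made up for this example; they are not part of jwarc and say nothing about how WarcParser is actually implemented.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of jwarc.
class LenientHeaderLines {

    /**
     * Reads one header line, accepting "\r\n" or a bare "\n" as terminator.
     * Returns the line without its terminator, or null at end of stream.
     */
    static String readLine(InputStream in) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        boolean sawAny = false;
        int b;
        while ((b = in.read()) != -1) {
            sawAny = true;
            if (b == '\n') {
                break; // end of line for both "\n" and "\r\n"
            }
            buf.write(b);
        }
        if (!sawAny) {
            return null; // nothing left to read
        }
        byte[] bytes = buf.toByteArray();
        int len = bytes.length;
        if (len > 0 && bytes[len - 1] == '\r') {
            len--; // drop the CR of a standard CRLF terminator
        }
        return new String(bytes, 0, len, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Mixed terminators: LF after the version line (as in ClueWeb09), CRLF after the next.
        byte[] data = "WARC/0.18\nWARC-Type: warcinfo\r\n\n".getBytes(StandardCharsets.UTF_8);
        InputStream in = new ByteArrayInputStream(data);
        String line;
        while ((line = readLine(in)) != null) {
            System.out.println(line.isEmpty() ? "<end of headers>" : line);
        }
    }
}
```

An empty line marks the end of the header block in both conventions, so a reader like this would handle standard and ClueWeb09-style records alike. Whether such leniency should be the default or opt-in is a separate design question.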

See also #25 for a similar issue regarding HttpParser.

sebastian-nagel, Mar 12 '20 22:03