jwarc
ClueWeb09 WARC files fail to parse
The ClueWeb09 dataset WARC files (see sample files) use a single line feed (\n)
as the line terminator in WARC headers. The WarcParser expects \r\n
(as the standard requires) and fails:
Exception in thread "main" org.netpreserve.jwarc.ParsingException: invalid WARC record at position 9: WARC/0.18<-- HERE -->\nWARC-Type: warcinfo\nWARC-Date: 2009-03-...
See also #25 for a similar issue regarding HttpParser.
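
Until the parser tolerates bare line feeds, one possible workaround is to normalize the header block before handing it to the parser. The sketch below is only an illustration: `HeaderLineEndings` and `normalize` are hypothetical names, not part of jwarc, and it assumes the caller has already isolated the header bytes of a single record (everything up to and including the blank line). It must not be run over the record body, since that would change the payload and invalidate Content-Length.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of the jwarc API.
class HeaderLineEndings {
    /** Rewrites each bare LF in a header block as CRLF; existing CRLF pairs are left alone. */
    static byte[] normalize(byte[] header) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(header.length + 16);
        int prev = -1;
        for (byte b : header) {
            if (b == '\n' && prev != '\r') {
                out.write('\r'); // insert the missing carriage return
            }
            out.write(b);
            prev = b;
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Header bytes shaped like the start of a ClueWeb09 record (LF only).
        byte[] raw = "WARC/0.18\nWARC-Type: warcinfo\n\n".getBytes(StandardCharsets.ISO_8859_1);
        String fixed = new String(normalize(raw), StandardCharsets.ISO_8859_1);
        // Print with visible escapes so the line endings can be inspected.
        System.out.println(fixed.replace("\r", "\\r").replace("\n", "\\n"));
    }
}
```

A fix inside WarcParser itself (accepting either \n or \r\n between header lines) would avoid this preprocessing step entirely.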