sfm-ui
Fix occasional bug in iterating over gzipped WARCs with missing headers
For at least one collection (0287d41512b3492b801db3256112c103), the Twitter REST exporter throws a UnicodeDecodeError. In this case, the Content-Encoding header, which should be set to gzip, was either missing or duplicated with a different value for a certain number of lines in the warc.gz files. In these cases the warcio.WARCIterator class, which warc_iter.py uses to read the WARCs, defaults to a type of reader that does not decode the content properly. In every case tested, that content appears to be an empty bytestring.
Solution: in warc_iter.py, wrap the statement `line = stream.readline().decode('utf-8')` in a try/except block that catches UnicodeDecodeError and skips the line if decoding fails.
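A minimal sketch of that pattern, for illustration only. The surrounding loop in warc_iter.py is not shown in this issue, so the helper name `iter_decoded_lines`, the logging call, and the sample bytes are assumptions rather than the actual code:

```python
import io
import logging

log = logging.getLogger(__name__)


def iter_decoded_lines(stream):
    """Yield UTF-8 decoded lines from a binary stream.

    Hypothetical helper: lines that are not valid UTF-8 (for example,
    content that is still gzip-compressed) are logged and skipped.
    """
    while True:
        raw = stream.readline()
        if not raw:  # empty bytes means end of stream
            break
        try:
            line = raw.decode('utf-8')
        except UnicodeDecodeError:
            log.warning("Skipping undecodable line (%d bytes)", len(raw))
            continue
        yield line


# Made-up sample: two JSON lines around one line of non-UTF-8 bytes.
sample = io.BytesIO(b'{"id": 1}\n\x1f\x8b\x08\x00\n{"id": 2}\n')
decoded = list(iter_decoded_lines(sample))
```

Catching only UnicodeDecodeError, rather than a bare `except`, keeps unrelated failures such as I/O errors visible. Logging each skipped line also makes it possible to check later how many lines were dropped.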