warcat
Tool and library for handling Web ARChive (WARC) files.
According to the WARC 1.0 and 1.1 specifications, "[t]he WARC-Refers-To field shall not be used in ‘warcinfo’, ‘response’, ‘resource’, ‘request’, and ‘continuation’ records". In warcat's source code, `verify_refers_to` does not...
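The quoted rule can be expressed as a small standalone check. This is only a sketch of the rule as quoted, not warcat's `verify_refers_to`; the function name `refers_to_allowed` and its arguments are hypothetical.

```python
# Record types on which WARC-Refers-To must not appear, per the quoted rule.
FORBIDDEN_REFERS_TO_TYPES = frozenset(
    {"warcinfo", "response", "resource", "request", "continuation"}
)


def refers_to_allowed(record_type, header_names):
    """Return False if WARC-Refers-To is present on a record type that forbids it.

    Hypothetical helper: record_type is the WARC-Type value and header_names
    is an iterable of the record's header field names.
    """
    # Field names are compared case-insensitively.
    has_refers_to = any(name.lower() == "warc-refers-to" for name in header_names)
    return not (has_refers_to and record_type in FORBIDDEN_REFERS_TO_TYPES)
```

For example, `refers_to_allowed("revisit", ["WARC-Type", "WARC-Refers-To"])` passes, while the same headers on a `response` record do not.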
I have a WARC which contains an HTTP response whose headers are malformed. Specifically, it's from http://www.assoc-amazon.com/s/link-enhancer?tag=discount039-20&o=1 and this is the data returned: HTTP/1.1 302 Content-Type: text/html nnCoection: close Content-Length:...
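One way a reader could tolerate a status line such as `HTTP/1.1 302`, which carries no reason phrase, is sketched below. The report is truncated, so which part of the response actually trips the parser is an assumption here, and `parse_status_line` is a hypothetical helper, not part of warcat.

```python
import re

# Version, three-digit status code, and an optional reason phrase.
STATUS_LINE = re.compile(r"^(HTTP/\d\.\d)\s+(\d{3})(?:\s+(.*))?$")


def parse_status_line(line):
    """Parse an HTTP status line, accepting a missing reason phrase."""
    match = STATUS_LINE.match(line.strip())
    if match is None:
        raise ValueError("not an HTTP status line: %r" % line)
    version, code, reason = match.groups()
    # A missing reason phrase is returned as an empty string.
    return version, int(code), reason or ""
```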
I'm getting a lot of these errors. Some pages work just fine, and all the WARC files I'm reading contain HTML. The error itself is strange, since `200 ok`...
I was surprised that the example provided in the documentation:
``` python
>>> import warcat.model
>>> warc = warcat.model.WARC()
>>> warc.load('example/at.warc.gz')
>>> len(warc.records)
```
reads everything into memory. And there is no...
WARCs from at least wpull 1.2.3 produce a warning of "Content block length changed from X to Y" for warcinfo records. Example: > wpull --version 1.2.3 > wpull https://example.org/ --warc-file...
More accurately, how am I supposed to handle a "file" that is really just a bunch of bytes? Ideally, I would like to use a `BinaryIO` object, however, these don't...
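If an API only accepts a filename, one common workaround is to spill the bytes to a temporary file and pass its path. Whether warcat requires a path is an assumption here, since the report is truncated; `bytes_to_temp_path` is a hypothetical helper.

```python
import os
import tempfile


def bytes_to_temp_path(data, suffix=".warc"):
    """Write data to a temporary file and return its path.

    The caller is responsible for deleting the file afterwards.
    """
    # delete=False keeps the file on disk after the handle is closed.
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as handle:
        handle.write(data)
    return handle.name
```

After use, the file can be removed with `os.remove(path)`.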
Currently warcat gives the following error on revisit records from a deduplicated WARC:
```
Record failed validation
Traceback (most recent call last):
  File "/usr/local/lib/python3.4/dist-packages/warcat/tool.py", line 282, in action
    action(record)
  File "/usr/local/lib/python3.4/dist-packages/warcat/tool.py",...
```
This would be useful for grabs where the exact same images are grabbed with different URLs. There should be a revisit record from a URL to a duplicated URL. Duplicated...
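The idea can be sketched as grouping responses by payload digest and pointing each later duplicate at the first URL seen with the same payload. This is an illustration only, not an existing warcat feature: `payload_digest` and `find_duplicates` are hypothetical names, and the `sha1:` plus Base32 digest form is assumed because it is a common convention for payload digests in WARC files.

```python
import base64
import hashlib


def payload_digest(payload):
    """Return a 'sha1:<base32>' digest string for the payload bytes."""
    digest = hashlib.sha1(payload).digest()
    return "sha1:" + base64.b32encode(digest).decode("ascii")


def find_duplicates(responses):
    """Map duplicate payloads to the first URL that carried them.

    responses is an iterable of (url, payload_bytes) pairs. Returns a list of
    (duplicate_url, original_url) pairs, i.e. candidates for revisit records.
    """
    first_seen = {}
    revisits = []
    for url, payload in responses:
        # Remember the first URL for each digest; later ones are duplicates.
        original = first_seen.setdefault(payload_digest(payload), url)
        if original != url:
            revisits.append((url, original))
    return revisits
```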
See https://github.com/chfoo/warcat/issues/10#issuecomment-196939147 for details