warcat

Tool and library for handling Web ARChive (WARC) files.

Results 15 warcat issues

According to the WARC 1.0 and 1.1 specifications "[t]he WARC-Refers-To field shall not be used in ‘warcinfo’, ‘response’, ‘resource’, ‘request’, and ‘continuation’ records". In warcat's source code, verify_refers_to does not...

bug
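The rule quoted in the issue above can be expressed as a small check. The following is a minimal sketch, not warcat's own `verify_refers_to`: the function name `refers_to_allowed` and the plain-dict header representation are assumptions made for illustration.

```python
# Record types on which WARC-Refers-To must not be used, taken from the
# specification text quoted in the issue above.
FORBIDDEN_REFERS_TO_TYPES = frozenset(
    {"warcinfo", "response", "resource", "request", "continuation"}
)


def refers_to_allowed(headers):
    """Return False if WARC-Refers-To appears on a record type that forbids it.

    `headers` is assumed to be a plain dict mapping WARC field names to
    string values (a hypothetical shape, not warcat's internal model).
    """
    # WARC field names are case-insensitive, so normalise them before lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    if "warc-refers-to" not in normalized:
        return True
    record_type = normalized.get("warc-type", "").strip()
    return record_type not in FORBIDDEN_REFERS_TO_TYPES
```

A validator built this way would flag a `response` record that carries `WARC-Refers-To` while accepting the same field on a `revisit` record.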

I have a WARC which contains an HTTP response whose headers are malformed. Specifically, it's from http://www.assoc-amazon.com/s/link-enhancer?tag=discount039-20&o=1 and this is the data returned: HTTP/1.1 302 Content-Type: text/html nnCoection: close Content-Length:...

bug

I'm getting a lot of these errors - some pages work just fine, all the warc files I'm reading have HTML, the error itself is strange enough since `200 ok`...

I was surprised that the example provided in the documentation:

``` python
>>> import warcat.model
>>> warc = warcat.model.WARC()
>>> warc.load('example/at.warc.gz')
>>> len(warc.records)
```

reads everything into memory. And there is no...

enhancement
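For comparison, record-at-a-time reading can be sketched with the standard library alone. This is an illustration of the streaming idea, not warcat's API: `iter_warc_records` is a hypothetical helper, it assumes unfolded `Name: value` header lines, and it still holds one record's content block in memory at a time.

```python
import gzip
import io


def iter_warc_records(stream):
    """Yield (headers, block) for each record in a binary WARC stream.

    Only one record is held in memory at a time. Folded (multi-line)
    header values are not handled in this sketch.
    """
    while True:
        line = stream.readline()
        # Skip the blank lines that separate consecutive records.
        while line in (b"\r\n", b"\n"):
            line = stream.readline()
        if not line:
            return  # End of stream.
        if not line.startswith(b"WARC/"):
            raise ValueError("expected a WARC version line, got %r" % line)

        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # A blank line ends the header section.
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()

        # The content block is exactly Content-Length bytes long.
        block = stream.read(int(headers["Content-Length"]))
        yield headers, block
```

Assuming a file such as the `example/at.warc.gz` mentioned above, records could then be counted without building a full list:

```python
with gzip.open('example/at.warc.gz', 'rb') as stream:
    count = sum(1 for _ in iter_warc_records(stream))
```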

WARCs from at least wpull 1.2.3 produce a warning of "Content block length changed from X to Y" for warcinfo records. Example: > wpull --version 1.2.3 > wpull https://example.org/ --warc-file...

bug

More accurately, how am I supposed to handle a "file" that is really just a bunch of bytes? Ideally, I would like to use a `BinaryIO` object, however, these don't...

bug
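Two standard-library workarounds for data that exists only as bytes are sketched below. Whether warcat accepts either form is not stated in the excerpt above; the helper names `bytes_as_stream` and `bytes_as_temp_path` are hypothetical.

```python
import io
import os
import tempfile


def bytes_as_stream(data):
    """Wrap raw bytes in a seekable, file-like binary object."""
    return io.BytesIO(data)


def bytes_as_temp_path(data, suffix=".warc"):
    """Write raw bytes to a temporary file and return its path.

    Useful when an API only accepts a filename. The caller is
    responsible for deleting the file afterwards.
    """
    fd, path = tempfile.mkstemp(suffix=suffix)
    with os.fdopen(fd, "wb") as handle:
        handle.write(data)
    return path
```

The in-memory stream avoids touching the disk, while the temporary file works with any interface that insists on a path.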

Currently warcat gives the following error on revisit records from a deduplicated WARC:

```
Record failed validation
Traceback (most recent call last):
  File "/usr/local/lib/python3.4/dist-packages/warcat/tool.py", line 282, in action
    action(record)
  File "/usr/local/lib/python3.4/dist-packages/warcat/tool.py",...
```

enhancement

This would be useful for grabs where the exact same images are grabbed with different URLs. There should be a revisit record from a URL to a duplicated URL. Duplicated...

enhancement

See https://github.com/chfoo/warcat/issues/10#issuecomment-196939147 for details

bug