warcat issues

No mention of 'resource' in list at verify_refers_to

2

According to the WARC 1.0 and 1.1 specifications "[t]he WARC-Refers-To field shall not be used in ‘warcinfo’, ‘response’, ‘resource’, ‘request’, and ‘continuation’ records". In warcat's source code, verify_refers_to does not...

RvanVeenendaal

bug
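The rule quoted in this report can be sketched as a small standalone check. This is an illustration only, not warcat's `verify_refers_to`; the names `FORBIDDEN_REFERS_TO_TYPES`, `refers_to_allowed`, and `check_refers_to` are hypothetical, and record headers are assumed to be available as a plain dict.

```python
# Record types that must not carry WARC-Refers-To, per the spec text quoted above.
FORBIDDEN_REFERS_TO_TYPES = frozenset(
    {"warcinfo", "response", "resource", "request", "continuation"}
)


def refers_to_allowed(warc_type):
    """Return True if a record of this WARC-Type may carry WARC-Refers-To."""
    return warc_type.strip() not in FORBIDDEN_REFERS_TO_TYPES


def check_refers_to(headers):
    """Return a list of problems found (empty when the record is fine).

    `headers` is assumed to be a dict of WARC header names to values.
    """
    warc_type = headers.get("WARC-Type", "")
    if "WARC-Refers-To" in headers and not refers_to_allowed(warc_type):
        return ["WARC-Refers-To is not permitted in %r records" % warc_type]
    return []
```

With this sketch, a 'resource' record that carries WARC-Refers-To yields one problem, while a 'revisit' record with the same field yields none.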

Malformed HTTP headers lead to "ValueError: need more than 1 value to unpack" crash

1

I have a WARC which contains an HTTP response whose headers are malformed. Specifically, it's from http://www.assoc-amazon.com/s/link-enhancer?tag=discount039-20&o=1 and this is the data returned: HTTP/1.1 302 Content-Type: text/html nnCoection: close Content-Length:...

JustAnotherArchivist

bug
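The truncated report does not show which line triggers the unpack error, so the following is only a sketch of one defensive approach: split with `str.partition` and a padded `split`, which never raise on missing pieces. The helper names `parse_status_line` and `parse_header_lines` are hypothetical and are not part of warcat.

```python
def parse_status_line(line):
    """Return (version, code, reason); missing parts become empty strings.

    A status line such as 'HTTP/1.1 302' has no reason phrase, so a plain
    three-way tuple unpack of line.split() would fail on it.
    """
    parts = line.strip().split(None, 2)
    parts += [""] * (3 - len(parts))
    return tuple(parts)


def parse_header_lines(lines):
    """Split raw header lines into (name, value) pairs without raising.

    Lines with no colon, or with an empty name, are collected in a separate
    list so the caller can log or skip them.
    """
    fields, malformed = [], []
    for line in lines:
        name, sep, value = line.partition(":")
        if not sep or not name.strip():
            malformed.append(line)
            continue
        fields.append((name.strip(), value.strip()))
    return fields, malformed
```

Collecting malformed lines instead of discarding them keeps the original bytes available for archival tools that need to report what the server actually sent.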

http.client.BadStatusLine: http/1.1 200 OK

I'm getting a lot of these errors - some pages work just fine, all the warc files I'm reading have HTML, the error itself is strange enough since `200 ok`...

chris-aeviator

Add easy way to iterate over warc records

3

I was surprised that example provided in documentation: ``` python >>> import warcat.model >>> warc = warcat.model.WARC() >>> warc.load('example/at.warc.gz') >>> len(warc.records) ``` Reads everything into memory. And there is no...

sirex

enhancement
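The kind of record-at-a-time iteration requested here can be sketched with the standard library alone. This is not warcat's API; `iter_warc_records` is a hypothetical generator that assumes well-formed records with a `Content-Length` field, and it keeps only the last value of any repeated header.

```python
import io


def iter_warc_records(stream):
    """Yield (headers, block) for each record read from a binary stream.

    Only one record's block is held in memory at a time.
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError("expected a WARC version line, got %r" % line)
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # a blank line ends the header section
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip()] = value.strip()
        length = int(headers.get("Content-Length", "0"))
        yield headers, stream.read(length)


# Made-up sample data: two tiny records held in an in-memory stream.
sample = b"WARC/1.0\r\nWARC-Type: resource\r\nContent-Length: 5\r\n\r\nhello\r\n\r\n" * 2
for headers, block in iter_warc_records(io.BytesIO(sample)):
    print(headers["WARC-Type"], len(block))
```

For a compressed file on disk, the same generator can be given the object returned by `gzip.open(path, "rb")`, since it only relies on `readline()` and `read()`.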

wpull WARCs cause "Content block length changed from X to Y" warnings on warcinfo record

WARCs from at least wpull 1.2.3 produce a warning of "Content block length changed from X to Y" for warcinfo records. Example: > wpull --version 1.2.3 > wpull https://example.org/ --warc-file...

JustAnotherArchivist

bug

Handling for "files" that are purely in memory?

2

More accurately, how am I supposed to handle a "file" that is really just a bunch of bytes? Ideally, I would like to use a `BinaryIO` object, however, these don't...

spott

bug

Support payload digest of revisit records

1

Currently warcat gives the following error on revisit records from a deduplicated WARC: ```Record failed validation Traceback (most recent call last): File "/usr/local/lib/python3.4/dist-packages/warcat/tool.py", line 282, in action action(record) File "/usr/local/lib/python3.4/dist-packages/warcat/tool.py",...

Arkiver2

enhancement

URL agnostic deduplication of WARC

This would be useful for grabs where the exact same images are grabbed with different URLs. There should be a revisit record from a URL to a duplicated URL. Duplicated...

Arkiver2

enhancement
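One way to read this request is as an index keyed by payload digest rather than by URL. The sketch below only plans which captures would become revisit records; it does not write WARC data, and `payload_digest` and `plan_deduplication` are hypothetical names. SHA-1 with Base32 encoding is assumed here because it is a common form for payload digests, not because the issue specifies it.

```python
import base64
import hashlib


def payload_digest(payload):
    """Return a 'sha1:<base32>' label for the payload bytes."""
    raw = hashlib.sha1(payload).digest()
    return "sha1:" + base64.b32encode(raw).decode("ascii")


def plan_deduplication(captures):
    """Decide, per (url, payload) capture, whether to keep it or revisit.

    Returns a list of (url, action, original_url) tuples, where action is
    'response' for the first copy of a payload and 'revisit' for later
    copies, regardless of the URL they were fetched from.
    """
    first_seen = {}  # digest -> URL of the first capture with that payload
    plan = []
    for url, payload in captures:
        digest = payload_digest(payload)
        if digest in first_seen:
            plan.append((url, "revisit", first_seen[digest]))
        else:
            first_seen[digest] = url
            plan.append((url, "response", None))
    return plan
```

Keying on the digest alone means two different URLs that return identical bytes collapse to one stored payload, which is the behaviour the request describes.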

A name to a file object is not handled correctly

See https://github.com/chfoo/warcat/issues/10#issuecomment-196939147 for details

chfoo

bug

Add 'warcat' console_scripts entry point; also ignore *.egg-info

No mention of 'resource' in list at verify_refers_to