warcat issues

Extract performance is extremely slow on megawarcs

1

I was recently working with a megawarc from the Google Reader crawl of 25GB or so in size on an Amazon EC2 server. This took a few hours to download,...

gwern

help wanted

Feature: extract WARCs specified with index/length

1

In some of the mega WARCs produced by Archive Team, extracting all the WARCs to save just a few is infeasible as it can take at least 2 days to...

gwern

enhancement
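
For illustration, the kind of index/length extraction this request describes can be sketched in a few lines of Python. This is not warcat's API: the function name `copy_byte_range` and its parameters are hypothetical, and the sketch assumes the byte offset and length of the wanted WARC inside the megawarc are already known from some external index.

```python
def copy_byte_range(src_path, dest_path, offset, length, chunk_size=1024 * 1024):
    """Copy `length` bytes starting at byte `offset` of src_path into dest_path.

    Hypothetical helper, not part of warcat. Returns the number of bytes
    actually copied, which is smaller than `length` if the file ends early.
    """
    remaining = length
    with open(src_path, "rb") as src, open(dest_path, "wb") as dest:
        src.seek(offset)  # jump straight to the wanted range, no scanning
        while remaining > 0:
            chunk = src.read(min(chunk_size, remaining))
            if not chunk:
                break  # reached end of file before `length` bytes were read
            dest.write(chunk)
            remaining -= len(chunk)
    return length - remaining
```

Because the file is opened once and read from a seek position, the cost depends on the size of the requested range rather than on the size of the whole megawarc. If the range covers one or more complete gzip members, the output can be decompressed on its own.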

Feature: extract only files matching a regexp

In dealing with a megawarc, any reasonably broad set of results will have many hits, possibly too many to hand-write dd calls to extract efficiently (see https://github.com/chfoo/warcat/issues/7 ). It would...

gwern

enhancement

Support older Python 2.7

2

Since chfoo/wpull successfully supports Python 2 using the latest lib3to2, Warcat should be straightforward to backport.

chfoo

enhancement

Support warnings when WARC field name casing doesn't match hanzo's warc-tools.

1

For example, hanzo's warc-tools expects `WARC-Type` and not `Warc-Type`. The ISO spec says that field names are case-insensitive, but implementations may not follow the spec closely. The verify should warn...

chfoo

enhancement
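
A minimal sketch of such a casing check, in Python, might look like the following. It is not warcat's verify implementation: `casing_warnings` and `CANONICAL_FIELDS` are hypothetical names, and the field list is only an illustrative subset of WARC named fields in their conventional spelling.

```python
# Illustrative subset of WARC named fields in their conventional spelling.
CANONICAL_FIELDS = [
    "WARC-Type",
    "WARC-Record-ID",
    "WARC-Date",
    "WARC-Target-URI",
    "Content-Type",
    "Content-Length",
]
_CANONICAL_BY_LOWER = {name.lower(): name for name in CANONICAL_FIELDS}


def casing_warnings(header_block):
    """Return warnings for known field names whose casing is unconventional.

    Hypothetical helper, not part of warcat. `header_block` is the text of
    one WARC record header, including the version line.
    """
    warnings = []
    for line in header_block.splitlines():
        # Skip blank lines, folded continuation lines, and the version line.
        if not line or line[0] in " \t" or ":" not in line:
            continue
        name = line.split(":", 1)[0].strip()
        canonical = _CANONICAL_BY_LOWER.get(name.lower())
        if canonical is not None and name != canonical:
            warnings.append(
                f"field name {name!r} should be spelled {canonical!r}"
            )
    return warnings


header = "WARC/1.0\r\nWarc-Type: response\r\nWARC-Date: 2013-07-01T00:00:00Z\r\n"
for message in casing_warnings(header):
    print(message)
```

The lookup is case-insensitive, as the spec allows, so a record with `Warc-Type` is still recognised. It is only reported as a warning, not rejected.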
