warcat

Tool and library for handling Web ARChive (WARC) files.

Results 15 warcat issues

I was recently working with a megawarc from the Google Reader crawl, about 25 GB in size, on an Amazon EC2 server. This took a few hours to download,...

help wanted

In some of the mega WARCs produced by Archive Team, extracting all the WARCs to save just a few is infeasible as it can take at least 2 days to...

enhancement

In dealing with a megawarc, any reasonably broad set of results will have many hits, possibly too many to hand-write dd calls to extract efficiently (see https://github.com/chfoo/warcat/issues/7 ). It would...

enhancement
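
The hand-written dd calls mentioned in this issue amount to copying one byte range out of the megawarc. A minimal Python sketch of that step, assuming the offset and length of each wanted record are already known (for example, from an index). `extract_range` is a hypothetical helper for illustration and is not part of warcat:

```python
def extract_range(src_path, dest_path, offset, length, chunk_size=1024 * 1024):
    """Copy `length` bytes starting at `offset` from src_path into dest_path.

    Hypothetical helper, roughly what a dd call with skip/count would do.
    Returns the number of bytes actually copied.
    """
    remaining = length
    with open(src_path, "rb") as src, open(dest_path, "wb") as dest:
        src.seek(offset)
        while remaining > 0:
            # Read in bounded chunks so a large record never sits in memory whole.
            chunk = src.read(min(chunk_size, remaining))
            if not chunk:
                break  # source ended before `length` bytes were available
            dest.write(chunk)
            remaining -= len(chunk)
    return length - remaining
```

Calling this once per (offset, length) pair in a loop would replace the individual dd invocations; where those pairs come from is left open here.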

With chfoo/wpull being a success at supporting Python 2 using the latest lib3to2, Warcat shouldn't have problems with being backported.

enhancement

For example, hanzo's warc-tools expects `WARC-Type` and not `Warc-Type`. The ISO spec says that field names are case-insensitive, but implementations may not follow the spec closely. The verify should warn...

enhancement
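
The kind of check described in this issue could look roughly like the following. This is only a sketch under assumptions: the function name `check_field_name` and the small table of canonical spellings are made up for illustration and do not reflect warcat's actual verify code.

```python
# Canonical spellings for a few standard WARC named fields (illustrative subset).
CANONICAL_FIELDS = {
    "warc-type": "WARC-Type",
    "warc-record-id": "WARC-Record-ID",
    "warc-date": "WARC-Date",
    "content-length": "Content-Length",
}


def check_field_name(name):
    """Return a warning if `name` differs from the canonical spelling only by case.

    Returns None when the name is already canonical or is not in the table.
    """
    canonical = CANONICAL_FIELDS.get(name.lower())
    if canonical is not None and canonical != name:
        return "field name %r is valid, but some tools expect %r" % (name, canonical)
    return None
```

With this table, `check_field_name("Warc-Type")` yields a warning that mentions `WARC-Type`, while `check_field_name("WARC-Type")` returns `None`. The name is still treated as valid, since field names are case-insensitive; the warning only flags a possible interoperability problem.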