warcat
Tool and library for handling Web ARChive (WARC) files.
I was recently working with a megawarc of about 25 GB from the Google Reader crawl on an Amazon EC2 server. This took a few hours to download,...
In some of the mega WARCs produced by Archive Team, extracting all the WARCs to save just a few is infeasible as it can take at least 2 days to...
When dealing with a megawarc, any reasonably broad query will return many hits, possibly too many to extract efficiently with hand-written dd calls (see https://github.com/chfoo/warcat/issues/7). It would...
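As a rough illustration of what scripting those dd calls could look like, the sketch below copies one byte range out of a large file. It assumes the offset and length of each wanted record are already known (for example from an index); `extract_range` and its parameters are hypothetical names for this example and are not part of Warcat.

```python
def extract_range(src_path, dst_path, offset, length, chunk_size=1024 * 1024):
    """Copy `length` bytes starting at `offset` from src_path into dst_path.

    Hypothetical helper, roughly equivalent to a single dd call with
    skip/count measured in bytes. Returns the number of bytes copied,
    which is less than `length` if the source file ends early.
    """
    remaining = length
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        src.seek(offset)
        while remaining > 0:
            # Read in bounded chunks so large records do not fill memory.
            chunk = src.read(min(chunk_size, remaining))
            if not chunk:
                break  # reached end of file before `length` bytes
            dst.write(chunk)
            remaining -= len(chunk)
    return length - remaining
```

Calling this in a loop over a list of (offset, length) pairs would replace one dd invocation per hit.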
Since chfoo/wpull successfully supports Python 2 using the latest lib3to2, backporting Warcat should not be a problem.
For example, hanzo's warc-tools expects `WARC-Type` and not `Warc-Type`. The ISO spec says that field names are case-insensitive, but implementations may not follow the spec closely. The verify should warn...
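The kind of check described here could be sketched as follows. This is only an illustration and not Warcat's actual verify logic: the function name `find_case_mismatches` is hypothetical, and the short list of canonical field names is an assumption chosen for the example.

```python
# Canonical spellings for a few common WARC header fields.
# This list is an illustrative assumption, not an exhaustive set.
CANONICAL_NAMES = (
    "WARC-Type",
    "WARC-Record-ID",
    "WARC-Date",
    "WARC-Target-URI",
    "Content-Type",
    "Content-Length",
)
_BY_LOWER = {name.lower(): name for name in CANONICAL_NAMES}


def find_case_mismatches(header_block):
    """Return (found, canonical) pairs for field names with unusual casing.

    Hypothetical helper: names that match a canonical field only when
    compared case-insensitively are reported so a caller can warn.
    """
    mismatches = []
    for line in header_block.splitlines():
        # Skip the version line, blank lines, and folded continuation lines.
        if ":" not in line or line[:1] in (" ", "\t"):
            continue
        name = line.split(":", 1)[0].strip()
        canonical = _BY_LOWER.get(name.lower())
        if canonical is not None and name != canonical:
            mismatches.append((name, canonical))
    return mismatches
```

A header containing `Warc-Type: response` would be reported as a mismatch against `WARC-Type`, while unknown field names are ignored rather than flagged.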