warcat
Feature: extract only files matching a regexp
When dealing with a megawarc, any reasonably broad search will return many hits, possibly too many to extract efficiently by hand-writing dd calls (see https://github.com/chfoo/warcat/issues/7 ).
It would be useful if you could pass warcat a regexp like .*foo\.wordpress\.com.* to extract all files in a megawarc dealing with a particular website. This can be approximated by telling warcat to extract all files and then deleting non-matches with find or other shell script approaches, but at the cost of far more disk IO, temporary storage, and having to work with find. (It might also be faster, aside from the disk IO reduction, depending on whether the format stores filenames and warcat can skip over all non-matching warcs. I don't know the details there.)
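To illustrate the kind of filtering I mean, here is a rough standalone sketch in plain Python. It does not use warcat's API at all; the function names `iter_warc_records` and `matching_records` are made up for this example. It assumes an uncompressed WARC stream in which each record of interest carries a `WARC-Target-URI` header, and it ignores folded (multi-line) headers. It still reads every payload, so it only saves the write side of the disk IO, not the read side.

```python
import io
import re


def iter_warc_records(stream):
    """Yield (headers, payload) for each record in an uncompressed WARC stream.

    Hypothetical helper for illustration; not part of warcat.
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of input
        if not line.strip():
            continue  # blank lines that separate records
        if not line.startswith(b"WARC/"):
            raise ValueError("expected a WARC version line, got %r" % line)

        # Read "Name: value" header lines up to the blank line.
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()

        # The payload is exactly Content-Length bytes long.
        length = int(headers.get("content-length", "0"))
        yield headers, stream.read(length)


def matching_records(stream, pattern):
    """Yield (uri, payload) for records whose target URI matches pattern."""
    regex = re.compile(pattern)
    for headers, payload in iter_warc_records(stream):
        uri = headers.get("warc-target-uri", "")
        if regex.search(uri):
            yield uri, payload
```

Something like `matching_records(open("some.warc", "rb"), r"foo\.wordpress\.com")` would then yield only the records for that site (the file name here is a placeholder). For a gzipped file, wrapping it with `gzip.open(path, "rb")` should work the same way, since Python's gzip module reads concatenated members as one stream.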