Consider an option to generate WACZ files after a crawl is done for better replay with ReplayWeb.page

Open • ikreymer opened this issue 4 years ago • 1 comment

It would be helpful for folks using grab-site and then replaying via replayweb.page to have grab-site generate a WACZ file after the crawl is done. (This workflow is mentioned in webrecorder/replayweb.page#6)

WACZ (https://github.com/webrecorder/wacz-format) provides a way to package the WARC, CDX and an optional page list into a single file (a zip file) such that it can be loaded quickly for replay.

The Python wacz library (https://pypi.org/project/wacz) can be used to create the WACZ package (https://github.com/webrecorder/wacz-format/tree/main/py-wacz)

I think grab-site should just be able to call the create command from: https://github.com/webrecorder/wacz-format/blob/main/py-wacz/wacz/main.py#L19

It might make sense to pass in a page list, and there is an experimental option to do full-text extraction on pages as well.
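As a rough sketch of what a post-crawl step could look like, the snippet below shells out to the `wacz create` command line instead of importing the library. Several things here are assumptions rather than anything confirmed in this thread: that the `wacz` executable is on `PATH`, that it accepts `-o` for the output file, `-p` for a page list and `-t` for text extraction, and that the crawl directory holds its WARCs as `*.warc.gz`. The helper names `build_wacz_command` and `package_crawl` are hypothetical and not part of grab-site or py-wacz.

```python
import shutil
import subprocess
from pathlib import Path


def build_wacz_command(warcs, output, pages=None, text=False):
    """Build the argv list for `wacz create`.

    The flag names (-o, -p, -t) are assumptions about the wacz CLI;
    check `wacz create --help` for the installed version.
    """
    cmd = ["wacz", "create", "-o", str(output)]
    if pages is not None:
        # Optional page list to embed in the package.
        cmd += ["-p", str(pages)]
    if text:
        # Experimental full-text extraction.
        cmd.append("-t")
    cmd += [str(w) for w in warcs]
    return cmd


def package_crawl(crawl_dir, pages=None, text=False):
    """Package every *.warc.gz in crawl_dir into <crawl_dir name>.wacz."""
    crawl_dir = Path(crawl_dir)
    # Assumption: the crawl's WARCs sit directly in the crawl directory.
    warcs = sorted(crawl_dir.glob("*.warc.gz"))
    if not warcs:
        raise FileNotFoundError(f"no .warc.gz files found in {crawl_dir}")
    if shutil.which("wacz") is None:
        raise RuntimeError("the wacz command was not found on PATH")
    output = crawl_dir / (crawl_dir.name + ".wacz")
    # check=True raises CalledProcessError if wacz exits with an error.
    subprocess.run(build_wacz_command(warcs, output, pages, text), check=True)
    return output
```

Calling `package_crawl("example.com-2021-02-21")` (a made-up directory name) after a crawl finishes would then leave a `.wacz` file next to the WARCs, ready to load into ReplayWeb.page. Keeping the command construction in its own function makes it easy to adjust if the CLI flags differ.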

The library is still new, so we can definitely make any changes needed to support integration!

ikreymer, Feb 21 '21 20:02

grab-site currently doesn't really have anyone developing it (I just try to keep the install steps working), but I have no objections to the addition of WACZ support.

ivan, Feb 23 '21 03:02