grab-site
Consider an option to generate WACZ files after a crawl is done for better replay with ReplayWeb.page
It would be helpful for folks using grab-site and then replaying via replayweb.page to have grab-site generate a WACZ file after the crawl is done. (This workflow is mentioned in webrecorder/replayweb.page#6)
WACZ (https://github.com/webrecorder/wacz-format) provides a way to package the WARC, CDX and an optional page list into a single file (a zip file) such that it can be loaded quickly for replay.
The Python wacz library (https://pypi.org/project/wacz) can be used to create the WACZ package (https://github.com/webrecorder/wacz-format/tree/main/py-wacz)
I think it should be enough to call the create command from: https://github.com/webrecorder/wacz-format/blob/main/py-wacz/wacz/main.py#L19
It might make sense to pass in a page list, and there is an experimental option to do full-text extraction on pages as well.
The library is still new, so we can definitely make any changes needed to support integration!
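To make the idea concrete, here is a rough sketch of what a post-crawl step could look like if it shelled out to the wacz command-line tool rather than importing the library. The helper names (find_warcs, build_wacz_command, package_crawl) are hypothetical and not part of grab-site or wacz. The flags used (-o for the output file, -p for a page list, -t for text extraction, WARC paths as positional arguments) are my assumption based on the wacz README and may differ between versions, so check `wacz create --help` first.

```python
import shutil
import subprocess
from pathlib import Path


def find_warcs(crawl_dir):
    """Return the WARC files directly inside a crawl directory, sorted by name."""
    root = Path(crawl_dir)
    return sorted(
        p for p in root.iterdir() if p.name.endswith((".warc", ".warc.gz"))
    )


def build_wacz_command(warcs, output, pages=None, text=False):
    """Build the argv list for `wacz create` (hypothetical helper).

    The flag names are assumptions taken from the wacz README and
    should be verified against the installed version.
    """
    cmd = ["wacz", "create", "-o", str(output)]
    if pages is not None:
        cmd += ["-p", str(pages)]  # assumed: path to a page list file
    if text:
        cmd.append("-t")  # assumed: experimental full-text extraction
    cmd += [str(w) for w in warcs]  # WARC inputs as positional arguments
    return cmd


def package_crawl(crawl_dir, output, pages=None, text=False):
    """Package every WARC in crawl_dir into one WACZ file."""
    warcs = find_warcs(crawl_dir)
    if not warcs:
        raise FileNotFoundError(f"no WARC files found in {crawl_dir}")
    if shutil.which("wacz") is None:
        raise RuntimeError("wacz CLI not found; try `pip install wacz`")
    subprocess.run(build_wacz_command(warcs, output, pages, text), check=True)
```

Something like package_crawl("path/to/crawl-dir", "crawl.wacz") could then run once the crawl has finished. Building the argument list separately from running it keeps the command easy to inspect or log before anything is executed.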
grab-site currently doesn't really have anyone developing it (I just try to keep the install steps working), but I have no objections to the addition of WACZ support.