Obtaining Screenshot Image Files After Crawl
Summary
Can someone please point me to the best approach for dumping screenshots generated by a browsertrix crawl to a directory of image files? Thank you in advance :)
Background
I am attempting to use the Browsertrix crawler to capture screenshots of each page visited during the crawl. I'm running the process on an ARM-based Mac and using docker compose to orchestrate and kick off the process:
docker-compose run crawler crawl --url https://www.jacksonhewitt.com --generateWACZ \
--limit 50 --collection jacksonhewitt --workers 5 --screenshot fullPage \
--windowSize 1400,900
I can see in the console that screenshots are being captured and written to archive/screenshots.warc.gz:
"context":"general","message":"Screenshot (type: fullPage) for https://www.jacksonhewitt.com
/tax-help/irs/irs-audits-notices/irs-notice-cp42/ written to /crawls/collections/
jacksonhewitt/archive/screenshots.warc.gz"
I confirmed using replayweb.page that the WARC archive contains the screenshots I'm after. For the life of me, however, I cannot figure out how to pull the captured images out of the WARC file into a zipped directory of images.
Hi @thegrif, there are a few tools for inspecting WARC files and extracting content from them - I was just looking at https://nlnwa.github.io/warchaeology/ by the Norwegian National Library and believe the warc cat command could do what you're looking for. I haven't tested it in a while but I think 7-zip might also be able to extract content from WARCs.
Other tools to look at/try might be:
- https://github.com/chfoo/warcat
- https://github.com/recrm/ArchiveTools
Hope that helps!
@thegrif now that I'm testing this myself, I see your difficulty! Even with the tools I linked above, it seems that resource records like the screenshots aren't necessarily exported. I'm going to continue investigating this and will report back on the simplest method(s) I find for pulling the screenshots out of the warc.
FWIW, this seems out of scope for the crawler. We have discussed improvements to ReplayWeb.page, which could include filtering and downloading resources of a particular type. This would probably be the most friendly solution, short of building a separate tool.
I've tested the new https://github.com/chfoo/warcat-rs by @chfoo and it does the job perfectly with the extract command. The extracted files are missing file extensions, but adding those is easier than opening the screenshot WARC in replayweb.page and copying the images out with an image editor.