browsertrix-crawler

Obtaining Screenshot Image Files After Crawl

Open ghost opened this issue 2 years ago • 4 comments

Summary

Can someone please point me to the best approach for dumping screenshots generated by a browsertrix crawl to a directory of image files? Thank you in advance :)

Background

I am attempting to use the browsertrix crawler to capture screenshots of each page visited during the crawl. I'm running the process on an ARM-based Mac and using docker compose to orchestrate and kick off the process:

docker-compose run crawler crawl --url https://www.jacksonhewitt.com --generateWACZ \
	--limit 50 --collection jacksonhewitt --workers 5 --screenshot fullPage \
	--windowSize 1400,900

I can see in the console that screenshots are being captured and written to archive/screenshots.warc.gz:

"context":"general","message":"Screenshot (type: fullPage) for https://www.jacksonhewitt.com/tax-help/irs/irs-audits-notices/irs-notice-cp42/ written to /crawls/collections/jacksonhewitt/archive/screenshots.warc.gz"

Using replayweb.page, I confirmed that the WARC file contains the screenshots I'm after. For the life of me, however, I cannot figure out how to pull the captured images out of the WARC file into a zipped directory of images.

ghost avatar Mar 12 '23 17:03 ghost

Hi @thegrif, there are a few tools for inspecting WARC files and extracting content from them - I was just looking at https://nlnwa.github.io/warchaeology/ by the Norwegian National Library and believe the warc cat command could do what you're looking for. I haven't tested it in a while but I think 7-zip might also be able to extract content from WARCs.

Other tools to look at/try might be:

  • https://github.com/chfoo/warcat
  • https://github.com/recrm/ArchiveTools

Hope that helps!

tw4l avatar Mar 31 '23 18:03 tw4l

@thegrif now that I'm testing this myself, I see your difficulty! Even with the tools I linked above, it seems that resource records like the screenshots aren't necessarily exported. I'm going to continue investigating this and will report back on the simplest method(s) I find for pulling the screenshots out of the warc.
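In case it helps anyone landing here in the meantime, below is a rough, standard-library-only Python sketch of the idea: walk the records in screenshots.warc.gz and write out the body of every `resource` record whose Content-Type is an image. It assumes the screenshots are stored as plain resource records (no HTTP headers inside the record body) with an image/png or image/jpeg Content-Type; I haven't checked this against every crawler version. The function names, the `EXTENSIONS` mapping, and the output file naming are all made up for illustration, and the parser ignores folded header lines, so treat it as a starting point rather than a robust WARC reader.

```python
import gzip
from pathlib import Path

# Assumed mapping from MIME type to file extension; extend as needed.
EXTENSIONS = {"image/png": ".png", "image/jpeg": ".jpg"}


def iter_warc_records(stream):
    """Yield (headers, body) for each record in an uncompressed WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.strip():
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"unexpected line at record start: {line!r}")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # blank line ends the header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        # The record body is exactly Content-Length bytes long.
        body = stream.read(int(headers["content-length"]))
        yield headers, body


def extract_images(warc_path, out_dir):
    """Write every image resource record to out_dir; return (path, uri) pairs."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    # gzip.open reads concatenated gzip members (one per record) transparently.
    with gzip.open(warc_path, "rb") as stream:
        for headers, body in iter_warc_records(stream):
            mime = headers.get("content-type", "").split(";")[0].strip().lower()
            if headers.get("warc-type") != "resource" or mime not in EXTENSIONS:
                continue
            target = out_dir / f"screenshot-{len(written):04d}{EXTENSIONS[mime]}"
            target.write_bytes(body)
            written.append((target, headers.get("warc-target-uri", "")))
    return written
```

Calling something like `extract_images("screenshots.warc.gz", "screenshots")` should leave a directory of numbered image files, and the returned list pairs each file with its WARC-Target-URI so you can map images back to pages. Zipping the directory afterwards is a separate step.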

tw4l avatar Apr 11 '23 15:04 tw4l

FWIW, this seems out of scope for the crawler. We have discussed improvements to ReplayWeb.page, which could include filtering and downloading resources of a particular type. This would probably be the most friendly solution, short of building a separate tool.

ikreymer avatar Jun 15 '24 19:06 ikreymer

I've tested the new https://github.com/chfoo/warcat-rs by @chfoo. It does the job perfectly with the extract command. The extracted files are missing a file extension, but adding that is easier than opening the screenshot WARC in the browser version of ReplayWeb.page and copying the images to an image editor.
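For the missing extensions, a small script can do the renaming by sniffing the first few bytes of each file. This is only a sketch: it assumes the extracted files all sit directly in one directory and that the screenshots are PNG or JPEG. The `SIGNATURES` table and the function name are my own, not part of warcat-rs.

```python
from pathlib import Path

# Leading "magic" bytes for the image formats a screenshot is likely to use.
SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": ".png",
    b"\xff\xd8\xff": ".jpg",
}


def add_image_extensions(directory):
    """Append .png/.jpg to files in directory based on their magic bytes."""
    renamed = []
    # sorted() snapshots the listing, so renaming while looping is safe.
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue
        with path.open("rb") as handle:
            head = handle.read(8)
        for magic, ext in SIGNATURES.items():
            if head.startswith(magic):
                # Skip files that already carry the right extension.
                if not path.name.lower().endswith(ext):
                    new_path = path.with_name(path.name + ext)
                    path.rename(new_path)
                    renamed.append(new_path)
                break
    return renamed
```

Files that match neither signature are left untouched, so it should be safe to run over a directory that also contains non-image records.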

robert-1043 avatar Oct 11 '24 17:10 robert-1043