
Best way to grab this page?

Open sardaukar opened this issue 6 years ago • 6 comments

I am trying to grab https://www.versionmuseum.com/history-of/classic-mac-os and having difficulty understanding grab-site. If I just do a normal grab with the URL, it starts crawling half the internet. If I do --no-offsite-links it still grabs everything on the website. With --1 it's quick, but the images on the page only work if clicked on, and are broken otherwise. With --level=1 I get a 206MB WARC file.

What's the best way to capture this page, and only this page (none of the Amazon or Word history links on the sidebar) and still have working thumbnails?

sardaukar avatar Jul 20 '19 15:07 sardaukar

Hey sardaukar, I run versionmuseum.com. What are you trying to do?

cherry-vanilla avatar Jul 21 '19 03:07 cherry-vanilla

Trying to save that page about MacOS history locally. I get scared about interesting websites going down and wanted to preserve this one in my WARC collection. Isn't that what this project is for?

sardaukar avatar Jul 21 '19 08:07 sardaukar

Gotcha. Have no fear, we just re-launched the site a couple months ago -- it's gonna be around for a long time. We put the Mac OS page up only like a month ago. Hope you enjoyed the trip down memory lane!

cherry-vanilla avatar Jul 22 '19 00:07 cherry-vanilla

Yeah, but my point was on how to use grab-site to capture the page. Maybe this website won't go away anytime soon, but others might and they have pictures and I just wanted to know how to best use it.

sardaukar avatar Jul 22 '19 06:07 sardaukar

So there's no easy way to use grab-site to archive this page, then? Just so I can close the issue.

sardaukar avatar Aug 10 '19 12:08 sardaukar

Did you end up resolving this issue? If not, large WARC files are typical, even on seemingly small websites. I've had to stop many a crawl because my internet is too slow to upload more than 150 MB to the Internet Archive and I don't have good permanent storage.

You can try manually blocking the pages in question: [image]
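As a rough illustration of what manual blocking could look like: grab-site ignores are Python regular expressions, one per line, so candidate patterns can be checked against a few URLs before starting a crawl. The sketch below is not taken from this thread. The two patterns and every URL except the classic-mac-os page are hypothetical, and it assumes a pattern counts as a match when it is found anywhere in the URL.

```python
import re

# Hypothetical ignore patterns (assumptions, not from the thread):
# 1. anything on amazon.com or one of its subdomains
# 2. any versionmuseum.com "history-of" page other than classic-mac-os
IGNORE_PATTERNS = [
    r"^https?://([^/]+\.)?amazon\.com/",
    r"^https?://www\.versionmuseum\.com/history-of/(?!classic-mac-os)",
]


def is_ignored(url, patterns=IGNORE_PATTERNS):
    """Return True if any ignore pattern is found in the URL."""
    return any(re.search(pattern, url) for pattern in patterns)


# Sample URLs; all but the first are made up for illustration.
candidates = [
    "https://www.versionmuseum.com/history-of/classic-mac-os",
    "https://www.versionmuseum.com/history-of/microsoft-word",
    "https://www.amazon.com/dp/EXAMPLE",
    "https://www.versionmuseum.com/images/example-thumbnail.png",
]

# URLs that would still be crawled under these patterns.
kept = [url for url in candidates if not is_ignored(url)]

for url in kept:
    print(url)
```

If the patterns behave as intended, they could be saved one per line to a file and passed with --import-ignores, or added to the ignores file in the crawl directory while the crawl is running. Whether this keeps the thumbnails working depends on where the site actually hosts its images, which this sketch does not check.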

TheTechRobo avatar May 28 '21 17:05 TheTechRobo