
Scraping MediaWiki Sites Seems to be Broken

Open DavidBerdik opened this issue 3 years ago • 2 comments

I am trying to scrape a now-gone website that used MediaWiki.

http://web.archive.org/web/20161008144304/http://evllabs.com/jgaap/w/index.php/Main_Page

Unfortunately, the parameters that MediaWiki passes through the URL seem to be confusing this downloader, and the result of the scrape is basically unusable. Are there any known ways for dealing with this?

DavidBerdik avatar Apr 08 '22 02:04 DavidBerdik

Hey David,

> Are there any known ways for dealing with this?

I've used this program here (which is loosely based on wayback-machine-downloader); it was able to produce a WARC, and from there I used warc2zim to build a ZIM.
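For context, the WARC-to-ZIM step looks roughly like this. The exact flags are my recollection of warc2zim's CLI (check `warc2zim --help`), and the file names are placeholders, not the actual files from this crawl:

```shell
# Convert the WARC produced by the crawler into a ZIM archive.
# --name sets the ZIM's internal identifier; input/output names
# here are illustrative placeholders.
warc2zim evllabs_wiki.warc.gz --name evllabs_wiki --output .
```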

To use this ZIM, there's a binary of kiwix-serve here; you can run the following command to start a web server at http://127.0.0.1:8084 that serves the ZIM and so makes the wiki available locally:

./kiwix-tools_linux-x86_64-3.2.0-5/kiwix-serve -i 127.0.0.1 -p 8084 evllabs_wiki_2022-06.zim

A copy of that ZIM is available here (it takes around an hour to build).

wsdookadr avatar Jun 26 '22 02:06 wsdookadr

Hello @wsdookadr,

Thanks for getting back to me! I had never heard of the WARC or ZIM file formats before, but based on a quick Google search, they sound like exactly what I am looking for. I am currently downloading the file you shared, and I'm looking forward to trying it out!

DavidBerdik avatar Jun 28 '22 02:06 DavidBerdik