
MediaWiki skins and resources not archived

Open • emijrp opened this issue 10 years ago • 12 comments

From [email protected] on January 22, 2014 16:36:47

When we archive wikis, it would be very nice to also try to archive the skin (CSS, images, etc.?) in a separate folder.

Original issue: http://code.google.com/p/wikiteam/issues/detail?id=82

emijrp avatar Jun 25 '14 10:06 emijrp

From [email protected] on January 31, 2014 07:26:07

Maybe MatmaRex has suggestions on how to do this.

Cc: matma.rex

emijrp avatar Jun 25 '14 10:06 emijrp

From [email protected] on January 31, 2014 12:01:34

Following a chat with him, I think we're just going to do something like this right after saving index.html:

    wget --page-requisites -e "robots=off" --no-directories --directory-prefix=requisites "http://wiki.xkcd.com/wgh/index.php?debug=true"

The example produces something like this:

    Total wall clock time: 20s
    Downloaded: 53 files, 375K in 1,6s (242 KB/s)

    $ ls requisites/
    15px-800px-Flag_of_Sweden.png                     Checker-16x16.png?2013-11-23T23:33:20Z
    15px-Flag_of_Chile.png                            discussionitem_icon.gif?2013-11-23T23:33:20Z
    15px-Flag_of_France.png                           document.png?2013-11-23T23:33:20Z
    15px-Flag_of_Germany.png                          external-ltr.png?2013-11-23T23:33:20Z
    15px-Flag_of_Mexico.png                           feed-icon.png?2013-11-23T23:33:20Z
    15px-Flag_of_Spain.png                            file_icon.gif?2013-11-23T23:33:20Z
    173px-1-26-2014_Humbucker_Hashpoint.jpg           headbg.jpg?2013-11-23T23:33:20Z
    173px-20140118_Cles_006.jpg                       help-question.gif?2013-11-23T23:33:20Z
    173px-2014-01-20_47_8_locked_twice.jpg            help-question-hover.gif?2013-11-23T23:33:20Z
    173px-2014-01-21_-35_149_14.42.10.jpg             Holidaylogo.png
    173px-2014-01-21_42_-85_3.jpg                     index.php?debug=true
    173px-2014-01-22_43_-116_train.jpg                load.php?debug=true&lang=en&modules=mediawiki.legacy.commonPrint&only=styles&skin=monobook&*
    173px-2014-01-23_43_-116_geohasher.jpg            load.php?debug=true&lang=en&modules=mediawiki.legacy.shared&only=styles&skin=monobook&*
    173px-2014-01-25_16.35.27.jpg                     load.php?debug=true&lang=en&modules=site&only=scripts&skin=monobook&*
    174px-2014-01-18_52_13_GeorgDerReisende_5370.jpg  load.php?debug=true&lang=en&modules=site&only=styles&skin=monobook&*
    174px-2014-01-19_52_13_GeorgDerReisende_5524.jpg  load.php?debug=true&lang=en&modules=skins.monobook&only=styles&skin=monobook&*
    175px-2014-01-19_43_-121_grins.jpg                load.php?debug=true&lang=en&modules=startup&only=scripts&skin=monobook&*
    175px-2014-01-20_34_-118_17-25-38-320.jpg         lock_icon.gif?2013-11-23T23:33:20Z
    175px-2014-01-20_44_-122_grins.jpg                magnify-clip.png
    175px-2014-01-20_45_-122.JPG                      mail_icon.gif?2013-11-23T23:33:20Z
    175px-2014-01-26_50_8_hashgrin.jpg                news_icon.png?2013-11-23T23:33:20Z
    175px-2014-01-28_50_8_hashgrin.jpg                poweredby_mediawiki_88x31.png
    180px-2009-04-25_49_-123.grouppose.JPG            spinner.gif?2013-11-23T23:33:20Z
    400px-Coordinates.png                             tipsy-arrow.gif?2013-11-23T23:33:20Z
    ajax-loader.gif?2013-11-23T23:33:20Z              user.gif?2013-11-23T23:33:20Z
    audio.png?2013-11-23T23:33:20Z                    video.png?2013-11-23T23:33:20Z
    bullet.gif?2013-11-23T23:33:20Z
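
A minimal sketch of how this could be wired into the dumper right after index.html is saved. The function name and the URL argument are placeholders, not actual dumpgenerator.py code:

    import subprocess

    def save_page_requisites(index_url, prefix='requisites'):
        # Sketch only: index_url is assumed to be the saved page's URL,
        # e.g. 'http://wiki.xkcd.com/wgh/index.php?debug=true'.
        subprocess.call([
            'wget',
            '--page-requisites',      # also fetch the CSS/JS/images the page needs
            '-e', 'robots=off',
            '--no-directories',       # flat layout inside the prefix directory
            '--directory-prefix=' + prefix,
            index_url,
        ])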

emijrp avatar Jun 25 '14 10:06 emijrp

From [email protected] on January 31, 2014 12:28:34

Note: I'd however delete the index.php* file, both to keep only index.html in the main directory and because otherwise we'd have to redact the IP there too, as we do with index.html.
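
A hedged sketch of that cleanup step (the 'requisites' directory name follows the wget call above):

    import glob
    import os

    # Drop the duplicate copy of the page itself: we keep only the
    # IP-redacted index.html in the main directory.
    for path in glob.glob(os.path.join('requisites', 'index.php*')):
        os.remove(path)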

Summary: MediaWiki skins and resources not archived (was: MediaWiki Skins not archived)
Labels: -Type-Defect Type-Enhancement

emijrp avatar Jun 25 '14 10:06 emijrp

From [email protected] on February 13, 2014 23:31:35

It's not always that easy; e.g. https://bugzilla.wikimedia.org/show_bug.cgi?id=61249 suggests:

    wget -e robots=off --page-requisites --convert-links --adjust-extension --span-hosts --domains meta.wikimedia.org,bits.wikimedia.org meta.wikimedia.org
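
When resources span several hosts like that, the domain list has to be assembled per wiki. A sketch under stated assumptions: the function name is invented, the URL is assumed to include its scheme, and bits.wikimedia.org covers only the Wikimedia case (real code would discover the resource hosts, e.g. by parsing the saved HTML):

    import subprocess
    from urllib.parse import urlparse

    def save_requisites_spanning(url, extra_hosts=('bits.wikimedia.org',)):
        # url must include the scheme, e.g. 'https://meta.wikimedia.org/',
        # or urlparse() will not find the hostname.
        domains = [urlparse(url).hostname] + list(extra_hosts)
        subprocess.call([
            'wget', '-e', 'robots=off',
            '--page-requisites', '--convert-links', '--adjust-extension',
            '--span-hosts', '--domains=' + ','.join(domains),
            url,
        ])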

emijrp avatar Jun 25 '14 10:06 emijrp

There are a few options here. The one mentioned above (with --convert-links):

    wget --page-requisites -e "robots=off" --convert-links --no-directories --directory-prefix=requisites

This does not work when resources are being loaded from many domains, but should work on the majority of wikis.

This should work for all wikis, but might get more than we want (including ads):

    wget -e robots=off --page-requisites --convert-links -H --no-directories --directory-prefix=requisites

It loads from any domain (-H).

If we only want to get styles/scripts, things loaded from PHP URLs, and the HTML page:

    wget -e robots=off --page-requisites --convert-links --accept=css,js,html,php,php5 -H --no-directories --directory-prefix=requisites

Any of these can have --adjust-extension added to change the .php to .html, etc.
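
For reference, a sketch of how the three variants differ, as a single argument-list builder; the option names here are invented for illustration:

    def requisite_wget_args(url, option='same-host'):
        base = ['wget', '-e', 'robots=off', '--page-requisites',
                '--convert-links', '--no-directories',
                '--directory-prefix=requisites']
        if option == 'any-host':        # option 2: span all domains
            base += ['-H']
        elif option == 'scripts-only':  # option 3: styles/scripts/HTML only
            base += ['-H', '--accept=css,js,html,php,php5']
        # option 1 ('same-host') needs no extra flags
        return base + [url]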

PiRSquared17 avatar Sep 14 '14 17:09 PiRSquared17

Pi R. Squared:

There are a few options here.

The second is fine; I'm not too worried about ads. I don't like the hardcoded extensions whitelist that much: it's fine, e.g., to also get some icons in whatever format.

nemobis avatar Sep 14 '14 17:09 nemobis

Wget has issues with filename restrictions on different OSes. Although it would be possible to force it to always use (for example) Windows filenames, this means files with names over a certain size could not be downloaded nicely using the commands above. Even so, it may be better to use wget than to write another script to do basically the same thing.

PiRSquared17 avatar Sep 19 '14 23:09 PiRSquared17

Pi R. Squared, 20/09/2014 01:04:

Wget has issues with filename restrictions on different OSes. Although it would be possible to force it to always use (for example) Windows filenames, this means files with names over a certain size could not be downloaded nicely using the commands above.

We use wget for files and ignore Windows problems, too...

Even so, it may be better to use wget than to write another script to do basically the same thing.

The only concrete solution I can think of to make this work on Windows is to avoid storing the files in the file system at all. Perhaps we could save them straight into a tar or something, as sketched below. But there's a bug for that; can they be handled together?
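
A rough sketch of that idea, assuming we already have the list of requisite URLs; real code would also need error handling and saner archive member names:

    import io
    import tarfile
    from urllib.request import urlopen

    def save_into_tar(urls, tar_path='requisites.tar'):
        # Stream each resource straight into the tar, so local
        # filename restrictions never apply.
        with tarfile.open(tar_path, 'w') as tar:
            for url in urls:
                data = urlopen(url).read()
                member = tarfile.TarInfo(name=url.split('://', 1)[-1])
                member.size = len(data)
                tar.addfile(member, io.BytesIO(data))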

nemobis avatar Sep 20 '14 06:09 nemobis

Note, wget --restrict-file-names=windows exists.

       When "windows" is given, Wget escapes the characters \, |, /, :, ?, ", *, <, >, and the control characters in the ranges 0--31
       and 128--159.  In addition to this, Wget in Windows mode uses + instead of : to separate host and port in local file names, and
       uses @ instead of ? to separate the query portion of the file name from the rest.  Therefore, a URL that would be saved as
       www.xemacs.org:4300/search.pl?input=blah in Unix mode would be saved as www.xemacs.org+4300/search.pl@input=blah in Windows
       mode.  This mode is the default on Windows.

nemobis avatar Sep 26 '14 03:09 nemobis

Yes, but that doesn't truncate file names that are too long, like ResourceLoader requests with loads of modules. :(
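
One possible workaround is shortening over-long names ourselves before handing them to the file system. A sketch; the length limit is an assumed safe value, not a researched per-OS maximum:

    import hashlib

    def shorten(name, limit=150):
        # Keep short names intact; replace the tail of long ones with a
        # hash of the full original name so they stay unique.
        if len(name) <= limit:
            return name
        digest = hashlib.sha1(name.encode('utf-8')).hexdigest()[:10]
        return name[:limit - 11] + '.' + digest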

PiRSquared17 avatar Sep 26 '14 04:09 PiRSquared17

How about just saving them in a warc.gz?! That would work, right?

PiRSquared17 avatar Mar 02 '15 03:03 PiRSquared17

PiRSquared17, 02/03/2015 04:21:

How about just saving them in a warc.gz?! That would work, right?

We could do both. Our dumps are usually intended to allow reconstructing the wiki entirely, so this bug is about fetching some of the server-side files (mainly JavaScript). WARC is useful if one has WARCs of the rest of the wiki as well.
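
For the WARC half, wget can produce the archive itself while fetching the requisites. A hedged example, assuming a wget recent enough to have --warc-file (its output is gzip-compressed by default, yielding requisites.warc.gz; the URL is a placeholder):

    import subprocess

    subprocess.call([
        'wget', '-e', 'robots=off', '--page-requisites',
        '--warc-file=requisites',     # writes requisites.warc.gz alongside the files
        '--directory-prefix=requisites',
        'http://wiki.example.org/index.php?debug=true',  # placeholder URL
    ])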

nemobis avatar Mar 02 '15 07:03 nemobis