
MediaWiki skins and resources not archived

Open • emijrp opened this issue 10 years ago • 12 comments

From [email protected] on January 22, 2014 16:36:47

When we archive wikis, it would be very nice to also try to archive the skin (CSS, images, etc.?) in a separate folder.

Original issue: http://code.google.com/p/wikiteam/issues/detail?id=82

emijrp avatar Jun 25 '14 10:06 emijrp

From [email protected] on January 31, 2014 07:26:07

Maybe MatmaRex has suggestions on how to do this.

Cc: matma.rex

emijrp avatar Jun 25 '14 10:06 emijrp

From [email protected] on January 31, 2014 12:01:34

Following a chat with him, I think we're just going to do something like this right after saving index.html:

    wget --page-requisites -e "robots=off" --no-directories --directory-prefix=requisites "http://wiki.xkcd.com/wgh/index.php?debug=true"

The example produces something like this:

    Total wall clock time: 20s
    Downloaded: 53 files, 375K in 1,6s (242 KB/s)

    $ ls requisites/
    15px-800px-Flag_of_Sweden.png                     Checker-16x16.png?2013-11-23T23:33:20Z
    15px-Flag_of_Chile.png                            discussionitem_icon.gif?2013-11-23T23:33:20Z
    15px-Flag_of_France.png                           document.png?2013-11-23T23:33:20Z
    15px-Flag_of_Germany.png                          external-ltr.png?2013-11-23T23:33:20Z
    15px-Flag_of_Mexico.png                           feed-icon.png?2013-11-23T23:33:20Z
    15px-Flag_of_Spain.png                            file_icon.gif?2013-11-23T23:33:20Z
    173px-1-26-2014_Humbucker_Hashpoint.jpg           headbg.jpg?2013-11-23T23:33:20Z
    173px-20140118_Cles_006.jpg                       help-question.gif?2013-11-23T23:33:20Z
    173px-2014-01-20_47_8_locked_twice.jpg            help-question-hover.gif?2013-11-23T23:33:20Z
    173px-2014-01-21_-35_149_14.42.10.jpg             Holidaylogo.png
    173px-2014-01-21_42_-85_3.jpg                     index.php?debug=true
    173px-2014-01-22_43_-116_train.jpg                load.php?debug=true&lang=en&modules=mediawiki.legacy.commonPrint&only=styles&skin=monobook&*
    173px-2014-01-23_43_-116_geohasher.jpg            load.php?debug=true&lang=en&modules=mediawiki.legacy.shared&only=styles&skin=monobook&*
    173px-2014-01-25_16.35.27.jpg                     load.php?debug=true&lang=en&modules=site&only=scripts&skin=monobook&*
    174px-2014-01-18_52_13_GeorgDerReisende_5370.jpg  load.php?debug=true&lang=en&modules=site&only=styles&skin=monobook&*
    174px-2014-01-19_52_13_GeorgDerReisende_5524.jpg  load.php?debug=true&lang=en&modules=skins.monobook&only=styles&skin=monobook&*
    175px-2014-01-19_43_-121_grins.jpg                load.php?debug=true&lang=en&modules=startup&only=scripts&skin=monobook&*
    175px-2014-01-20_34_-118_17-25-38-320.jpg         lock_icon.gif?2013-11-23T23:33:20Z
    175px-2014-01-20_44_-122_grins.jpg                magnify-clip.png
    175px-2014-01-20_45_-122.JPG                      mail_icon.gif?2013-11-23T23:33:20Z
    175px-2014-01-26_50_8_hashgrin.jpg                news_icon.png?2013-11-23T23:33:20Z
    175px-2014-01-28_50_8_hashgrin.jpg                poweredby_mediawiki_88x31.png
    180px-2009-04-25_49_-123.grouppose.JPG            spinner.gif?2013-11-23T23:33:20Z
    400px-Coordinates.png                             tipsy-arrow.gif?2013-11-23T23:33:20Z
    ajax-loader.gif?2013-11-23T23:33:20Z              user.gif?2013-11-23T23:33:20Z
    audio.png?2013-11-23T23:33:20Z                    video.png?2013-11-23T23:33:20Z
    bullet.gif?2013-11-23T23:33:20Z
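
A minimal sketch of how this could be wired into the dumper right after index.html is saved. The function name and the URL argument are placeholders, not actual dumpgenerator.py code:

    import subprocess

    def save_page_requisites(index_url, prefix='requisites'):
        # Sketch only: index_url is assumed to be the saved page's URL,
        # e.g. 'http://wiki.xkcd.com/wgh/index.php?debug=true'.
        subprocess.call([
            'wget',
            '--page-requisites',      # also fetch the CSS/JS/images the page needs
            '-e', 'robots=off',
            '--no-directories',       # flat layout inside the prefix directory
            '--directory-prefix=' + prefix,
            index_url,
        ])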

emijrp avatar Jun 25 '14 10:06 emijrp

From [email protected] on January 31, 2014 12:28:34

Note: I'd however delete the index.php* file, both to keep only index.html in the main directory and because otherwise we'd have to redact the IP there too, as we do with index.html.
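
A hedged sketch of that cleanup step (the 'requisites' directory name follows the wget call above):

    import glob
    import os

    # Drop the duplicate copy of the page itself: we keep only the
    # IP-redacted index.html in the main directory.
    for path in glob.glob(os.path.join('requisites', 'index.php*')):
        os.remove(path)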

Summary: MediaWiki skins and resources not archived (was: MediaWiki Skins not archived)
Labels: -Type-Defect Type-Enhancement

emijrp avatar Jun 25 '14 10:06 emijrp

From [email protected] on February 13, 2014 23:31:35

It's not always that easy; e.g. https://bugzilla.wikimedia.org/show_bug.cgi?id=61249 suggests:

    wget -e robots=off --page-requisites --convert-links --adjust-extension --span-hosts --domains meta.wikimedia.org,bits.wikimedia.org meta.wikimedia.org
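
When resources span several hosts like that, the domain list has to be assembled per wiki. A sketch under stated assumptions: the function name is invented, the URL is assumed to include its scheme, and bits.wikimedia.org covers only the Wikimedia case (real code would discover the resource hosts, e.g. by parsing the saved HTML):

    import subprocess
    from urllib.parse import urlparse

    def save_requisites_spanning(url, extra_hosts=('bits.wikimedia.org',)):
        # url must include the scheme, e.g. 'https://meta.wikimedia.org/',
        # or urlparse() will not find the hostname.
        domains = [urlparse(url).hostname] + list(extra_hosts)
        subprocess.call([
            'wget', '-e', 'robots=off',
            '--page-requisites', '--convert-links', '--adjust-extension',
            '--span-hosts', '--domains=' + ','.join(domains),
            url,
        ])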

emijrp avatar Jun 25 '14 10:06 emijrp

There are a few options here. The one mentioned above (with --convert-links):

    wget --page-requisites -e "robots=off" --convert-links --no-directories --directory-prefix=requisites

This does not work when resources are being loaded from many domains, but should work on the majority of wikis.

This should work for all wikis, but might get more than we want (including ads):

    wget -e robots=off --page-requisites --convert-links -H --no-directories --directory-prefix=requisites

It loads from any domain (-H).

If we only want to get styles/scripts, things loaded from PHP URLs, and the HTML page:

    wget -e robots=off --page-requisites --convert-links --accept=css,js,html,php,php5 -H --no-directories --directory-prefix=requisites

Any of these can have --adjust-extension added to change the .php to .html, etc.
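
For reference, a sketch of how the three variants differ, as a single argument-list builder; the option names here are invented for illustration:

    def requisite_wget_args(url, option='same-host'):
        base = ['wget', '-e', 'robots=off', '--page-requisites',
                '--convert-links', '--no-directories',
                '--directory-prefix=requisites']
        if option == 'any-host':        # option 2: span all domains
            base += ['-H']
        elif option == 'scripts-only':  # option 3: styles/scripts/HTML only
            base += ['-H', '--accept=css,js,html,php,php5']
        # option 1 ('same-host') needs no extra flags
        return base + [url]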

PiRSquared17 avatar Sep 14 '14 17:09 PiRSquared17

Pi R. Squared:

There are a few options here.

The second is fine; I'm not too worried about ads. I don't like the hardcoded extensions whitelist that much: it's fine, e.g., to also get some icons in whatever format.

nemobis avatar Sep 14 '14 17:09 nemobis

Wget has issues with filename restrictions on different OSes. Although it would be possible to force it to always use (for example) Windows filenames, this means files with names over a certain size could not be downloaded nicely using the commands above. Even so, it may be better to use wget than to write another script to do basically the same thing.

PiRSquared17 avatar Sep 19 '14 23:09 PiRSquared17

Pi R. Squared, 20/09/2014 01:04:

Wget has issues with filename restrictions on different OSes. Although it would be possible to force it to always use (for example) Windows filenames, this means files with names over a certain size could not be downloaded nicely using the commands above.

We use wget for files and ignore Windows problems, too...

Even so, it may be better to use wget than to write another script to do basically the same thing.

The only concrete solution I can think of to make this work on Windows is to avoid storing the files in the file system at all. Perhaps we could save them straight into a tar or something, as sketched below. But there's a bug for that; can they be handled together?
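
A rough sketch of that idea, assuming we already have the list of requisite URLs; real code would also need error handling and saner archive member names:

    import io
    import tarfile
    from urllib.request import urlopen

    def save_into_tar(urls, tar_path='requisites.tar'):
        # Stream each resource straight into the tar, so local
        # filename restrictions never apply.
        with tarfile.open(tar_path, 'w') as tar:
            for url in urls:
                data = urlopen(url).read()
                member = tarfile.TarInfo(name=url.split('://', 1)[-1])
                member.size = len(data)
                tar.addfile(member, io.BytesIO(data))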

nemobis avatar Sep 20 '14 06:09 nemobis

Note, wget --restrict-file-names=windows exists.

       When "windows" is given, Wget escapes the characters \, |, /, :, ?, ", *, <, >, and the control characters in the ranges 0--31
       and 128--159.  In addition to this, Wget in Windows mode uses + instead of : to separate host and port in local file names, and
       uses @ instead of ? to separate the query portion of the file name from the rest.  Therefore, a URL that would be saved as
       www.xemacs.org:4300/search.pl?input=blah in Unix mode would be saved as www.xemacs.org+4300/search.pl@input=blah in Windows
       mode.  This mode is the default on Windows.

nemobis avatar Sep 26 '14 03:09 nemobis

Yes, but that doesn't truncate file names that are too long, like ResourceLoader requests with loads of modules. :(
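
One possible workaround is shortening over-long names ourselves before handing them to the file system. A sketch; the length limit is an assumed safe value, not a researched per-OS maximum:

    import hashlib

    def shorten(name, limit=150):
        # Keep short names intact; replace the tail of long ones with a
        # hash of the full original name so they stay unique.
        if len(name) <= limit:
            return name
        digest = hashlib.sha1(name.encode('utf-8')).hexdigest()[:10]
        return name[:limit - 11] + '.' + digest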

PiRSquared17 avatar Sep 26 '14 04:09 PiRSquared17

How about just saving them in a warc.gz?! That would work, right?

PiRSquared17 avatar Mar 02 '15 03:03 PiRSquared17

PiRSquared17, 02/03/2015 04:21:

How about just saving them in a warc.gz?! That would work, right?

We could do both. Our dumps are usually intended to allow reconstructing the wiki entirely, so this bug is about fetching some of the server-side files (mainly JavaScript). WARC is useful if one has WARCs of the rest of the wiki as well.
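
For the WARC half, wget can produce the archive itself while fetching the requisites. A hedged example, assuming a wget recent enough to have --warc-file (its output is gzip-compressed by default, yielding requisites.warc.gz; the URL is a placeholder):

    import subprocess

    subprocess.call([
        'wget', '-e', 'robots=off', '--page-requisites',
        '--warc-file=requisites',     # writes requisites.warc.gz alongside the files
        '--directory-prefix=requisites',
        'http://wiki.example.org/index.php?debug=true',  # placeholder URL
    ])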

nemobis avatar Mar 02 '15 07:03 nemobis