
scholarpedia.org

Open · rht opened this issue · 11 comments

LICENSE: CC BY-NC-SA 3.0 [1]. Like the SEP, but for science; see e.g. http://www.scholarpedia.org/article/Faddeev-Popov_ghosts, written by Faddeev himself. There is an outdated archive at https://archive.org/details/wiki-scholarpediaorg_w.

[1] http://www.scholarpedia.org/article/Scholarpedia:Terms_of_Use#Scholarpedia.27s_Licenses_to_You.2C_and_Your_license_to_parties_other_than_Scholarpedia

rht avatar Oct 19 '15 15:10 rht

SGTM! We can do this once #20 is resolved.

davidar avatar Oct 20 '15 08:10 davidar

There is a newer version now available at https://archive.org/details/wiki-scholarpediaorg-20151102

vitzli avatar Nov 05 '15 15:11 vitzli

:pushpin: /ipfs/Qmaskk1Egq5zmZsGTd7dwNiiK1cwfmx7k1StG1WJQjwGDm

The articles are here.

@DataWraith Feel like converting these to HTML? :)

It's quite a bit smaller than Wikipedia, so it should hopefully be less problematic.

davidar avatar Nov 06 '15 12:11 davidar

Heh. Eventually I'd like to write a program that converts a MediaWiki dump to HTML (probably by running it through pandoc), but right now I'm fairly busy, sorry.

I could only do the Wikipedia dump because a third party provided a dump in the OpenZIM format, and an easy-to-use library was available for reading and converting that.

With a raw XML dump, I'd have to roll my own solution, which would take more time than I currently have.

DataWraith avatar Nov 06 '15 16:11 DataWraith

(@vitzli thanks for updating the archive in archive.org)

rht avatar Nov 07 '15 10:11 rht

@DataWraith No worries. I might have a go at getting it to render with https://github.com/davidar/markup.rocks

@vitzli didn't realise you were the one who pushed the updated copy - thanks :)

davidar avatar Nov 07 '15 11:11 davidar

I took another look at this, and wanted to share what I found, in case it is useful to the next person.

Extracting the article markup from the XML dump is actually pretty easy. But having the raw markup alone doesn't gain you much: simple articles can be rendered through pandoc, but more complicated elements (images, math, templates) tend to break things.
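For reference, the extraction step really is simple. A minimal sketch (the export-namespace version varies between dumps, so this strips namespaces rather than hard-coding one; the sample XML is illustrative, not from the actual dump):

```python
# Sketch: pull (title, wikitext) pairs out of a MediaWiki XML export.
import io
import xml.etree.ElementTree as ET

def iter_articles(source):
    """Yield (title, wikitext) for each <page> in a MediaWiki export."""
    title = None
    text = None
    for _event, elem in ET.iterparse(source, events=("end",)):
        tag = elem.tag.rsplit("}", 1)[-1]  # drop any {namespace} prefix
        if tag == "title":
            title = elem.text
        elif tag == "text":
            text = elem.text
        elif tag == "page":
            yield title, text or ""
            elem.clear()  # free memory; real dumps are large

if __name__ == "__main__":
    sample = io.BytesIO(
        b"<mediawiki xmlns='http://www.mediawiki.org/xml/export-0.10/'>"
        b"<page><title>Faddeev-Popov ghosts</title>"
        b"<revision><text>'''Ghost fields''' arise in gauge theory.</text></revision>"
        b"</page></mediawiki>"
    )
    for t, body in iter_articles(sample):
        print(t, "->", body[:20])
```

This only gets you the wikitext, of course; all the hard parts (templates, math, images) come after.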

I think our best bet is for someone to actually set up a MediaWiki instance, use MWDumper to load the dump, and then export to HTML with mwoffliner. From what I can tell, this is the workflow that was used to create the HTML content for the ZIM files I used to dump Wikipedia.

The entire process is pretty convoluted though (Database, MediaWiki, Redis, Node...), so I'm currently not willing to tackle it.

If I were to do it, I'd probably set everything up in Docker containers with Docker Compose, so that the process is repeatable and applicable to other wiki dumps.

Edit: Okay, so I couldn't resist fiddling around with this, despite my earlier words. It took much less time than I estimated, too, because I could draw on pre-made Docker images. The hard part (MWDumper) is yet to come, but I'm confident I'll have this figured out soonish, maybe even this weekend.

DataWraith avatar Nov 28 '15 11:11 DataWraith

sigh

This is much harder than it looked at first. I realize I'm flip-flopping on this a lot -- should've kept my mouth shut from the beginning. Anyway, this post is as much for venting as for information's sake, so feel free to ignore it.

I wanted the process of creating HTML dumps from XML dumps to be repeatable, so I set everything up in Docker containers. It turns out that the pre-made Docker images I could find for the necessary software are mostly outdated, so after running into version incompatibilities I had to build my own from scratch.

I managed to set up a local MediaWiki instance with a MySQL database and import the Scholarpedia dump using MWDumper, in an automated and repeatable fashion. Getting MediaWiki to render mathematical equations, however, took the better part of the weekend: TeX didn't work at all, no matter what I tried, so I had to switch to Mathoid, which meant getting yet another web service up and running. It's still not working to my satisfaction (it occasionally returns HTTP 400 Bad Request), and it doesn't help that the documentation on any of this is extremely sparse.

The entire process looks like this:

  1. Start MySQL and create the wiki database skeleton
  2. Run MWDumper to fill the database with the Scholarpedia articles
  3. Start the MediaWiki container
  4. Start the Mathoid container (for equation rendering)
  5. Start the Parsoid container (for HTML extraction)
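
The five steps above could be captured in a Compose file along these lines. This is only a sketch: the image names, service names, database credentials, and ports are assumptions, not the exact setup described here.

```yaml
# Illustrative sketch only -- images, names, and ports are assumptions.
services:
  db:
    image: mysql:5.6
    environment:
      MYSQL_DATABASE: wikidb         # step 1: wiki database skeleton
      MYSQL_ROOT_PASSWORD: example
  mediawiki:
    image: mediawiki                 # step 3: the wiki itself
    depends_on:
      - db
    ports:
      - "8080:80"
  mathoid:
    build: ./mathoid                 # step 4: equation rendering
  parsoid:
    build: ./parsoid                 # step 5: HTML extraction
    ports:
      - "8000:8000"
# Step 2 (MWDumper) runs once against `db`, along the lines of:
#   java -jar mwdumper.jar --format=sql:1.5 dump.xml | mysql -h db wikidb
```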

Remaining work

  • Images need to be imported.

    There is a PHP script included with MediaWiki that should do that. But I'm not expecting it to be easy.

  • The Main_Page has custom CSS templates that MediaWiki isn't parsing out of the box, displaying them verbatim instead.

  • Actually creating static HTML files

    As I mentioned in the previous entry, mwoffliner should be able to use Parsoid to extract HTML via the MediaWiki API. However, it looks non-trivial to set up. It should be possible to create Docker containers for it, but that will take a while yet, so don't hold your breath. :/

DataWraith avatar Nov 29 '15 17:11 DataWraith

(Sounds more doable, as in, less of a headache than LaTeX->HTML.) @DataWraith is the conversion using Parsoid lossless?

rht avatar Dec 01 '15 12:12 rht

Parsoid is intended to convert MediaWiki markup to HTML and back losslessly (they do 'round-trip testing'). I haven't noticed any mistakes in the conversion, but from what I gather from the limited documentation available, the conversion process isn't 100% perfect yet.

The fact that they need to be able to make round trips also bloats the generated HTML somewhat. The files use absolute links too, so the additional step of using mwoffliner is necessary to produce an IPFS-suitable folder of files. I'll try to get that working next weekend (so that I have something to show even if the equations don't work quite right yet), but given my over-optimism so far, I don't want to promise anything.

DataWraith avatar Dec 01 '15 18:12 DataWraith

Hrm, it's unfortunate that MediaWiki is such a beast.

I've also converted it to a GitHub Wiki (example). It's somewhat passable, but definitely not perfect.

davidar avatar Dec 02 '15 12:12 davidar