
feature request: Import MediaWiki XML dump

Open ghost opened this issue 13 years ago • 10 comments

Do you have any plugin or extension to import an XML dump from a wiki managed by the MediaWiki engine? It would be a nice feature to help migrate to your engine.

ghost avatar Oct 18 '12 14:10 ghost

Hi there!

What you're asking for is quite complicated, since MediaWiki has a very rich language with many extensions. Furthermore, there are also revisions, images and attachments to consider.

It could be done in a crappy way, i.e. just importing article slugs, titles and the text body with a simple conversion.

The result would be that manual rework would have to be done afterwards. And with regards to images... that would just be really complicated.

benjaoming avatar Oct 30 '12 13:10 benjaoming

Hi :)

While I realize that this is a complicated feature, it would be incredibly nice to have. Even if it just copied text and didn't include attachments (such as images), the part where you'd fetch all the pages and maintain the links between them would be very valuable. At least to me it would.

I'm looking for a replacement for my company's MediaWiki, and your project seems like a great candidate.

While I understand that this import feature is not part of your main concern with this project, I would certainly find it useful.

eldamir avatar Nov 19 '12 10:11 eldamir

I'm currently working on this project:

https://github.com/benjaoming/python-mwdump-tools

It might be of interest to you, as it gives you a pretty simple XmlParser from which you can extend handle_page, and maybe get the conversion done with some pypandoc?
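
Roughly, the glue could look something like this (untested sketch; the XmlParser class and handle_page hook are just how the repo is described above, so the import path and the exact method signature here are assumptions, while pypandoc.convert_text is the real pypandoc call):

# Untested sketch -- the import path and the handle_page signature are
# assumptions; check python-mwdump-tools for the real names.
import pypandoc
from mwdumptools import XmlParser  # assumed import path


class MarkdownDumpParser(XmlParser):
    def handle_page(self, title, text):
        # pandoc (via pypandoc) converts MediaWiki markup to GitHub-flavoured Markdown.
        markdown = pypandoc.convert_text(text, "gfm", format="mediawiki")
        # ...create the django-wiki Article / URLPath from title + markdown here...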

benjaoming avatar Jul 16 '13 21:07 benjaoming

Also check #275 :)

the-glu avatar Jun 23 '14 18:06 the-glu

Hey friends, I wanted to bump this a bit since the last update on this issue was 10 years ago (!)

I've got a MediaWiki wiki that I need to import into django-wiki. I don't care about revisions/history or images; I only want to import the current pages of the MediaWiki wiki. Is there a way to do that? I've seen in the project's history that there was a management command that could be used, but this management command was removed (!) instead of fixed?

Thank you

spapas avatar Feb 26 '24 09:02 spapas

@spapas you can still try to use some of the code in https://github.com/django-wiki/django-wiki/pull/275 for your own project (it doesn't live in django-wiki currently because it lacked tests and probably broke).

The quickest road to success is likely to make this work in your own project and do exactly the customizations that you need without worrying about universal use-cases.

benjaoming avatar Feb 26 '24 13:02 benjaoming

Hey friends, using the code in #275 as a basis, I implemented a simple management command that should import from a MediaWiki XML dump and works with the latest django-wiki version and the latest MediaWiki version. It needs lxml to parse the MediaWiki XML dump and unidecode to convert non-Latin characters to ASCII. It uses pandoc to do the actual mediawiki -> markdown conversion (I have tested it on Windows and it works great).

Put the following in your management commands folder and run it like python manage.py import_mediawiki dump.xml


from django.core.management.base import BaseCommand
from wiki.models.article import ArticleRevision, Article
from wiki.models.urlpath import URLPath
from django.contrib.sites.models import Site
from django.template.defaultfilters import slugify
import unidecode
from django.contrib.auth import get_user_model
import datetime
import pytz
from django.db import transaction
import subprocess
from lxml import etree


def slugify2(s):
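    # unidecode transliterates non-Latin characters to ASCII before slugifying.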
    return slugify(unidecode.unidecode(s))


def convert_to_markdown(text):
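    # Pipe the raw MediaWiki markup through pandoc, converting it to
    # GitHub-flavoured Markdown ("gfm").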
    proc = subprocess.Popen(
        ["pandoc", "-f", "mediawiki", "-t", "gfm"],
        stdout=subprocess.PIPE,
        stdin=subprocess.PIPE,
    )
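    # Write the whole page, close stdin, then read the converted output. This is
    # fine for typical page sizes; a huge page could in principle fill the pipe
    # buffers, in which case subprocess.run(..., input=...) is the safer pattern.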
    proc.stdin.write(text.encode("utf-8"))
    proc.stdin.close()
    return proc.stdout.read().decode("utf-8")


def create_article(title, text, timestamp, user):
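    # Strip MediaWiki behaviour switches that have no Markdown equivalent,
    # then convert the remaining markup and store it as the article's first revision.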
    text_ok = (
        text.replace("__NOEDITSECTION__", "")
        .replace("__NOTOC__", "")
        .replace("__TOC__", "")
    )

    text_ok = convert_to_markdown(text_ok)

    article = Article()
    article_revision = ArticleRevision()
    article_revision.content = text_ok
    article_revision.title = title
    article_revision.user = user
    article_revision.owner = user
    article_revision.created = timestamp
    article.add_revision(article_revision, save=True)
    article_revision.save()
    article.save()
    return article


def create_article_url(article, slug, current_site, url_root):

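    # Place the new article in the wiki URL tree, directly under the root path.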
    upath = URLPath.objects.create(
        site=current_site, parent=url_root, slug=slug, article=article
    )
    article.add_object_relation(upath)


def import_page(current_site, url_root, text, title, timestamp, replace_existing, user):
    slug = slugify2(title)

    try:
        urlp = URLPath.objects.get(slug=slug)

        if not replace_existing:
            print("\tAlready existing, skipping...")
            return

        print("\tDestorying old version of the article")
        urlp.article.delete()

    except URLPath.DoesNotExist:
        pass

    article = create_article(title, text, timestamp, user)
    create_article_url(article, slug, current_site, url_root)


class Command(BaseCommand):
    help = "Import everything from a MediaWiki XML dump file. Only the latest version of each page is imported."
    args = ""

    articles_worked_on = []
    articles_imported = []
    matching_old_link_new_link = {}

    def add_arguments(self, parser):
        parser.add_argument("file", type=str)

    @transaction.atomic()
    def handle(self, *args, **options):
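        # NOTE: hard-coded owner for the imported pages -- change the username
        # or set user = None (see the note below the command).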
        user = get_user_model().objects.get(username="spapas")
        current_site = Site.objects.get_current()
        url_root = URLPath.root()

        tree = etree.parse(options["file"])
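        # local-name() is used throughout so the XPath queries work regardless
        # of the XML namespace declared by the MediaWiki export.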
        pages = tree.xpath('//*[local-name()="page"]')
        for p in pages:
            title = p.xpath('*[local-name()="title"]')[0].text
            print(title)
            revision = p.xpath('*[local-name()="revision"]')[0]
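            # Take the last <text> node of the first <revision>; see the note
            # below about dumps that contain multiple revisions per page.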
            text = revision.xpath('*[local-name()="text"]')[-1].text
            timestamp = revision.xpath('*[local-name()="timestamp"]')[0].text
            timestamp = datetime.datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ")
            timestamp_with_timezone = pytz.utc.localize(timestamp)

            import_page(
                current_site,
                url_root,
                text,
                title,
                timestamp_with_timezone,
                True,
                user,
            )

Please notice that this tries to find a user named spapas to assign as the owner of the pages (you can leave that as None or add your own user). Also, I haven't tested whether it works fine when you've got multiple revisions of each page; it tries to pick the text of the latest one (text = revision.xpath('*[local-name()="text"]')[-1].text), but I'm not sure it will work properly. Better to be safe by including only the latest revision of each article in your MediaWiki dump. Also, you can pass True or False to import_page in order to replace or skip existing pages.
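
If your dump does keep multiple revisions per page, a safer variation (untested sketch, same lxml style as above, replacing the corresponding lines inside the loop) is to pick the revision with the newest timestamp instead of relying on document order. MediaWiki timestamps are ISO 8601 in UTC, so comparing them as strings picks the newest one.

            # Untested variation for dumps with full history: select the <revision>
            # whose <timestamp> is newest instead of taking the first one.
            revisions = p.xpath('*[local-name()="revision"]')
            latest = max(
                revisions,
                key=lambda r: r.xpath('*[local-name()="timestamp"]')[0].text,
            )
            text = latest.xpath('*[local-name()="text"]')[0].text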

spapas avatar Feb 29 '24 07:02 spapas

Thanks for sharing! I can only imagine that this is the perfect kind of boilerplate for someone to get started. Actually, it could fit very well in the documentation as a copy-paste example.

benjaoming avatar Feb 29 '24 08:02 benjaoming

I'll try to add a PR on the docs for that!

spapas avatar Feb 29 '24 11:02 spapas

@spapas it would fit well next to the How-To about Disqus comments: https://django-wiki.readthedocs.io/en/main/tips/index.html

(but the docs will be restructured, so don't worry too much about the location)

benjaoming avatar Feb 29 '24 11:02 benjaoming