
External links not updated on changing "string"

TrueBrain opened this issue 4 years ago

I might just be abusing wikitextparser at this point, but I find the behaviour of updating string a bit weird sometimes, so let me explain what I am doing, and you may correct me at any point and say: no, that is not what this library can do for you :)

First, I parse the mediawiki file with wikitextparser.parse(). Next, I iterate over all the templates and recursively replace each of them with the contents of its own mediawiki file. I update template.string with the result.

Now this works amazingly well (much better than I could have hoped for, tbh), except for a few edge-cases.

Mostly, external_links are not updated. Example code:

import wikitextparser

wtp = wikitextparser.parse("{{test}}")
wtp.templates[0].string = "[https://link]"
print(wtp.external_links)

This outputs an empty list. Doing the same with a WikiLink works fine, btw. Looking at the code, ExternalLink is not backed by a span and is not reloaded on a string change. Not sure how/if this is fixable. For now, I simply run parse() on wtp.string after I have replaced all the templates.
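For completeness, the workaround looks roughly like this (just a sketch of the re-parse step I mean, continuing the example above):

import wikitextparser

wtp = wikitextparser.parse("{{test}}")
wtp.templates[0].string = "[https://link]"
print(wtp.external_links)  # [] -- the new link is not picked up

# Workaround: re-parse the resulting text so all spans are rebuilt.
wtp = wikitextparser.parse(wtp.string)
print(wtp.external_links)  # now contains one ExternalLink for [https://link]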

The other edge-cases are that if I process things in the wrong order, I tend to hit DeadIndexError, and that I have to iterate over all items with reversed(), as otherwise strange things also happen. I don't have showcases for these, as they were mostly solved by getting the order right :D

TrueBrain avatar Oct 15 '20 13:10 TrueBrain

Interesting! Are you trying to convert wikitext back to HTML? Because if that's the case, it would be a lot easier if you could somehow use MediaWiki's own parser directly.

To be honest, wikitextparser was written mostly with small changes in mind: things like updating template parameters, updating wikilinks, fixing external links, etc. Your use-case sounds a lot more complex.

When you overwrite the string of a node, all information in that node is lost. Every object that was pointing into the overwritten portion becomes invalid:

>>> w = parse('{{a|{{b}}}}')
>>> a, b = w.templates
>>> a.string = 'X'
>>> w
WikiText('X')
>>> b
Template('')
>>> b.string = 'Y'
Traceback (most recent call last):
[...]
\wikitextparser\_wikitext.py", line 116, in __add__
    raise DeadIndexError(
wikitextparser._wikitext.DeadIndexError: this usually means that the object has died (overwritten or deleted) and cannot be mutated

As you can see, this results in a DeadIndexError. Since the templates are sorted by index, iterating over them in reverse prevents this error, because overwriting a later template never invalidates an earlier one.
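To illustrate, here is a sketch of the kind of replacement loop described in the opening comment (expand_template is a hypothetical helper that produces the replacement wikitext for one template and is not part of wikitextparser; only the reversed() iteration and the final re-parse matter here):

import wikitextparser

def expand_template(template):
    # Hypothetical helper: look up the template's own page and return its
    # wikitext with the arguments filled in. Not part of wikitextparser.
    raise NotImplementedError

def expand_all(text):
    wtp = wikitextparser.parse(text)
    # Templates are sorted by index, so iterating in reverse means that
    # overwriting a later template never invalidates an earlier one.
    for template in reversed(wtp.templates):
        template.string = expand_template(template)
    # Re-parse so external_links (and other spans) reflect the new text.
    return wikitextparser.parse(wtp.string)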


Regarding the main issue (external links not being updated), I think I can fix it. I'll have a more in-depth look later.

5j9 avatar Oct 16 '20 13:10 5j9

Awesome, thanks for the info, that really helps me understand it better :)

And yes, I am converting wikitext to HTML. And your library does the job very well, I have to say. Like, really well. Currently, except for some misplaced <p>s compared to the MediaWiki version we run, the output is identical :)

The main reason we are doing this, and not using the MediaWiki parser, is both performance and resources. We currently use MediaWiki, but it was built in a time when different aspects were considered important for webapps. As a result, running it on, for example, AWS ECS is a very expensive (in terms of money) undertaking. And don't get me started on how to configure MediaWiki to use, for example, OAuth2.

For the last few weeks we have researched whether we can switch to, for example, a git-based wiki like gollum (which GitHub also uses for its wikis). But its wikitext support is .. well .. not ideal. Templates are not supported, for example. We managed to patch that in, but the performance and resource requirements were insane: 5+ seconds to render a page was not uncommon, and it took more than 1 GB of RAM. Gollum uses wikicloth to render wikitext, both written in Ruby, a language I do not know / want to know. I considered glueing just the MediaWiki parser in there, but starting a PHP process is expensive and resource-intensive.

So that brings us to my use-case: I was wondering how difficult it would be to do this in Python. As it turns out: about 4 hours to make a nice proof-of-concept, with your library as the wikitext parser. Now, a few hours further into the project, it can render all the pages OpenTTD has on its wiki, and most of them are very close to the original. And performance / resource-wise? Well, the slowest page takes ~70ms, but most render in 30ms. And memory usage is around 30 MB. This is a HUGE difference :P

So yeah, now that I know you meant your library for making small changes, I can understand why it took me a bit of tinkering to get it to work, but boy ... it does perform :D Now I can scale this up in a decent way to run on AWS ECS with little effort .. and that makes me really happy :D

If you can look into the external link issue, that would be lovely, but honestly, knowing what you just explained, I am perfectly fine re-parsing after the template replacement. Our use-case is not what your library was made for, so let's not overcomplicate it too much :D The performance is still amazing, and I haven't even applied any form of caching .. so yeah, that won't be an issue :)

TrueBrain avatar Oct 16 '20 18:10 TrueBrain

To add to this story (not related to the ticket, but I just wanted to say it):

Thank you so much for this library!

You might find it cool to know that it can parse all 5702 pages on https://wiki.openttd.org/ , including wikitext pages written over 10 years ago by people who really did not know how to write wikitext :) It is one of the dirtiest test-sets I have ever worked with, but I am really happy your library managed to get through it all :D So again: thank you!

TrueBrain avatar Oct 22 '20 18:10 TrueBrain