mwoffliner
Improper scraping of redlink in Basque wikipedia
On the Basque Wikipedia (wikipedia_eu_all_novid_2018-10), some red internal links are not properly parsed, see e.g. that on Kreischa:
I checked and there never was a typo in the first place. Interestingly, the same redlink appears as normal text (i.e. no problem) in the same article's infobox.
I found a similar instance in the Rödinghausen article, whereby the article points to [[Herford (barrutia)|Herford]] and writes the whole wikicode as is. I guess then this has to do with brackets, though I did not see the same happen with for example the French zim (2018_10 as well).
I will retry that one
@ISNIT0 Problem is still there. The source code is a bit special here, as it uses Wikidata, see https://eu.wikipedia.org/w/index.php?title=Kreischa&action=edit. But the mobile view works fine, so it should work for us too.
What is the desired behaviour here? For the text to be red, or for the [[ to be gone? Or both?
@ISNIT0 Well, as for any other wiki with redlinks, the square brackets [[ should be gone, and the text before the pipe | as well. We should read:
Saxonia estratuankokatuta dago, Sächsische Schweiz-Pserzgebirge barrutian.
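To make the expected output concrete, here is a minimal sketch of that transformation applied to leftover wikilink markup in rendered text. It is not mwoffliner code: the function name `strip_leftover_links` and the regular expression are assumptions made for illustration, and the sketch assumes links are not nested.

```python
import re

# Matches leftover wikilink markup such as [[Target|Label]] or [[Target]].
# Assumption: no nested square brackets inside the link.
LEFTOVER_LINK = re.compile(r"\[\[([^\[\]|]*)(?:\|([^\[\]]*))?\]\]")


def strip_leftover_links(text):
    """Drop the brackets and the target before the pipe, keeping only the label.

    Hypothetical helper: [[target|label]] becomes label, [[target]] becomes target.
    """
    def keep_label(match):
        label = match.group(2)
        return label if label is not None else match.group(1)

    return LEFTOVER_LINK.sub(keep_label, text)


print(strip_leftover_links("[[Herford (barrutia)|Herford]]"))  # prints: Herford
```

Text without any double brackets passes through unchanged, so running this over an entire paragraph should be harmless under the stated assumption.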
Okay, so the text should also be red, right?
On the zim file? No, the normal behaviour is that links that lead nowhere (i.e. not to another article) are not shown. Why should we change?
Ok, I looked at the source code and boy this is a mess. Good luck.
Looks like a bug on MCS/Parsoid, I've created a Phabricator task: https://phabricator.wikimedia.org/T226523
Response from ssastry:
At this point, I am tempted to say, this is pretty much a won't support scenario. Ideally, we would detect this wikitext pattern and flag it for wikis to fix their wikitext so that code can be supported unless of course this usage is very common practice. But, given that it has taken these many years for someone to notice this breakage indicates that this is likely not very common. If we want to proceed down the path of independent parsing futures, [[ {{template}} ]] will only parse as a link in the case where {{template}} yields a valid link (which we currently support). But if the template yields pieces of syntax that have to be combined with other syntax from the top-level page that then happens to resemble a wikilink, then that is not something we want to encourage and support going forward.
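The "detect this wikitext pattern and flag it" idea could be approximated with a rough heuristic like the one below. This is only a sketch, not Parsoid's logic: `find_links_with_templates` and the regular expression are hypothetical, nested templates are not handled, and the heuristic cannot tell the supported case (the template yields a complete, valid link) from the unsupported one, so every hit is just a candidate for manual review.

```python
import re

# Hypothetical heuristic: a [[ ... ]] span whose body contains a {{template}} call.
# Assumptions: no nested brackets in the link, no nested templates.
LINK_WITH_TEMPLATE = re.compile(r"\[\[[^\[\]]*\{\{[^{}]*\}\}[^\[\]]*\]\]")


def find_links_with_templates(wikitext):
    """Return every wikilink-like span that embeds a template call."""
    return [match.group(0) for match in LINK_WITH_TEMPLATE.finditer(wikitext)]


# "{{some_template}}" is a made-up name, not taken from the Basque Wikipedia source.
sample = "Foo [[{{some_template}}|Label]] and [[Plain link]]"
for candidate in find_links_with_templates(sample):
    print(candidate)  # prints: [[{{some_template}}|Label]]
```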
There's not much more we can do from our end (except a custom wikitext parser/post-processor which is a bit out of scope I think).
Can we just fix the wikitext?
@kelson42 Removing milestone as this is now out of our hands
This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.
Still buggy
Bug is still there, look at https://library.kiwix.org/wikipedia_eu_all_maxi/A/Kreischa and https://library.kiwix.org/wikipedia_eu_all_maxi/A/R%C3%B6dinghausen