mwoffliner

Improper scraping of redlink in Basque wikipedia

Open Popolechien opened this issue 5 years ago • 15 comments

On the Basque Wikipedia (wikipedia_eu_all_novid_2018-10), some red internal links are not properly parsed; see e.g. the one in the Kreischa article (screenshot). I checked and there never was a typo in the first place. Interestingly, the same redlink appears as normal text (i.e. no problem) in the same article's infobox.

I found a similar instance in the Rödinghausen article, where the article points to [[Herford (barrutia)|Herford]] and the whole wikicode is rendered as is. I guess this has to do with the brackets, though I did not see the same happen with, for example, the French zim (2018_10 as well).

Popolechien avatar May 01 '19 08:05 Popolechien

I will retry that one

kelson42 avatar Jun 15 '19 10:06 kelson42

@ISNIT0 Problem is still there. The source code is a bit special here, as it uses Wikidata, see https://eu.wikipedia.org/w/index.php?title=Kreischa&action=edit. But the mobile view works fine, so it should work for us too.

kelson42 avatar Jun 16 '19 05:06 kelson42

What is the desired behaviour here? For the text to be red, or for the [[ to be gone? or both?

ISNIT0 avatar Jun 25 '19 15:06 ISNIT0

@ISNIT0 Well, as for any other wiki with redlinks, the square brackets [[ should be gone, and the text before the pipe | as well. We should read:

Saxonia estatuan kokatuta dago, Sächsische Schweiz-Osterzgebirge barrutian.
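
To make the expected result concrete, here is a minimal sketch of unwrapping literal wikilink markup that survived rendering, so that only the visible label remains. The function name `unwrapLiteralWikilinks` is hypothetical and is not part of mwoffliner; the sketch assumes simple, non-nested `[[Target|Label]]` markup and is not a general wikitext parser.

```typescript
// Hypothetical helper (not an mwoffliner API): replaces leftover literal
// wikilink markup with the text a reader is expected to see.
function unwrapLiteralWikilinks(text: string): string {
  // [[Target|Label]] -> Label, [[Target]] -> Target.
  // Nested brackets are deliberately not handled in this sketch.
  return text.replace(
    /\[\[([^\[\]|]+)(?:\|([^\[\]]*))?\]\]/g,
    (_match: string, target: string, label?: string) =>
      label !== undefined && label !== "" ? label : target,
  );
}

console.log(unwrapLiteralWikilinks("[[Herford (barrutia)|Herford]] barrutian."));
// prints "Herford barrutian."
```

When the label after the pipe is missing or empty, the sketch falls back to the link target.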

Popolechien avatar Jun 25 '19 15:06 Popolechien

Okay, so the text should also be red, right?

ISNIT0 avatar Jun 25 '19 15:06 ISNIT0

On the zim file? No, the normal behaviour is that links that lead nowhere (i.e. not to another article) are not shown as links. Why should we change that?

Popolechien avatar Jun 25 '19 15:06 Popolechien

Ok, I looked at the source code and boy this is a mess. Good luck.

Popolechien avatar Jun 25 '19 15:06 Popolechien

Looks like a bug in MCS/Parsoid. I've created a Phabricator task: https://phabricator.wikimedia.org/T226523

ISNIT0 avatar Jun 25 '19 15:06 ISNIT0

Response from ssastry:

At this point, I am tempted to say, this is pretty much a won't-support scenario. Ideally, we would detect this wikitext pattern and flag it for wikis to fix their wikitext so that the code can be supported, unless of course this usage is very common practice. But the fact that it has taken this many years for someone to notice this breakage indicates that this is likely not very common. If we want to proceed down the path of independent parsing futures, [[ {{template}} ]] will only parse as a link in the case where {{template}} yields a valid link (which we currently support). But if the template yields pieces of syntax that have to be combined with other syntax from the top-level page that then happens to resemble a wikilink, then that is not something we want to encourage and support going forward.

There's not much more we can do from our end (except a custom wikitext parser/post-processor which is a bit out of scope I think).

Can we just fix the wikitext?
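
If fixing the wikitext is the way forward, the affected pages first have to be found. The following is a rough sketch of the "detect this wikitext pattern and flag it" idea from the response above. It is a crude heuristic, not how Parsoid works: it flags any `[[ ... ]]` whose target part (before any pipe) contains a `{{...}}` transclusion, so a human can review it. The names `Suspect` and `findTemplatedLinkTargets` are hypothetical, and nested templates or nested links are not handled.

```typescript
// Hypothetical types and names, for illustration only.
interface Suspect {
  offset: number;  // position of the opening [[ in the wikitext
  snippet: string; // the matched [[ ... ]] span
}

// Flags wikilinks whose target part contains a template call.
function findTemplatedLinkTargets(wikitext: string): Suspect[] {
  // "[[", optional target text without pipes, a flat "{{...}}",
  // then anything without brackets up to the closing "]]".
  const pattern = /\[\[[^\[\]|{}]*\{\{[^{}]*\}\}[^\[\]]*\]\]/g;
  const suspects: Suspect[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(wikitext)) !== null) {
    suspects.push({ offset: match.index, snippet: match[0] });
  }
  return suspects;
}

const sample = "Intro [[ {{template}} ]] and [[Plain|link]].";
console.log(findTemplatedLinkTargets(sample));
```

On the sample string, only `[[ {{template}} ]]` is flagged; the ordinary `[[Plain|link]]` is left alone, as is a link that uses a template only in its label.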

ISNIT0 avatar Jun 26 '19 07:06 ISNIT0

@kelson42 Removing milestone as this is now out of our hands

ISNIT0 avatar Jul 01 '19 17:07 ISNIT0

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

stale[bot] avatar Oct 01 '19 07:10 stale[bot]

Still buggy

kelson42 avatar Jun 12 '22 15:06 kelson42

Bug is still there, look at https://library.kiwix.org/wikipedia_eu_all_maxi/A/Kreischa and https://library.kiwix.org/wikipedia_eu_all_maxi/A/R%C3%B6dinghausen

kelson42 avatar Jul 10 '22 08:07 kelson42
