
The tokenizer incorrectly handles some difficult tag-related markup

Open earwig opened this issue 10 years ago • 14 comments

  1. Bold and italics that cross contexts are handled incorrectly, because the tree structure does not support overlapping nodes (for example, ''foo'''bar''baz''', or ''foo{{bar|baz''}}). Fixing this will probably be very difficult.
  2. Open tags that do not have a close tag before the parser reaches EOF are ignored, whereas some of them should be parsed (like bold and italics) and have some kind of "hidden close" flag set.
  3. MediaWiki counts the occurrences of ; in the block before any text and uses this as the maximum number of parsable :s after. The current implementation only allows one : regardless of how many ;s there are.
  4. MediaWiki prevents some tags from crossing certain contexts (italics and bold can't cross headings, for example) but this implementation has no such restriction.
  5. The parser only recognizes a space as the separator character between the URL and its link title in [ ] tags, but MediaWiki also accepts some other syntax (e.g. [http://example.com/''Example''] is valid).

1, 4, and 5 are high priority, whereas 2 is mid and 3 is low.
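As a rough illustration of item 3 only, the rule as described above (count the leading ;s, then allow at most that many :s to act as separators) could be sketched like this. The function name is hypothetical, and this is neither MediaWiki's code nor mwparserfromhell's; it only restates the rule from the list.

```python
def split_definition_line(line):
    """Split a ';'-prefixed line on ':' at most as many times as it has leading ';'s.

    Returns (depth, parts). Illustrative only: it follows the rule as worded
    in item 3, not any actual parser implementation.
    """
    depth = len(line) - len(line.lstrip(";"))  # number of leading ';'
    if depth == 0:
        return 0, [line]  # not a definition-list line; leave untouched
    body = line[depth:]
    # str.split's second argument caps the number of splits at `depth`
    parts = [part.strip() for part in body.split(":", depth)]
    return depth, parts
```

Under this reading, a line with two ;s would honour two :s and keep any further : as literal text, whereas the current tokenizer stops after one.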

earwig avatar Aug 19 '13 08:08 earwig

Regarding (1), a line from MediaWiki's source:

            # ''Something [http://www.cool.com cool''] -->
            # <i>Something</i><a href="http://www.cool.com"..><i>cool></i></a>

earwig avatar Aug 21 '13 05:08 earwig

Also, this.

== Something ==
'' Hello, world!

== Something else ==
Lorem ipsum dolor sit amet.''

ghost avatar Oct 27 '13 06:10 ghost

So it seems italics/bold can't cross links but can cross templates. I need to figure out exactly which nodes are restrictive.

earwig avatar Oct 28 '13 02:10 earwig

1946cf6

earwig avatar Oct 28 '13 03:10 earwig

Hi! There seems to be a case you've missed.

Bold (and italics I guess) are implicitly closed when wikitable cells end. E.g. http://wiki.teamliquid.net/starcraft2/index.php?title=2014_WCS_Season_1_Europe/Premier&oldid=687367

{| class="wikitable"
|width=190px bgcolor="{{RaceColor|p}}" align="center" | '''{{p}} Protoss ''(13)''
|width=190px bgcolor="{{RaceColor|t}}" align="center" | '''{{t}} Terran ''(8)''
|width=190px bgcolor="{{RaceColor|z}}" align="center" | '''{{z}} Zerg ''(11)''

gives

<table class="wikitable">
<tr>
<td width="190px" bgcolor="#B8F2B8" align="center"> <b><a href="/starcraft2/File:Picon_small.png" class="image" title="Protoss"><img alt="Protoss" src="/starcraft/images2/a/ab/Picon_small.png" width="17" height="15" /></a> Protoss <i>(13)</i></b>
</td>
<td width="190px" bgcolor="#B8B8F2" align="center"> <b><a href="/starcraft2/File:Ticon_small.png" class="image" title="Terran"><img alt="Terran" src="/starcraft/images2/9/9d/Ticon_small.png" width="17" height="15" /></a> Terran <i>(8)</i></b>
</td>
<td width="190px" bgcolor="#F2B8B8" align="center"> <b><a href="/starcraft2/File:Zicon_small.png" class="image" title="Zerg"><img alt="Zerg" src="/starcraft/images2/c/c9/Zicon_small.png" width="17" height="15" /></a> Zerg <i>(11)</i></b>
</td>

Prillan avatar Apr 18 '14 09:04 Prillan

Hmm... yeah, that's tough because the parser doesn't understand tables yet. I'll need to add that before this is fixable.

earwig avatar Apr 18 '14 17:04 earwig

Pulling in a workaround from #80: @earwig suggested passing skip_style_tags=True to mwparserfromhell.parse to work around @Prillan's issue. This worked perfectly.

To get this feature, I had to track the development version on GitHub rather than the released version on PyPI. Here's the line from my requirements.txt:

-e git+https://github.com/earwig/mwparserfromhell.git#egg=mwparserfromhell
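For reference, a minimal sketch of the workaround, assuming an installed version of mwparserfromhell that accepts the skip_style_tags argument mentioned above. The helper name templates_ignoring_style is made up for illustration, and the guarded import is only there so the snippet loads even where the library is missing.

```python
try:
    import mwparserfromhell  # third-party; needs a version with skip_style_tags
except ImportError:
    mwparserfromhell = None


def templates_ignoring_style(wikitext):
    """Return the templates found when '' and ''' are left as plain text.

    Hypothetical helper: it only wraps the parse() call suggested in the thread.
    """
    if mwparserfromhell is None:
        raise RuntimeError("mwparserfromhell is not installed")
    # skip_style_tags=True tells the parser not to treat '' / ''' as tags
    code = mwparserfromhell.parse(wikitext, skip_style_tags=True)
    return [str(template) for template in code.filter_templates()]
```

The trade-off is that bold and italics no longer appear as Tag nodes at all, so this only helps when the style markup itself is not needed.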

danvk avatar Sep 07 '14 14:09 danvk

Most of this is going to require an overhaul of how parsing is done (I finally have an idea how I'm going to do it, but it'll be a lot of work)... so pushing this back as the main task for v1.0.

earwig avatar May 23 '15 23:05 earwig

Consider this wikitext:

''foo
bar''

MediaWiki 1.26 parses this as

<i>foo</i>
bar

which suggests that style markup cannot span multiple lines. mwparserfromhell instead does this the hard/old(?) way and treats both lines as a single italic span:

\n
<
      i
>
      foo\nbar
</
      i
>
\n
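To make the per-line behaviour concrete, here is a deliberately simplified sketch that handles italics one line at a time and closes anything left open at the end of the line. It is an assumption-laden toy: it only understands two-apostrophe italics (no bold, no apostrophe-count disambiguation), the function name is hypothetical, and it is not how either parser is implemented.

```python
def render_italics_per_line(wikitext):
    """Convert '' pairs to <i>...</i>, treating every line independently."""
    out_lines = []
    for line in wikitext.split("\n"):
        pieces = line.split("''")
        rendered = []
        for index, piece in enumerate(pieces):
            if index:
                # odd-numbered delimiters open italics, even-numbered ones close
                rendered.append("<i>" if index % 2 else "</i>")
            rendered.append(piece)
        if len(pieces) % 2 == 0:
            # an odd number of '' on this line: close the dangling italics
            rendered.append("</i>")
        # drop empty pairs such as the one produced by a trailing ''
        out_lines.append("".join(rendered).replace("<i></i>", ""))
    return "\n".join(out_lines)
```

Fed the two-line example above, this yields the same shape as the MediaWiki output quoted in this comment: italics closed at the end of the first line and nothing italic on the second.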

lahwaacz avatar Feb 23 '16 07:02 lahwaacz

Oh joy.

earwig avatar Feb 23 '16 17:02 earwig

almond.txt

The attached file is a reduced version of https://en.wikipedia.org/w/index.php?title=Almond&oldid=706024513. I'd like to reduce it more, but any structural change anywhere in the text makes the problem disappear, so I don't know if this is actually an instance of this bug.

The initial table is parsed correctly, subject to point 2 above, i.e. the unclosed <small> and <center> tags are returned as plain text. But everything after the table is returned as plain text too, with the exception of headings and lists. For example:

===
       Almond flour and skins
===
\n[[Almond flour]] is often used as a [[gluten-free]] alternative to wheat flour

Replicating the initial line, like this:

{|
|-
| Production<small>(million tonnes)
|-
| Production<small>(million tonnes)
|-
| {{flag|USA}} || style="text-align:center;"|<center> 1.8
|-

results in the rest of the table not being parsed either:


      
            
                   Production<small>(million tonnes)\n
            </
                  td
            >
      </
            tr
      >
      |-\n| Production<small>(million tonnes)\n|-\n| {{flag|USA}} || style="text-align:center;"|<center> 1.8\n|-\n| {{flag|Australia}} || style="text-align:center;"|<center> 0.16\n|-\n| {{flag|Spain}} || style="text-align:center;" |<center> 0.15\n|-\n| {{flag|Morocco}} || style="text-align:center;"|<center> 0.1\n|-\n| {{flag|Iran}} || style="text-align:center;"|<center> 0.09\n|-\n!'''World''' !! style="text-align:center;"|<center> '''2.92'''\n
</
      table
>

mhsmith avatar Apr 08 '16 10:04 mhsmith

Here's a really weird example from https://fr.wikipedia.org/w/index.php?title=Opposition_p%C3%A9rih%C3%A9lique&oldid=112493222 :

[[Image:Opposition périhélique.PNG|thumb|250px|Schéma présentant les oppositions périhélique et aphélique de la {{quoi|[[Terre]] et de [[Mars (planète)|Mars]]]]
On dit que deux corps célestes sont en '''opposition périhélique''' lorsque tous deux sont simultanément au [[périhélie]] de leur orbite en alignement parfait avec le [[Soleil]]. Il en résulte que la distance entre ces deux corps célestes est alors minimale.}}

With the template interrupted by the end of the image context, MediaWiki appears to actually invoke the template twice in order to achieve the author's (presumed) intention.

mhsmith avatar Apr 09 '16 14:04 mhsmith

Replying to #148: perhaps. AWB marks many of the pages with this issue as having unclosed tags, but not all of them; for example, it reports no tag errors in https://ru.wikipedia.org/w/index.php?title=%D0%9B%D0%B8%D0%BC%D0%BE%D0%BD&oldid=76351442. This page is reported without errors too.

The tables sit in one section of the page, yet the parser fails to see templates in other sections. Could the parser recognize a "== ==" heading as a secondary end-of-table marker?
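One way to experiment with this idea without touching the tokenizer is a pre-processing pass that inserts the missing |} before a heading whenever a table is still open. The sketch below is only that: a hypothetical helper outside mwparserfromhell, with a simplified heading pattern. It is a heuristic, since headings can legitimately appear inside table cells.

```python
import re

# Simplified heading pattern: the same run of 1-6 '=' on both ends of the line.
HEADING = re.compile(r"^(={1,6})(.+?)\1\s*$")


def close_tables_at_headings(wikitext):
    """Insert '|}' before a heading for every table that is still open."""
    depth = 0  # number of currently open '{|' tables
    out = []
    for line in wikitext.split("\n"):
        stripped = line.lstrip()
        if stripped.startswith("{|"):
            depth += 1
        elif stripped.startswith("|}"):
            depth = max(depth - 1, 0)
        elif depth and HEADING.match(line):
            out.extend(["|}"] * depth)  # treat the heading as an end-of-table mark
            depth = 0
        out.append(line)
    return "\n".join(out)
```

The repaired text could then be handed to the parser as usual; correctly closed tables pass through unchanged.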

vladiscripts avatar Apr 18 '16 08:04 vladiscripts

Other weird ones with malformed italics in templates:

mwparserfromhell.parse("{{foo|''bar}} {{foo|bar''}}").filter_templates()
# => ["{{foo|''bar}}", "{{foo|bar''}}"]

mwparserfromhell.parse("{{foo|''bar}} ''...'' {{foo|bar''}}").filter_templates()
# => ["{{foo|bar''}}"]

mwparserfromhell.parse("{{foo|''bar}} ''").filter_templates()
# => []

mwparserfromhell.parse("{{foo|''bar}} ''bar''").filter_templates()
# => []

mwparserfromhell.parse("{{foo|''bar}}").filter_templates()
# => ["{{foo|''bar}}"]

bfontaine avatar Jan 12 '21 13:01 bfontaine