mwparserfromhell icon indicating copy to clipboard operation
mwparserfromhell copied to clipboard

Bug: Table attributes are parsed as Text

Open Eccenux opened this issue 3 years ago • 5 comments

There seem to be something weird going on when parsing Tables. I get attributes as Text nodes.

Let's consider this example:

import mwparserfromhell
text = """
{| class="wikitable"
|-
! style="width:101px" | heading1
! style="width:102px" | heading2
|- data-test="whatever"
| style="width:201px" | testing
|}
""".strip()
code = mwparserfromhell.parse(text)
# code.filter_tags(matches=lambda node: node.tag == "table")
# print (code.get_tree())

for node in code.filter():
	if isinstance(node, mwparserfromhell.nodes.tag.Tag):
		print(f'[{node.__class__.__name__}]({node.tag})\n"""\n', node, '\n""""')
	else:
		print(f'[{node.__class__.__name__}]\n"""\n', node, '\n""""')

This how the text is rendered by MW:

<table class="wikitable">
	<tr>
		<th style="width:101px">heading1</th>
		<th style="width:102px">heading2</th>
	</tr>
	<tr data-test="whatever">
		<td style="width:201px">testing</td>
	</tr>
</table>

So there are 3 cells here. In HTML textContent would be heading1, heading2, testing.

But in MWPFH I get Text nodes for attributes. For example:

[Tag](td)
"""
 | style="width:201px" | testing

"""

[Text]
"""
 style
"""

[Text]
"""
 width:201px
"""

[Text]
"""
  testing

"""

I would expect those attributes wouldn't appear as Text nodes at all. They are already available node.attributes so that should be enough.

So when traversing above cell, it should only show:

[Tag](td)
"""
 | style="width:201px" | testing

"""

[Text]
"""
  testing

"""

Eccenux avatar Jun 05 '22 23:06 Eccenux

You are doing a recursive filter on the wikicode. Those text nodes are nested inside the tag nodes. The tree of your snippet looks correct:

>>> print(code.get_tree())
<
      table
      class
    = wikitable
>
      <
            tr
      >
            <
                  th
                  style
                = width:101px
            >
                   heading1\n
            </
                  th
            >
            <
                  th
                  style
                = width:102px
            >
                   heading2\n
            </
                  th
            >
      </
            tr
      >
      <
            tr
            data-test
          = whatever
      >
            <
                  td
                  style
                = width:201px
            >
                   testing\n
            </
                  td
            >
      </
            tr
      >
</
      table
>

The table attributes are parsed as text because MediaWiki parses table attributes before producing the table. For example, table attributes can contain templates and other wikicode. So I would say this is not a bug in mwparserfromhell.

lahwaacz avatar Jun 27 '22 13:06 lahwaacz

The table attributes are parsed as text because MediaWiki parses table attributes before producing the table. For example, table attributes can contain templates and other wikicode. So I would say this is not a bug in mwparserfromhell.

I don't insist in calling it a bug, but still it doesn't make sense when parsing stuff with a bot. When I want change something in texts I explicitly don't want to mess with table attributes.

Eccenux avatar Jun 27 '22 21:06 Eccenux

I don't insist in calling it a bug, but still it doesn't make sense when parsing stuff with a bot. When I want change something in texts I explicitly don't want to mess with table attributes.

Then you need to adjust your recursive iteration. For example:

for node in code.filter():
    parent = code.get_parent(node)
    if isinstance(parent, mwparserfromhell.nodes.tag.Tag) and not parent.contents.contains(node):
        print(f"parent of '{node}' is a tag and the node is not in the tag's contents, skipped")
        continue
    if isinstance(node, mwparserfromhell.nodes.tag.Tag):
        print(f'[{node.__class__.__name__}]({node.tag})\n"""\n', node, '\n""""')
    else:
        print(f'[{node.__class__.__name__}]\n"""\n', node, '\n""""')

lahwaacz avatar Jun 27 '22 21:06 lahwaacz

@lahwaacz is right, but maybe there's room here for improvement. We might want to distinguish between Text nodes that are "visible"/rendered and Text nodes that are purely internal like template parameter names and tag attributes.

It's basically the problem being solved by strip_code but generalized for other use cases, to allow a way to filter VisibleText nodes or something.

earwig avatar Jun 30 '22 03:06 earwig

@earwig, :+1: because the loop like

for node in code.filter():
    parent = code.get_parent(node)

that is needed for this is not efficient as it basically iterates again for each node, c.f. #195.

lahwaacz avatar Jun 30 '22 05:06 lahwaacz