mwparserfromhell
mwparserfromhell copied to clipboard
Bug: Table attributes are parsed as Text
There seem to be something weird going on when parsing Tables. I get attributes as Text nodes.
Let's consider this example:
import mwparserfromhell
text = """
{| class="wikitable"
|-
! style="width:101px" | heading1
! style="width:102px" | heading2
|- data-test="whatever"
| style="width:201px" | testing
|}
""".strip()
code = mwparserfromhell.parse(text)
# code.filter_tags(matches=lambda node: node.tag == "table")
# print (code.get_tree())
for node in code.filter():
if isinstance(node, mwparserfromhell.nodes.tag.Tag):
print(f'[{node.__class__.__name__}]({node.tag})\n"""\n', node, '\n""""')
else:
print(f'[{node.__class__.__name__}]\n"""\n', node, '\n""""')
This how the text is rendered by MW:
<table class="wikitable">
<tr>
<th style="width:101px">heading1</th>
<th style="width:102px">heading2</th>
</tr>
<tr data-test="whatever">
<td style="width:201px">testing</td>
</tr>
</table>
So there are 3 cells here. In HTML textContent would be heading1, heading2, testing.
But in MWPFH I get Text nodes for attributes. For example:
[Tag](td)
"""
| style="width:201px" | testing
"""
[Text]
"""
style
"""
[Text]
"""
width:201px
"""
[Text]
"""
testing
"""
I would expect those attributes wouldn't appear as Text nodes at all. They are already available node.attributes so that should be enough.
So when traversing above cell, it should only show:
[Tag](td)
"""
| style="width:201px" | testing
"""
[Text]
"""
testing
"""
You are doing a recursive filter on the wikicode. Those text nodes are nested inside the tag nodes. The tree of your snippet looks correct:
>>> print(code.get_tree())
<
table
class
= wikitable
>
<
tr
>
<
th
style
= width:101px
>
heading1\n
</
th
>
<
th
style
= width:102px
>
heading2\n
</
th
>
</
tr
>
<
tr
data-test
= whatever
>
<
td
style
= width:201px
>
testing\n
</
td
>
</
tr
>
</
table
>
The table attributes are parsed as text because MediaWiki parses table attributes before producing the table. For example, table attributes can contain templates and other wikicode. So I would say this is not a bug in mwparserfromhell.
The table attributes are parsed as text because MediaWiki parses table attributes before producing the table. For example, table attributes can contain templates and other wikicode. So I would say this is not a bug in mwparserfromhell.
I don't insist in calling it a bug, but still it doesn't make sense when parsing stuff with a bot. When I want change something in texts I explicitly don't want to mess with table attributes.
I don't insist in calling it a bug, but still it doesn't make sense when parsing stuff with a bot. When I want change something in texts I explicitly don't want to mess with table attributes.
Then you need to adjust your recursive iteration. For example:
for node in code.filter():
parent = code.get_parent(node)
if isinstance(parent, mwparserfromhell.nodes.tag.Tag) and not parent.contents.contains(node):
print(f"parent of '{node}' is a tag and the node is not in the tag's contents, skipped")
continue
if isinstance(node, mwparserfromhell.nodes.tag.Tag):
print(f'[{node.__class__.__name__}]({node.tag})\n"""\n', node, '\n""""')
else:
print(f'[{node.__class__.__name__}]\n"""\n', node, '\n""""')
@lahwaacz is right, but maybe there's room here for improvement. We might want to distinguish between Text nodes that are "visible"/rendered and Text nodes that are purely internal like template parameter names and tag attributes.
It's basically the problem being solved by strip_code but generalized for other use cases, to allow a way to filter VisibleText nodes or something.
@earwig, :+1: because the loop like
for node in code.filter():
parent = code.get_parent(node)
that is needed for this is not efficient as it basically iterates again for each node, c.f. #195.