mwparserfromhell
mwparserfromhell copied to clipboard
Heading node should "eat" the following whitespace and newline
When headings are parsed, as far as I know, the wiki software requires that nothing except whitespace be on the line following the heading. So
==head==
Will be parsed as a heading by the wiki, but
==head== foo
is interpreted as raw text.
mwparserfromhell will emit a separate text node containing the following newline and any preceding spaces. But it's possible to remove this node, which then results in a parse tree that can't actually exist in the wiki: a Heading node without a following Text node beginning with a newline. The following newline and the preceding whitespace should really be implicit in the heading, so it should be "eaten" by the Heading node, rather than be converted into a separate text node. Maybe any whitespace should be preserved, but if so, it should be possible to strip it from the Heading node.
The node following the heading should be the first node on the next line, not the newline. If any non-whitespace intervenes between the heading and the newline, mwparserfromhell should not emit a heading at all but should parse it as inline text (possibly containing templates and such), just like the wiki software does.
I see what you're saying. Good suggestion; I'll get on it at some point soon.
The real problem here is that, while ==head== foo is always disallowed by MediaWiki, ==head== {{foo}} is only disallowed if {{foo}} contains text (and is not just adding a comment or something). mwparser has no way to detect this.
I think this brings up the problem of comments as whitespace as well - from what I can tell MediaWiki parses <!-- comment -->=heading= as an HTML comment and a heading, but mwparserfromhell parses the heading as text because there isn't a leading newline. I also think (but am not sure) that this means that MediaWiki could interpret {{template}}=heading= with =heading= as a heading, but only if {{template}} contains a comment with no whitespace outside of it. I've been running into these issues when trying to parse tables, and they're really obscure situations but I'm not sure how to handle them yet.
Yes, that seems correct. MediaWiki works by first substituting templates and removing HTML comments before it converts headings into real <h2>... etc tags. Thus, if a template on the same line as a heading is empty, the heading will still be parsed correctly (and furthermore, this is necessary for MediaWiki to parse headings located inside templates).
Since mwparser works in a fundamentally different way and we can never determine what the parse tree should truly look like, I think it's better to be safe than sorry with regards to determining whether something is a real heading or not. To that effect, I think foo ==head== and ==head== foo should always be disallowed (since they are always invalid), but {{foo}} ==head== and ==head== {{foo}} should be allowed since they only might be invalid. <!-- foo -->==head== is never invalid, so the fact that mwparser doesn't treat it correctly is a bug.
That sounds like a good strategy (though {{foo}} ==head== would also fail because of the extra whitespace, only {{foo}}==head== is valid). But with reference to the actual issue, I think its pretty much impossible to stop users from creating an incorrect parse tree if you allow them to delete and insert arbitrary nodes, though this seems like a situation in which it's easy to do so accidentally. I suppose it might be possible to create some way of including a meta-node or other way of tracking dependencies on these kind of whitespace/newline characters? That would probably make the API a lot more complicated.
This also causes problems when encountering things like:
= ∞*b then that implies a/b = ∞*0. A similar proof could be done with the
in the middle of a paragraph. This is interpreted as a heading 1 when it should be continuation of text.
That's from https://en.wikipedia.org/w/index.php?title=Wikipedia:Teahouse/Questions/Archive_296&action=edit. The proposed solution in https://github.com/earwig/mwparserfromhell/issues/55#issuecomment-49209288 should cover this
I've 'fixed' it for now with https://en.wikipedia.org/w/index.php?title=Wikipedia%3ATeahouse%2FQuestions%2FArchive_296&type=revision&diff=695965028&oldid=655041713
Also note that the amount of blank lines after a heading does not matter, e.g.
== foo ==
bar
is parsed by MediaWiki as
<h2><span class="mw-headline" id="foo">foo</span></h2>
<p>bar
</p>