mwparserfromhell Heading node should "eat" the following whitespace and newline

When headings are parsed, as far as I know, the wiki software requires that nothing except whitespace follow the heading on the same line. So

==head==

will be parsed as a heading by the wiki, but

==head== foo

is interpreted as raw text.

mwparserfromhell will emit a separate Text node containing the following newline and any preceding spaces. But it's possible to remove this node, which then results in a parse tree that can't actually exist in the wiki: a Heading node without a following Text node beginning with a newline. The following newline and the preceding whitespace should really be implicit in the heading, so they should be "eaten" by the Heading node rather than converted into a separate Text node. Maybe the whitespace should be preserved, but if so, it should be possible to strip it from the Heading node.

The node following the heading should be the first node on the next line, not the newline. If any non-whitespace intervenes between the heading and the newline, mwparserfromhell should not emit a heading at all but should parse it as inline text (possibly containing templates and such), just like the wiki software does.
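
For illustration, the line rule described above can be sketched in Python with a regular expression. This is a simplified approximation, not MediaWiki's or mwparserfromhell's actual code; the name heading_level is hypothetical.

```python
import re

# Simplified approximation of the rule: a run of 1-6 "=" at the start of the
# line, a non-empty title, the same run of "=", then only spaces or tabs.
HEADING_RE = re.compile(r"^(={1,6})(.+?)\1[ \t]*$")

def heading_level(line):
    """Return the heading level (1-6) if `line` is a heading, else None."""
    match = HEADING_RE.match(line)
    return len(match.group(1)) if match else None

for sample in ["==head==", "==head==   ", "==head== foo"]:
    print(repr(sample), heading_level(sample))
```

Under this sketch the first two samples count as level-2 headings, while "==head== foo" is rejected and would be left as plain text.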

Dec 15 '13 22:12 Rua

I see what you're saying. Good suggestion; I'll get on it at some point soon.

Dec 15 '13 22:12 earwig

The real problem here is that, while ==head== foo is always disallowed by MediaWiki, ==head== {{foo}} is only disallowed if {{foo}} contains text (and is not just adding a comment or something). mwparser has no way to detect this.

May 26 '14 08:05 earwig

I think this brings up the problem of comments as whitespace as well - from what I can tell MediaWiki parses <!-- comment -->=heading= as an HTML comment and a heading, but mwparserfromhell parses the heading as text because there isn't a leading newline. I also think (but am not sure) that this means that MediaWiki could interpret {{template}}=heading= with =heading= as a heading, but only if {{template}} contains a comment with no whitespace outside of it. I've been running into these issues when trying to parse tables, and they're really obscure situations, but I'm not sure how to handle them yet.

Jul 16 '14 18:07 davidswinegar

Yes, that seems correct. MediaWiki works by first substituting templates and removing HTML comments before it converts headings into real <h2> (etc.) tags. Thus, if a template on the same line as a heading is empty, the heading will still be parsed correctly (and furthermore, this is necessary for MediaWiki to parse headings located inside templates).

Since mwparser works in a fundamentally different way and we can never determine what the parse tree should truly look like, I think it's better to be safe than sorry with regard to determining whether something is a real heading or not. To that effect, I think foo ==head== and ==head== foo should always be disallowed (since they are always invalid), but {{foo}} ==head== and ==head== {{foo}} should be allowed since they only might be invalid. <!-- comment -->==head== is never invalid, so the fact that mwparser doesn't treat it correctly is a bug.
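
A minimal sketch of that "safe rather than sorry" policy, assuming lines are checked one at a time and only non-nested templates matter. The helper name may_be_heading and all patterns are hypothetical, not part of mwparserfromhell.

```python
import re

# Hypothetical patterns: an HTML comment or a simple (non-nested) template.
ITEM = r"(?:<!--(?:(?!-->).)*-->|\{\{[^{}]*\}\})"
# Comments/templates before the heading, each optionally followed by spaces.
PREFIX_RE = re.compile(r"^(?:" + ITEM + r"[ \t]*)+")
# Spaces, comments and templates after the heading, up to the end of the line.
SUFFIX_RE = re.compile(r"(?:[ \t]|" + ITEM + r")+$")
HEADING_RE = re.compile(r"^(={1,6})(.+?)\1$")

def may_be_heading(line):
    """True if `line` could be a heading once comments and templates that
    might expand to nothing are ignored on either side of it."""
    core = PREFIX_RE.sub("", line)
    core = SUFFIX_RE.sub("", core)
    return HEADING_RE.match(core) is not None

for sample in ["foo ==head==", "==head== foo",
               "{{foo}} ==head==", "==head== {{foo}}",
               "<!-- c -->==head=="]:
    print(repr(sample), may_be_heading(sample))
```

Plain text on either side rejects the line outright, while templates and comments are given the benefit of the doubt, since their expansion is unknown at parse time.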

Jul 16 '14 18:07 earwig

That sounds like a good strategy (though {{foo}} ==head== would also fail because of the extra whitespace; only {{foo}}==head== is valid). But with reference to the actual issue, I think it's pretty much impossible to stop users from creating an incorrect parse tree if you allow them to delete and insert arbitrary nodes, though this seems like a situation in which it's easy to do so accidentally. I suppose it might be possible to create some way of including a meta-node or other way of tracking dependencies on these kinds of whitespace/newline characters? That would probably make the API a lot more complicated.

Jul 16 '14 19:07 davidswinegar

This also causes problems when encountering things like:

= ∞*b  then that implies a/b = ∞*0.  A similar proof could be done with the

in the middle of a paragraph. This is interpreted as a level 1 heading when it should be a continuation of the text.

That's from https://en.wikipedia.org/w/index.php?title=Wikipedia:Teahouse/Questions/Archive_296&action=edit. The proposed solution in https://github.com/earwig/mwparserfromhell/issues/55#issuecomment-49209288 should cover this.
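
As a rough check of that claim, here is a sketch that scans a multi-line text for heading lines using the "only whitespace after the closing = signs" rule. The function name find_headings and the surrounding sample lines are made up for illustration; only the line beginning with "= ∞*b" comes from the page above.

```python
import re

# Whole-line heading rule, applied per line via re.MULTILINE.
HEADING_LINE_RE = re.compile(r"^(={1,6})(.+?)\1[ \t]*$", re.MULTILINE)

def find_headings(text):
    """Return (level, title) pairs for every line that is a heading."""
    return [(len(m.group(1)), m.group(2).strip())
            for m in HEADING_LINE_RE.finditer(text)]

text = (
    "== Section ==\n"            # placeholder heading, not from the page
    "some paragraph text\n"      # placeholder text, not from the page
    "= ∞*b  then that implies a/b = ∞*0.  A similar proof could be done with the\n"
    "rest of the paragraph\n"    # placeholder text, not from the page
)
print(find_headings(text))
```

Because the line starting with "=" does not end in a matching "=" followed only by whitespace, this sketch leaves it as paragraph text.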

Dec 19 '15 23:12 yuvipanda

I've 'fixed' it for now with https://en.wikipedia.org/w/index.php?title=Wikipedia%3ATeahouse%2FQuestions%2FArchive_296&type=revision&diff=695965028&oldid=655041713

Dec 19 '15 23:12 yuvipanda

Also note that the number of blank lines after a heading does not matter, e.g.

== foo ==





bar

is parsed by MediaWiki as

<h2><span class="mw-headline" id="foo">foo</span></h2>
<p>bar
</p>
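
One way to mirror this when "eating" the whitespace after a heading is to drop every whitespace-only line that follows it. A minimal sketch, assuming the text starts with the heading line; split_heading is a hypothetical helper, not an mwparserfromhell API.

```python
import re

HEADING_RE = re.compile(r"^(={1,6})(.+?)\1[ \t]*$")
# Any run of empty or whitespace-only lines at the start of the remainder.
BLANK_LINES_RE = re.compile(r"\A(?:[ \t]*\n)+")

def split_heading(text):
    """Return (title, rest) with blank lines after the heading removed,
    or None if the first line is not a heading."""
    first, _, rest = text.partition("\n")
    match = HEADING_RE.match(first)
    if match is None:
        return None
    return match.group(2).strip(), BLANK_LINES_RE.sub("", rest)

print(split_heading("== foo ==\n\n\n\n\n\nbar\n"))
print(split_heading("== foo ==\nbar\n"))
```

Both calls give the same title and the same remainder, which matches the observation that the blank lines do not affect the rendered output.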

Jun 28 '16 07:06 lahwaacz
