mwparserfromhell List tags should include the item text in their contents

Open Rua opened this issue 10 years ago • 16 comments

When parsing ordered or unordered lists, the actual text in the list item is returned as a sequence of separate nodes of various types following the "Tag" node, all belonging to the same parent. For example:

# this {{is}}
# a [[test]]

filter(recursive=False) will now give these node types (unicode equivalent in brackets): Tag (#), Text ( this ), Template ({{is}}), Text (\n), Tag (#), Text ( a ), Wikilink ([[test]]), Text (\n)

Note that the last Text node of each item contains the newline that actually ends the list item. The parser should instead treat the whole line as one unit, by including everything on that line as part of the Tag node's .contents property. So this should really return: Tag (# this {{is}}\n), Tag (# a [[test]]\n)
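Until the parser does this itself, the grouping can be approximated after parsing. The following is a minimal sketch, not library code: it uses plain (type, text) tuples as a stand-in for real nodes, and the helper name group_list_items is hypothetical. It attaches everything after a list marker, up to and including the ending newline, to that marker.

```python
from typing import List, Optional, Tuple

Node = Tuple[str, str]  # (node type, wikitext); a stand-in for real parser nodes
LIST_MARKERS = ("#", "*", ";", ":")


def group_list_items(nodes: List[Node]) -> List[str]:
    """Merge each list-marker Tag with the nodes that follow it on the same line."""
    items: List[str] = []
    current: Optional[List[str]] = None  # pieces of the open item, or None outside one
    for kind, text in nodes:
        if kind == "Tag" and text in LIST_MARKERS:
            if current is not None:
                items.append("".join(current))
            current = [text]
        elif current is not None:
            if kind == "Text" and "\n" in text:
                # The newline ends the item; text after it is not part of the item.
                head, _, _tail = text.partition("\n")
                current.append(head + "\n")
                items.append("".join(current))
                current = None
            else:
                current.append(text)
    if current is not None:
        items.append("".join(current))
    return items


# The flat node sequence described above, written out by hand.
flat = [
    ("Tag", "#"), ("Text", " this "), ("Template", "{{is}}"), ("Text", "\n"),
    ("Tag", "#"), ("Text", " a "), ("Wikilink", "[[test]]"), ("Text", "\n"),
]
grouped = group_list_items(flat)
```

With the real library, the same loop would presumably run over the parsed node list and test for Tag nodes whose wiki markup is a list marker; that mapping is an assumption and is not shown here.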

Similar behaviour should also apply to unordered lists (*) and to definition lists (; and :).

Sep 05 '13 19:09 Rua

I thought about this when originally coding list support; I decided against it to make the implementation easier, but you're right that it should be fixed in the long term.

Sep 05 '13 23:09 earwig

Ideally there would IMO be a "ul", "ol" or "dl" tag which contains all of the list items.

Mar 28 '14 11:03 jfolz

Yes, that is another component to it.

Mar 28 '14 17:03 earwig

I've been looking at this problem some more, and a few interesting things came up.

MediaWiki handles it well when :, # and * are placed together on one line, as they all have the same terminating string (newline). But weirder things happen when you throw ; into the mix with # or *.

For example:

;* foo : bar

I would say the "correct" or at least most intuitive behaviour is to produce this HTML:

<dl>
<dt><ul><li>foo</li></ul></dt>
<dd>bar</dd>
</dl>

However, MediaWiki produces this HTML instead:

<dl>
<dt>foo&#160;</dt>
<dd><ul><li>bar</li></ul></dd>
</dl>

It appears that MediaWiki doesn't know how to nest this right, and makes a mess of it. It has similar behaviours with ;# as well, and ;* apparently gets treated as identical to :*.

The very similar-looking

*; foo : bar

is correctly parsed by MediaWiki, however.

So when this feature gets implemented, this is something that needs to be considered: MediaWiki doesn't like regular lists nested inside definition terms.

Oct 25 '15 20:10 Rua

Having worked at this more, I am also thinking about the best implementation. When multiple levels of lists are stacked, like

#*:

there isn't that much of an advantage in nesting these. All the content will end up in the inner node, which is not particularly practical.

So I think that the way such lists are parsed should be changed, so that the entire combination of list tags is included in one Tag node, with the wiki markup of the node simply set to all the tags together, so "#*:" in this case. The contents of the node would then be set to whatever follows, up to the final newline or : (in the case of definition lists). If the contents end in a newline, then this newline should be included as the closing wikitext of the Tag node, rather than being part of a following text node.
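As a rough illustration of this coalescing idea, the sketch below splits one line into the marker run, the contents, and the closing newline. The names CoalescedItem and split_list_line are hypothetical, and the special case of a : ending a definition term is deliberately left out.

```python
from typing import NamedTuple

LIST_MARKERS = "#*:;"


class CoalescedItem(NamedTuple):
    markers: str   # the whole run of list markers, e.g. "#*:"
    contents: str  # everything after the markers, without the final newline
    closing: str   # "\n" if the line ended with one, else ""


def split_list_line(line: str) -> CoalescedItem:
    """Split a single list line into marker run, contents, and closing newline."""
    end = 0
    while end < len(line) and line[end] in LIST_MARKERS:
        end += 1
    rest = line[end:]
    if rest.endswith("\n"):
        return CoalescedItem(line[:end], rest[:-1], "\n")
    return CoalescedItem(line[:end], rest, "")


item = split_list_line("#*: nested text\n")
```

Joining the three fields reproduces the original line, so nothing is lost in the round trip.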

How does this sound?

Oct 31 '15 16:10 Rua

IMHO nesting the nodes is the most natural thing to do. Why exactly would you not do it? If the levels are stacked like you described, traversing the AST would be much harder because the number of list types would be infinite. Also, it conflicts with this suggestion as there would not be any node containing all the list items (at given level).
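For comparison, a nested representation could be built from the marker prefixes roughly as follows. This is only a sketch under simplifying assumptions: ListNode and nest_list_lines are hypothetical names, all four markers are treated alike, and MediaWiki's special handling of ; and : is ignored.

```python
from dataclasses import dataclass, field
from typing import List

MARKERS = "#*:;"


@dataclass
class ListNode:
    marker: str  # "#", "*", ":" or ";" ("" for the root)
    text: str = ""
    children: List["ListNode"] = field(default_factory=list)


def nest_list_lines(lines: List[str]) -> ListNode:
    """Build a tree in which an item with prefix "#*" is a child of the "#" item above it."""
    root = ListNode("")
    path = [root]      # path[d] is the open node at depth d
    open_markers = ""  # marker run of the previous list line
    for line in lines:
        body = line.lstrip(MARKERS)
        markers = line[: len(line) - len(body)]
        if not markers:
            continue  # not a list line; ignored in this sketch
        parents = markers[:-1]
        keep = 0
        while (keep < len(parents) and keep < len(open_markers)
               and parents[keep] == open_markers[keep]):
            keep += 1
        del path[keep + 1:]       # close levels not shared with this line
        for marker in parents[keep:]:
            implicit = ListNode(marker)  # parent level with no text of its own
            path[-1].children.append(implicit)
            path.append(implicit)
        node = ListNode(markers[-1], body.strip())
        path[-1].children.append(node)
        path.append(node)
        open_markers = markers
    return root


tree = nest_list_lines(["# a", "#* b", "#* c", "# d"])
```

Here each level has a fixed set of node types, and every node's children form the list at that level, which is what makes traversal and a containing list node straightforward.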

Oct 31 '15 17:10 lahwaacz

Agree that we should nest nodes; the alternative seems too unexpected in most cases. Since there are plans for multiple parsing "modes" to solve various other issues, it's possible we can support both nesting and coalescing, but I'm not sure if the added complexity is worth it.

Oct 31 '15 20:10 earwig

What is the status of this? As it is, since list contents are not nested, there's no easy way to remove a list (or a list item) completely from the AST. If you remove the list tag node, you just leave behind the content.
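One possible workaround is to drop the marker together with everything up to the newline that ends the item. The sketch below works on plain (type, text) tuples as a stand-in for real nodes; remove_list_item is a hypothetical helper, and it assumes one marker Tag per item (stacked markers such as #* would need extra handling).

```python
from typing import List, Tuple

Node = Tuple[str, str]  # (node type, wikitext); a stand-in for real parser nodes
LIST_MARKERS = ("#", "*", ";", ":")


def remove_list_item(nodes: List[Node], item_index: int) -> List[Node]:
    """Return a copy of nodes without the item_index-th list item and its content."""
    result: List[Node] = []
    seen = -1          # index of the most recent list marker
    skipping = False   # True while inside the item being removed
    for kind, text in nodes:
        if kind == "Tag" and text in LIST_MARKERS:
            seen += 1
            skipping = seen == item_index
        if skipping:
            if kind == "Text" and "\n" in text:
                # The newline ends the removed item; keep any text after it.
                skipping = False
                tail = text.split("\n", 1)[1]
                if tail:
                    result.append(("Text", tail))
            continue
        result.append((kind, text))
    return result


flat = [
    ("Tag", "#"), ("Text", " this "), ("Template", "{{is}}"), ("Text", "\n"),
    ("Tag", "#"), ("Text", " a "), ("Wikilink", "[[test]]"), ("Text", "\n"),
]
remaining = remove_list_item(flat, 0)
```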

Mar 09 '17 03:03 BrenBarn

No updates, sorry. You are right that it presents a substantial limitation. I won't have time to work on this project for a while.

Mar 09 '17 05:03 earwig

For what it's worth, I wrote this code for grouping the nodes into list items a while ago.

I haven't got around to merging that yet on my own project, so keep in mind that it's just a barely-tested hack, but it might help you identify individual items and remove them.

Mar 09 '17 08:03 eggpi

I took a look at the parsing code. There seem to be several situations where linebreaks really ought to be considered tokens, but are not. This results in parsing that differs from MediaWiki. For instance, in MediaWiki, if you have (admittedly perverse) markup like this:

Some stuff with ''formatting with newline
in it'' and other stuff

Then what you get is

Some stuff with formatting with newline in it and other stuff

That is, a line break ends the italic formatting. We also need the line break to end the list item.
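To make the idea concrete, here is a toy tokenizer in which a newline is its own token and implicitly closes open italics. It is a sketch only, not the library's tokenizer: the token names are invented for illustration, and bold (''') and every other construct are ignored.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (token kind, text); kinds are illustrative only


def tokenize_italics(text: str) -> List[Token]:
    """Tokenize text so that a newline ends any italic span that is still open."""
    tokens: List[Token] = []
    buffer: List[str] = []
    italic_open = False

    def flush() -> None:
        # Emit accumulated plain text as one TEXT token.
        if buffer:
            tokens.append(("TEXT", "".join(buffer)))
            buffer.clear()

    i = 0
    while i < len(text):
        if text.startswith("''", i):
            flush()
            tokens.append(("ITALIC_CLOSE" if italic_open else "ITALIC_OPEN", "''"))
            italic_open = not italic_open
            i += 2
        elif text[i] == "\n":
            flush()
            if italic_open:
                tokens.append(("ITALIC_CLOSE", ""))  # implicit close at the line break
                italic_open = False
            tokens.append(("NEWLINE", "\n"))
            i += 1
        else:
            buffer.append(text[i])
            i += 1
    flush()
    if italic_open:
        tokens.append(("ITALIC_CLOSE", ""))  # implicit close at end of input
    return tokens


sample = "Some stuff with ''formatting with newline\nin it'' and other stuff"
kinds = [kind for kind, _ in tokenize_italics(sample)]
```

In this sketch the second '' opens a new italic span rather than closing the first one, because the first span was already ended by the newline. A list item could be closed by the same NEWLINE token.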

Mar 11 '17 03:03 BrenBarn

Oh, I didn't realize that. Okay... well, that makes possible solutions a bit easier. Line breaks likely should be treated as tokens then.

Mar 11 '17 19:03 earwig

The linebreaks problem is most likely related to https://github.com/earwig/mwparserfromhell/issues/40#issuecomment-187586605.

Also note that linebreaks can be considered tokens only sometimes. To make the above snippet even more perverse, consider this:

Some stuff with ''formatting with <pre>newline
in</pre> it'' and other stuff

Then you'll get this:

Some stuff with <i>formatting with <pre>newline
in</pre> it</i> and other stuff

Note that some tags are not like the others: style markup is terminated at linebreaks inside <span>, but not inside <pre>.

Mar 11 '17 19:03 lahwaacz

That's true, although I think it's not unreasonable to punt on handling <pre> since it's kind of a different kettle of fish. Even other HTML tags aren't interpreted inside <pre>.

Mar 12 '17 02:03 BrenBarn

Obviously at least <nowiki> behaves the same as <pre> here and who knows how each extension tag behaves. There might also be other inconsistencies we haven't found yet.
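One way to account for this is to decide up front which newlines may act as tokens. The sketch below treats only <pre> and <nowiki> as protected, based on the two tags named in this thread; how other extension tags behave is unknown here, and significant_newlines is a hypothetical helper. Tags with attributes are not matched.

```python
import re
from typing import List

# Spans whose newlines should not terminate style markup (assumed set of tags).
VERBATIM = re.compile(r"<(pre|nowiki)>.*?</\1>", re.DOTALL | re.IGNORECASE)


def significant_newlines(text: str) -> List[int]:
    """Return the offsets of newlines that lie outside <pre> and <nowiki> spans."""
    protected = [match.span() for match in VERBATIM.finditer(text)]
    return [
        offset
        for offset, char in enumerate(text)
        if char == "\n" and not any(start <= offset < end for start, end in protected)
    ]


sample = "Some ''stuff <pre>newline\nin</pre> it''\nand more"
offsets = significant_newlines(sample)
```

Only the newline after the closing '' is reported; the one inside <pre> is skipped.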

Mar 12 '17 08:03 lahwaacz

Is there a workaround to ignore or get the content of the list in wikicode for the issue in #10?

Feb 03 '20 22:02 talalmts
