wordpress-develop HTML API: Allow additional fragment contexts.

Trac-ticket: Core-61576.

Status

[ ] ~Add unit tests covering significant contexts, e.g. SCRIPT, TEXTAREA, SVG.~
- It shouldn't be possible to create fragment parsers on void or self-contained elements. There's no content inside of them to parse. For self-contained elements it's appropriate to rely on the eventual set_inner_html() which will be an alias in those cases for set_inner_text().
[ ] Review all documentation looking for places stating that <body> is the only supported context.

Description

Previously, the fragment parser in WP_HTML_Processor has only allowed creating a fragment with the <body> context. In this patch, any context node is allowed.

Aug 05 '24 22:08 dmsnell

Test using WordPress Playground

The changes in this pull request can previewed and tested using a WordPress Playground instance.

WordPress Playground is an experimental project that creates a full WordPress instance entirely within the browser.

Some things to be aware of

The Plugin and Theme Directories cannot be accessed within Playground.
All changes will be lost when closing a tab with a Playground instance.
All changes will be lost when refreshing the page.
A fresh instance is created each time the link below is clicked.
Every time this pull request is updated, a new ZIP file containing all changes is created. If changes are not reflected in the Playground instance, it's possible that the most recent build failed, or has not completed. Check the list of workflow runs to be sure.

For more details about these limitations and more, check out the Limitations page in the WordPress Playground documentation.

Test this pull request with WordPress Playground.

Aug 05 '24 22:08 github-actions[bot]

The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the props-bot label.

Core Committers: Use this line as a base for the props when committing in SVN:

Props dmsnell, apermo, jonsurrell.

To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook.

Aug 05 '24 23:08 github-actions[bot]

@sirreal it looks like we may have a lot of work to do before we can open this up to allowing self-contained elements as the fragment context. there are multiple challenges:

we have to enter within a certain tokenizer state, but that's not how our parser is designed.
the DOM is able to store things like the text </textarea> inside a TEXTAREA node, but HTML is not able to do this. we have to determine what to do in these cases.
if the closing tag for the self-contained element isn't provided, we cannot parse it to figure out which context node it is. e.g. <script> bails with the tag processor.

Aug 06 '24 05:08 dmsnell

The fragment parser takes an Element. An HTML string seems like a poor representation for that 🙂

I wonder if the fragment parser could take a token (to create a fragment from an existing document) or some other representation that's not just a string in order to create fragments without needing to create a full document first.

The following steps form the HTML fragment parsing algorithm. The algorithm takes as input an Element node, referred to as the context element, which gives the context for the parser, input, a string to parse, and an optional boolean allowDeclarativeShadowRoots (default false). It returns a list of zero or more nodes.

Aug 12 '24 12:08 sirreal

@sirreal: upon further review I feel like this is not appropriate for self-contained and void elements, so I am rejecting them. there's never a reason, I think, to create a fragment parser on an element which can contain no child nodes (other than text, potentially).

One case I was specifically thinking through is a <script> context. The idea is simple: parse the inner HTML of a SCRIPT element. The problem is when </script> appears and there is more content. In the DOM you can actually create a text node containing </script> but this is not possible in the HTML. The spec even notes situations like this where the DOM can create trees that cannot be expressed in HTML.

So I think that long-term the only use for the fragment parser outside of the usual is when setting inner HTML, and in that case, it seems more fitting for that function to call out something special for elements like SCRIPT so that it falls into the equivalent of ->skip_script_data().

Apart from this I don't like exposing a token as the context node because that's cumbersome for people to write, and I don't expect people to want to go through the hassle of what that implies. Our tokens don't even contain attribute references, so this is not currently workable.

However, it does potentially solve some issues around namespacing. There's no way in this function's argument to specify a context node in the svg or math namespace.

Aug 21 '24 02:08 dmsnell

I'll share some context from a conversation.

As mentioned, it likely doesn't make sense to create a fragment parser in order to modify these "self-contained" tags and better interfaces can be provided.

The specification for fragment parsers are always created from another document. This should be easy to do in the HTML API. However, most of the time WordPress is inspecting or modifying HTML snippets without considering their context. Rarely are full HTML documents considered. In order to continue supporting this behavior, we need to be able to create fragments without a parent document. This is where most of the difficulty arises. Most of the time, <body> provides a suitable context, but there could be other cases like <head>. Again, this seems relatively straightforward to support. However, it gets more complicated with things like <svg> as a context element, or especially <rect>, or even trickier a script tag in the SVG namespace!

The context element will be used to establish the correct insertion mode and the context element's namespace and possibly attributes determine how tokens should be handled (using rules for foreign content or one of the HTML insertion modes).

It may make sense to maintain this interface as-is. <body> is likely a good context for most use-cases but in order to properly support fragments some advanced interface likely needs to support:

Passing a context element along with its tag name, namespace, and attributes.
The context element's document determines the quirks mode of the fragment parser.

Aug 22 '24 13:08 sirreal

in order to properly support fragments some advanced interface likely needs to support…

I had a thought that may be helpful.

I'd like to add an instance method to the HTML processor like create_fragment_parser_at_node( string $html_fragment ). This method would return a new HTML processor, created as a fragment parser with the context element created from the current tag opener (or null). This would allow use to set the document compat type and the context element and its attributes correctly according to the specification.

Once we have this method, advanced fragment parser creation could be handled by that method:

$full_parser = WP_HTML_Processor::create_full_parser( '<!DOCTYPE html><math><annotation-xml encoding="text/html">' );
$full_parser->next_tag( 'ANNOTATION-XML' );
$fragment_parser = $full_parser->create_fragment_parser_at_node( `<h1>Who knows what happens here?' );
$fragment_parser->next_tag( 'H1' );
// …

I think internally this method would be used for things like set_inner_html that require a fragment parser.

Aug 26 '24 19:08 sirreal

See #7348

Sep 13 '24 11:09 sirreal

https://github.com/WordPress/wordpress-develop/pull/7777 (on top of https://github.com/WordPress/wordpress-develop/pull/7348) adds the ability for advanced fragments, for example WP_Html_Processor::create_fragment( '<rect/>', '<svg>' ) will do what you expect 🙂

Nov 12 '24 15:11 sirreal

https://github.com/WordPress/wordpress-develop/pull/7777 is open for review. That would supersede this work.

Nov 21 '24 19:11 sirreal

https://core.trac.wordpress.org/changeset/59467 / https://github.com/WordPress/wordpress-develop/pull/7777 have landed.

I think this PR can be closed.

Nov 27 '24 15:11 sirreal