wordpress-develop icon indicating copy to clipboard operation
wordpress-develop copied to clipboard

HTML API: Handle parsing changes in foreign content.

Open dmsnell opened this issue 2 years ago • 2 comments

Trac ticket: Core-61576

Status

There is an infinite loop and some mis-nesting showing that something is missing in the implementation, but I'm not sure what it is yet.

Earlier versions

  • First draft (ba2c8896f08ee36d3c61a500a86422b736a00923)

    • By implementing the rules at the top of Tree Construction/Dispatch, we don't need $skip_next_foreign_content_processing, because we pop nodes off of the stack of open elements until we're back at an html element, or an HTML integration point, or a MathML integration point. Returning to step() will choose insertion mode instead of foreign content and prevent an infinite loop.
  • Second draft (3e4bb4d011100c3a5641ebffd1beb09b0d808f15)

    • In this draft the attribute and tag names are properly cased, but it throws off the logic mechanisms. Since attributes and tags are still parsed according to HTML rules (ASCII-case-insensitive), it makes sense to continue reporting them in all caps. Perhaps a new method such as get_token_display_name() would provide the remapping.

Notes

  • There's an interplay we have to figure out with casing of tag names. For all of our comparison functions we check the upper-case variants, but there's value in reporting the foreign content tag names as lower-cased or as mixed case.
  • There are some SVG tag names that are supposed to report in mixed case. For both SVG and MathML there are attributes which are supposed to report in mixed case. Unlike my first assumption, however, they still all parse with ASCII case insensitivity. That is, <circle clippath=1 clipPath=2 CLIPPATH=3 cliPPath=4> has a single attribute named clipPath and whose value is 1. This is convenient because we don't have to change the Tag Processor, but inconvenient when things are mostly based on lower-case attribute names.
  • We probably need to repeal the idea that tag names are upper-case and add another bit to communicate tag name vs. doctype declaration. Maybe? Can we test that math:mi or math:MI is in the math namespace? There can be no HTML tag named math:mi (it would be MATH:MI).
  • Could be handy to have something like base_class_get_tag() or private function comparison_tag_name() etc… to report an upper-case tag name, while preserving the case-variants required in foreign content to outside calls.
  • Tracking the namespace seems a bit fishy. When we're in the Tag Processor we can know that if we encounter an SVG or MATH tag when we're in the html namespace, that it should change the namespace to svg or math, respectively, and lower-case the tag names. However, the role of integration points and parsing things in the insertion mode is still vague.
  • Foreign content is not an insertion mode. This is important when handling HTML tags inside HTML integration points. This is very important for get_modifiable_text() which doesn't know if a text node inside foreign content is being processed as foreign content or in the insertion mode, where it determines if NULL bytes should be replaced or removed.
  • It's probably necessary to start over again with these insights and more in mind.

Description

We should reliable detect foreign content and we need to do it in the Tag Processor, specifically because of the rules for CDATA sections. The HTML Processor needs this as well to determine if things like self-closing flags for HTML elements should be respected.

- Tests: 1525, Assertions: 3277, Skipped: 233.
+ Tests: 1502, Assertions: 3312, Errors: 13, Failures: 20, Skipped: 157.

Unlocks:

  • proper detection of CDATA
  • parsing of SVG and MathML
  • indicating if one should expect a closing tag or not
Screenshot 2024-02-01 at 9 49 43 PM

Screenshot 2024-02-02 at 12 13 47 AM

dmsnell avatar Feb 01 '24 20:02 dmsnell

The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the props-bot label.

Core Committers: Use this line as a base for the props when committing in SVN:

Props dmsnell, jonsurrell, westonruter.

To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook.

github-actions[bot] avatar Feb 01 '24 20:02 github-actions[bot]

Test using WordPress Playground

The changes in this pull request can previewed and tested using a WordPress Playground instance.

WordPress Playground is an experimental project that creates a full WordPress instance entirely within the browser.

Some things to be aware of

  • The Plugin and Theme Directories cannot be accessed within Playground.
  • All changes will be lost when closing a tab with a Playground instance.
  • All changes will be lost when refreshing the page.
  • A fresh instance is created each time the link below is clicked.
  • Every time this pull request is updated, a new ZIP file containing all changes is created. If changes are not reflected in the Playground instance, it's possible that the most recent build failed, or has not completed. Check the list of workflow runs to be sure.

For more details about these limitations and more, check out the Limitations page in the WordPress Playground documentation.

Test this pull request with WordPress Playground.

github-actions[bot] avatar Feb 01 '24 20:02 github-actions[bot]

@sirreal @westonruter I've pushed some updates that incorporate trunk and a few other fixes. notably, I had the wrong code for adjusted_current_node().

dmsnell avatar Jul 31 '24 21:07 dmsnell

This is very tricky. I'll share some cases that are problematic right now:

Update: I've debugged all of these cases and pushed some changes to fix them.

Fixed

<svg><a/>Text after "svg:a" self closing tag.

Errors with Cannot run adoption agency when "any other end tag" is required.

<svg></svg><![CDATA[ bogus comment ]]>

svg :svg ! CDATA  bogus comment

<svg><foreignObject><![CDATA[ bogus comment ]]>

svg foreignObject ! CDATA  bogus comment

<svg><title><div>

svg title div

<svg><title><rect><div>

svg title rect div

<svg><title><svg><div>

svg title svg div

<math><mi><div></div></mi><mi>

math mi div :div :mi mi

<math><mi><svg><foreignObject><div>

math mi svg foreignObject div

<svg><foreignObject>\x00filler\x00text (null bytes)

svg foreignObject x00filler x00text

<svg></br><foo>

svg :br foo

<math><mtext><b>

math mtext b

sirreal avatar Aug 01 '24 09:08 sirreal

I've made some progress on this fixing some of the issues mentioned above: https://github.com/WordPress/wordpress-develop/pull/6006#issuecomment-2262585496

Right now some self-closing tags in foreign content break. HTML like <svg><a/> seems to enter step_in_body, runs the adoption agency algorithm, and fails:

WP_HTML_Unsupported_Exception: Cannot run adoption agency when "any other end tag" is required. in /var/www/html/wp-includes/html-api/class-wp-html-processor.php:450
Stack trace:
#0 /var/www/html/wp-includes/html-api/class-wp-html-processor.php(5145): WP_HTML_Processor->bail('Cannot run adop...')
#1 /var/www/html/wp-includes/html-api/class-wp-html-processor.php(2423): WP_HTML_Processor->run_adoption_agency_algorithm()
#2 /var/www/html/wp-includes/html-api/class-wp-html-processor.php(4108): WP_HTML_Processor->step_in_body()
#3 /var/www/html/wp-includes/html-api/class-wp-html-processor.php(889): WP_HTML_Processor->step_in_foreign_content()
#4 /var/www/html/wp-includes/html-api/class-wp-html-processor.php(635): WP_HTML_Processor->step()```

sirreal avatar Aug 05 '24 15:08 sirreal

Merged in [58867] https://github.com/wordpress/wordpress-develop/commit/de084d7d0e302027fb8f0a99bbeedf88e079efee

dmsnell avatar Aug 08 '24 07:08 dmsnell