HTML API: Reliably parse HTML in `wp_html_split()`.

Open dmsnell opened this issue 6 months ago • 5 comments

Trac ticket: Core-63694
Replaces #6651
See: (#9270), #9850, #9851

Status

  • [ ] This needs a new ticket for the 7.0 release.
  • [ ] Some of the unit tests can and should be updated separately.
  • [ ] Figure out why the test case is failing and fix it.

Design feedback

  • Core has previously considered HTML like `<[[gallery]]>` to be an escaped shortcode inside an HTML tag, but HTML considers it plaintext instead of a tag (because the first character after the opening `<` is not a letter).
    • To match Core's existing behavior we can special-case text nodes which look like tags, but should we? This comes up in shortcode processing, which decides not to replace shortcodes inside tags. So the ultimate question is: (a) is this a shortcode inside a tag, to be ignored, or (b) is this a shortcode inside a text node?
    • HTML provides the second answer (b). WordPress’ answer is contextual.
      • If it were `<[gallery]>` and the `[gallery]` shortcode translated into a tag name, then this entire thing would become a tag on replacement.
      • If it translated into a non-tag-name, however, the replacement would remain plaintext.
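
To make the contextual case concrete, here is a minimal standalone sketch in Python. It is not WordPress code and not the implementation in this PR. `starts_tag()` and `expand()` are hypothetical helpers: the first applies only the rule that a `<` opens a start tag when the next character is an ASCII letter, and the second is a naive stand-in for shortcode replacement. The replacement values `div` and `123` are made up for illustration.

```python
import string


def starts_tag(html, at=0):
    """Return True if the `<` at html[at] is followed by an ASCII letter.

    This mirrors only the start-tag rule discussed above: `<` followed by
    anything other than an ASCII letter does not open a start tag.
    """
    if at >= len(html) or html[at] != "<":
        return False
    following = html[at + 1 : at + 2]
    return following != "" and following in string.ascii_letters


def expand(html, shortcodes):
    """Naive stand-in for shortcode replacement (hypothetical helper)."""
    for name, output in shortcodes.items():
        html = html.replace("[" + name + "]", output)
    return html


source = "<[gallery]>"

# Before replacement the input is plaintext: `[` is not a letter.
before = starts_tag(source)

# If the shortcode expands to a tag name, the result is a tag.
as_tag_name = expand(source, {"gallery": "div"})

# If it expands to a non-tag-name, the result remains plaintext.
as_non_tag_name = expand(source, {"gallery": "123"})

print(source, before)
print(as_tag_name, starts_tag(as_tag_name))
print(as_non_tag_name, starts_tag(as_non_tag_name))
```

The same input therefore changes category depending on what the shortcode produces, which is why the answer cannot be decided by looking at the unexpanded HTML alone.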

Implementation

This probably improves performance, in both CPU time and memory use, over the old PCRE-based approach.

dmsnell • Jul 15 '25 16:07