gutenberg icon indicating copy to clipboard operation
gutenberg copied to clipboard

WP_HTML_Tag_Processor: Allow non-attribute lexical updates

Open ockham opened this issue 2 years ago • 6 comments

What?

Allow updating of arbitrary HTML (rather than just tag attributes) by derived classes by making $lexical_updates protected, and adding some required logic.

Why?

This is needed in order to be able to implement methods like set_content_inside_balanced_tags(). See #47036 and https://github.com/WordPress/gutenberg/pull/46680#discussion_r1065168298 for more context.

How?

As noted elsewhere:

[W]hile $attribute_updates is keyed by ("comparable", i.e. lowercase) attribute name, those keys are only relevant for set_attribute() (to check if an attribute of the same name already exists). They are however entirely ignored by apply_attributes_updates -- which is why we can use that mechanism for set_content_inside_balanced_tags without much modification 🎉

This PR thus:

  • Re-introduces $attribute_updates (after #47053 has renamed them to $lexical_updates);
  • Introduces a attribute_updates_to_lexical_updates() method (inspired by class_name_updates_to_attribute_updates()).
  • Runs that before apply_lexical_updates().
  • Makes $lexical_updates protected.

These changes only affect existing private methods and members, so we're not breaking any APIs.

TODO

With "arbitrary" lexical updates, there's more potential for collisions:

  • [x] There's a chance that existing bookmarks become invalid -- e.g. when an update overwrites an existing tag that had a bookmark attached. We thus need to add some logic to apply_lexical_updates that invalidates bookmarks, if necessary.
  • [ ] ~Even worse, there's a chance that lexical updates themselves collide. Imagine that in the aforementioned case, we have an attribute update for the tag that gets overwritten by the lexical update. The only way to avoid this might be to flush attribute updates before adding the lexical update (see).~ Edit: Or maybe this isn't really a problem after all, see https://github.com/WordPress/gutenberg/pull/47068#discussion_r1067427108.

Testing Instructions

See unit tests.

ockham avatar Jan 11 '23 13:01 ockham

Flaky tests detected in 41a3919d9ddef4bf478107882565bac41a129a2a. Some tests passed with failed attempts. The failures may not be related to this commit but are still reported for visibility. See the documentation for more information.

🔍 Workflow run URL: https://github.com/WordPress/gutenberg/actions/runs/4007383888 📝 Reported issues:

  • #39787 in specs/editor/various/multi-block-selection.test.js
  • #39782 in specs/editor/various/block-editor-keyboard-shortcuts.test.js

github-actions[bot] avatar Jan 11 '23 14:01 github-actions[bot]

As I've now implemented bookmark invalidation for bookmarks that are "overwritten", I'll tentatively open this for review.

Note that since the method itself is private, I haven't added any unit test coverage here. Instead, please refer to #47036, where set_content_inside_balanced_tags is implemented using the add_lexical_update method from this PR, and I've added some unit test coverage for set_content_inside_balanced_tags in that PR.

ockham avatar Jan 19 '23 17:01 ockham

The changes look good to me. Let's observe how we end up using this API – same as Dennis I am worried about the possibility of overlapping updates. I'll leave the final approval to @dmsnell as he clearly gave this code more though.

adamziel avatar Jan 20 '23 12:01 adamziel

The changes look good to me. Let's observe how we end up using this API – same as Dennis I am worried about the possibility of overlapping updates.

Thank you, Adam! Yeah, that was my major point of concern as well -- and the ultimate goal for this PR to avoid them! 😅 Per this convo, I've tried to ascertain that mostly by adding test coverage to #47036 😊 LMK if you can think of any other scenarios that aren't covered by those tests 😄

ockham avatar Jan 23 '23 14:01 ockham

What does the addition of lexical_updates and attribute_updates buy us? We can already distinguish these because attribute updates have a string array key while non-attribute updates are appended with [] = , giving them numeric keys. I find the distinction a bit confusing.

It seems like we now have two PRs here: one which unlocks a private method to subclassing and one which addresses some accounting issues for bookmarks that are eliminated by updates. Would it take much work to extract the bookmarking PR? I understand that before this PR the bookmark issue isn't really an issue at all, but I don't think it would be wrong to add the safety into the system before it's needed, and the additional integer comparison checks shouldn't appreciably slow anything down, particularly since we're only ever theoretically taking the same branch until more work is made available, such as proposed in this PR

dmsnell avatar Jan 26 '23 18:01 dmsnell

What does the addition of lexical_updates and attribute_updates buy us? We can already distinguish these because attribute updates have a string array key while non-attribute updates are appended with [] = , giving them numeric keys. I find the distinction a bit confusing.

For context, this PR originally also contained a add_lexical_update() method, which I removed per Adam's suggestion 😊

It seems like we now have two PRs here: one which unlocks a private method to subclassing and one which addresses some accounting issues for bookmarks that are eliminated by updates. Would it take much work to extract the bookmarking PR? I understand that before this PR the bookmark issue isn't really an issue at all, but I don't think it would be wrong to add the safety into the system before it's needed, and the additional integer comparison checks shouldn't appreciably slow anything down, particularly since we're only ever theoretically taking the same branch until more work is made available, such as proposed in this PR

I can file a PR to extract the bookmarking logic 👍

ockham avatar Jan 30 '23 13:01 ockham

Rebased, now that the bookmark invalidation PR (https://github.com/WordPress/gutenberg/pull/47559) has been merged.

ockham avatar Feb 21 '23 16:02 ockham

with $this->lexical_updates already being protected, is this required now? I'm still wondering what separating attribute from lexical updates gain us that we can't already do, apart from needing to check in two places for an update rather than one.

also, how do you feel about moving this PR into WordPress/WordPress-develop?

dmsnell avatar Feb 21 '23 21:02 dmsnell

Now I came around again @ockham @dmsnell .

The name lexical_update suggests the ability to apply a diff to any part of the markup. However, WP_HTML_Tag_Processor assumes that diffs are applied only to the currently processed tag.

I've been using this PR for the last week and struggled to:

  • Insert a tag or text before the currently processed tag
  • Replace the current tag's outer HTML

I'd add support for these use-cases before merging this PR. Otherwise, we're replacing a name that reflects the limitations reasonably well (attribute updates) with one that doesn't.

adamziel avatar Mar 02 '23 11:03 adamziel

suggests the ability to apply a diff to any part of the markup

It shouldn't suggest this. It should only suggest that at the point we have the diff it's purely lexical, based on the text and void of semantic information about the structure of the document.

WP_HTML_Tag_Processor assumes that diffs are applied only to the currently processed tag

Does it though? It doesn't in apply_attribute_updates()? It does overlook diffs inside of seek(), which seems like it would be a problem, but one we could also resolve with the same bookmark delta/shifting algorithm employed in apply_attribute_updates().


For both of the situations you listed I kind of envisioned we could create a zero-width bookmark before a tag starts and use that as a reference. I have not tested this out in practice, but that bookmark wouldn't interfere with any syntax and it wouldn't be affected by set_outer upates.

dmsnell avatar Mar 02 '23 18:03 dmsnell

with $this->lexical_updates already being protected, is this required now?

Probably not anymore!

I'm still wondering what separating attribute from lexical updates gain us that we can't already do, apart from needing to check in two places for an update rather than one.

Yeah, it would've been needed mostly if $lexical_updates were private.

There's one more reason I find somewhat compelling: For the update handling logic, the update array keys (which happen to be attribute names) are purely circumstantial. This PR basically reflects that updates can exist without such keys, and that attribute updates are basically just one class of such updates.

also, how do you feel about moving this PR into WordPress/WordPress-develop?

Happy to -- but would you say it's even still worthwhile pursuing? 😊

ockham avatar Mar 06 '23 10:03 ockham

suggests the ability to apply a diff to any part of the markup

It shouldn't suggest this. It should only suggest that at the point we have the diff it's purely lexical, based on the text and void of semantic information about the structure of the document.

Yeah, that's how I read "lexical" as well 👍😊

WP_HTML_Tag_Processor assumes that diffs are applied only to the currently processed tag

Does it though? It doesn't in apply_attribute_updates()? It does overlook diffs inside of seek(), which seems like it would be a problem, but one we could also resolve with the same bookmark delta/shifting algorithm employed in apply_attribute_updates().

For both of the situations you listed I kind of envisioned we could create a zero-width bookmark before a tag starts and use that as a reference. I have not tested this out in practice, but that bookmark wouldn't interfere with any syntax and it wouldn't be affected by set_outer upates.

Seems like a reasonable approach to me 👍

ockham avatar Mar 06 '23 10:03 ockham

the update array keys (which happen to be attribute names) are purely circumstantial

this wasn't circumstantial, because those keys were chosen so that subsequent updates would automatically replace earlier updates without us needing to track them.

that we can use numeric keys for non-attribute updates isn't entirely chance either. 😄

read "lexical"

the difference is "the ability to apply a diff to any part of the markup"

the rules are still supposed to be maintained about where an update can occur to an input document. that we've stripped those semantic rules away at the point we enqueue an update doesn't imply that those rules don't exist.

my point is that there's nothing in this class at all that should suggest that we can or should do arbitrary edits to the HTML. opening up the ability to do this is an unfortunate requirement of extending the class, and we should be on the lookout for code trying to apply diffs to "any part of the markup" and learn how we can better enforce the rules that it shouldn't be able to. never should it suggest you can change <div><p>Test</p></div> into <diTes>.

dmsnell avatar Mar 06 '23 19:03 dmsnell

my point is that there's nothing in this class at all that should suggest that we can or should do arbitrary edits to the HTML.

To me, there still is. Lexical update only has a start and an end; nothing about it communicates these semantic rules. An "attribute update" may be just a WP_HTML_Text_Replacement as well, but the name tells me specifically what it is about. With a lexical update, I need to dig in the docstrings. That's not a deal breaker, but I find it a worse DX.

That being said, we need more than attribute updates here or else we'll reimplement the same logic in Directive processor, HTML processor, etc. Perhaps this PR is for the best then – we'll likely find ourselves adjusting the flushing logic to cater more use-cases anyway. In other words, I'm fine with this name change that I think is a bit misleading because I believe the behavior will change in a way I'll find more intuitive.

never should it suggest you can change

Test

into <diTes>.

Oh I didn't mean that. I just meant changing the current <div> into <span>. I like the zero-width bookmark idea!

adamziel avatar Mar 08 '23 18:03 adamziel

if we're getting "this allows arbitrary changes" out of "lexical updates" then I think we should scrap lexical updates, but find something better than "attribute updates"

"semantic_patches" ?

I'm still struggling with why having text-based diffs implies the rules go out the window. these aren't and never were semantic updates, and from the day we created attribute updates we said "it's up to anyone creating these to maintain the semantic rules" and clearly we haven't found a good name yet to communicate that.

$lexical_updates_that_must_respect_html_syntax_boundaries

dmsnell avatar Mar 09 '23 00:03 dmsnell

Closing this PR as obsolete, now that we have https://github.com/WordPress/wordpress-develop/pull/4519.

ockham avatar Jul 10 '23 12:07 ockham