dom Side effects due to tree insertion or removal (script, iframe)

Elements that trigger a script-executing side effect due to insertion or removal need to do so using some kind of queue if we ever want to remove mutation events fully (see #305). In particular, I think it's important that tree mutation succeeds and that any script is executed afterwards (but before custom element reactions, see https://github.com/web-platform-tests/wpt/pull/20660).

I'd like to use this thread for discussion, since currently it's spread among various places (https://github.com/whatwg/dom/issues/575, https://github.com/whatwg/dom/pull/732, https://github.com/whatwg/html/pull/4354, https://github.com/whatwg/html/issues/4611, https://github.com/whatwg/html/issues/4965).

Chrome and Firefox have this model for script insertion. (Though there are differences, if two scripts are inserted and the first removes the second, Firefox will execute the second, Chrome will not. They also react differently to changes to children of the script element, with Chrome only caring about insertions there.)

Chrome has this model for iframe insertion as well, but not removal. Firefox only has it for removal (insertion cannot trigger script in Firefox as it does not have a synchronous load event; note that this means that Chrome and Firefox create nested browsing contexts at different times).

Firefox has this model for style insertion, but since style cannot trigger script I don't think we should follow that.

@nox @bzbarsky @tkent-google @rniwa thoughts? Do we all share the overall goal of executing script at a "safe" point? How would you like to see this broken down further?

Dec 09 '19 13:12 annevk

@emilio has thought about this some as well, I think.

Dec 09 '19 14:12 bzbarsky

Gecko has a slightly different (and somewhat weird) model for <iframe>, but it has some sort of model.

Gecko loads iframes from the outermost call Document::EndUpdate, which gets called after ever DOM mutation and such. If it's safe to run script there, then it does it synchronously, otherwise it adds to the same queue <style> uses (via AddScriptRunner).

Dec 09 '19 15:12 emilio

As noted by @nox <meta http-equiv=default-style> is also deferred in Chrome (and Safari it seems). Filed https://bugzilla.mozilla.org/show_bug.cgi?id=1602476 on Firefox not handling it dynamically.

Dec 09 '19 15:12 annevk

@mfreed7 @tkent-google @rniwa @cdumez ping.

Apr 02 '20 11:04 annevk

(I delegate to @mfreed7. I'm not responsible for HTML any longer.)

Apr 03 '20 00:04 tkent-google

In WebKit, insertion and removal are different.

Newly inserted nodes are notified in two stages: insertedIntoAncestor and didFinishInsertingNodes. The former is responsible for updating internal node's states and not allowed to execute any scripts. The latter is called on a subset of nodes which may execute scripts after node insertion. A subset of elements such as HTMLScriptElement overrides it to execute scripts. Both of these things happen synchronously as we remove nodes.

During removal, we disconnect all subframes first (during which introduction of nested browsing contexts are prohibited) and then remove nodes but we don't allow any scripts to execute as a result of removing nodes once we start removing nodes. All of this happens synchronously.

Apr 07 '20 00:04 rniwa

@josepharhar

May 04 '20 21:05 mfreed7

Elements that trigger a script-executing side effect due to insertion or removal need to do so using some kind of queue if we ever want to remove mutation events fully

By some kind of queue, do you mean that they need to fire events and run script asynchronously by posting a task instead of synchronously? I wasn't familiar with mutation events before this and I'm still not super familiar with mutation observer - are mutation events synchronous and is mutation observer async?

Chrome and Firefox have this model for script insertion.

By this model, do you mean they do sync as opposed to async? Should script tags have their script executed async to when they are appended?

May 04 '20 23:05 josepharhar

No, insertion/removal runs "synchronous callbacks" for multiple elements. After those callbacks, we run another set of "synchronous callbacks" to execute script for impacted elements. There's no tasks involved. Queue is a reference to https://infra.spec.whatwg.org/#queue (a type of list).

May 05 '20 06:05 annevk

Is the goal here simply to remove mutation events? Based on tkent's doc I found in https://github.com/whatwg/dom/issues/305, it sounds like we would already have to contact sites to get them to stop using mutation events in order to change their behavior - why not skip the change in behavior and just get rid of them altogether in one step?

Chrome and Firefox have this model for script insertion. (Though there are differences, if two scripts are inserted and the first removes the second, Firefox will execute the second, Chrome will not. They also react differently to changes to children of the script element, with Chrome only caring about insertions there.)

Chrome has this model for iframe insertion as well, but not removal. Firefox only has it for removal (insertion cannot trigger script in Firefox as it does not have a synchronous load event; note that this means that Chrome and Firefox create nested browsing contexts at different times).

Firefox has this model for style insertion, but since style cannot trigger script I don't think we should follow that.

Are there any examples which would help me understand how these operations run script? Is it all just mutation events or is it something else?

Sep 10 '20 17:09 josepharhar

Are there any examples which would help me understand how these operations run script?

Inserting an inline <script> (one without src) runs the script text in it immediately, before the insert call returns.
Inserting an iframe with no src attribute set (and maybe with one set to about:blank?) fires a load event synchronously, before the call returns, in Chrome.
Removing an iframe synchronously fires beforeunload and unload events on the document inside, last I checked.

None of that is related to mutation events.

Sep 10 '20 17:09 bzbarsky

Inserting an inline

Inserting an iframe with no src attribute set (and maybe with one set to about:blank?) fires a load event synchronously, before the call returns, in Chrome.

Removing an iframe synchronously fires beforeunload and unload events on the document inside, last I checked.

Thanks for listing these examples! I tried reproducing all of them here: https://dom-side-effects.glitch.me After running it in each browser, these are my observations:

When inserting an inline <script>, the script will be executed before appendChild returns, just like you said.
When inserting an iframe which doesn't have a src attribute, chrome/safari will fire the load event before appendChild returns, whereas firefox will fire the load event after appendChild returns, just like you said.
When removing an iframe, the unload event is fired before remove() returns, just like you said.

Are there any other side effects of interest, or is that it?

For the script and unload cases, is the goal really to make script get executed after insertion and make unload fire after remove? I don't really see the benefit and I feel like we would just be breaking websites, especially since the behavior is already the same in chrome firefox and safari.

For the sync load event case, I suppose it is worth looking into changing in chrome, although I think this could be very challenging to implement and I'm afraid of breaking websites. Does anyone know if there is already a crbug open for this?

Sep 14 '20 18:09 josepharhar

Are there any other side effects of interest, or is that it?

That is one of the open questions, yes.

For the script and unload cases, is the goal really to make script get executed after insertion and make unload fire after remove?

I believe the intent is to define exactly when during the insertion or removal the side effects happen. For example, if you insert a subtree into the DOM that contains two <script> tags, we need to define when exactly those run and what the state of the DOM looks like when they do. Right now the spec is not very clear on that.

For the sync load event case

Please see https://github.com/whatwg/html/issues/4965 for previous discussion on that topic. Your fears are not unjustified, but we still need to figure out how to spec something that is web-compatible (the current spec is not, afaict) and then get everyone to converge on that behavior.

And again, if the behavior here ends up "sync load event", then you have to define exactly when it fires. Again, consider inserting a subtree with several iframes and script tags in it; the precise order of all the script executions and the DOM state they see needs to be defined. Including in the face of the executions further mutating the DOM.

Sep 15 '20 05:09 bzbarsky

Would it be possible to defer removal side effects until the next style/layout update (typically the next rendering update). This would align nicely with display: none timing and also could support cases where developers move nodes without needing to add new APIs as long as they leave the node attached for each rendering opportunity.

Dec 08 '22 21:12 flackr

iframe teardown has to be pretty much synchronous. Having a child document associated with a browsing context that's at the same time not connected to its parent has proven to be disastrous in the past. The main question for this issue is what's the state of the world if you remove multiple iframe elements at once and if you get to observe the intermediate states.

Dec 09 '22 09:12 annevk

I've discussed this a bit with Tab and at length with Anne and I've tried to compile and synthetize the UA behaviors in various scenarios.

I'm posting this before I submit the report on the current state of UAs (base on Anne's work chasing down inconsistencies), to make sure:

I didn't forget something important
We agree on the conceptual framework
We agree on the terminology

Timeline

Given a composite DOM operation (adding a fragment with several children, or setting el.textContent when el has several children) I propose we speak of:

synchronous effects, intermingled with tree operations. e.g.: running inline scripts in Safari
tightly async effects, queued and happening either just before or just after the tree manipulations but before control is given back to the JS that set the operation in motion. e.g.: mutation events (before), connectedCallback (after), ...
fully async effects, queued for a later tick. e.g.: mutation observers, layout (unless forced from JS), ...

There are several tightly async queues (both before and after). We'll say that queues that are closer in time to DOM tree manipulations are tighter.

The terminology here probably isn't optimal, because one might be tempted to say that a queue that is less tight is looser, perhaps seeding confusion... Maybe we can agree on avoiding "loose" in these conversations, and use "less tight" instead.

Timeline boundaries

Another useful concept re sync vs async would be timeline boundaries:

el.append(fragment)
[ // control passes to the browser
  [ /* mutation events queue */ ]
  [ /* tree operations */ ]
  [ /* custom element hooks queue */ ]
] //control returned to the calling code

Where the queues are async relative to DOM tree manipulations, but within the sync bounds of el.append().

Effect typology

I've also tried to categorize the nature of the effects we deal with that are either synchronous or tightly async, at least in some UAs:

DOM tree mutations
non-JS effects (i.e. things that don't cause user JS to run)
1. state changes 1) on DOM nodes (like setting button.form when inserting a button in a form, or setting contentWindow to null when removing an iframe) 2) elsewhere, like named properties on the global object
2. CSSOM operations (incl. reacting to the insertion of <meta content=alternative>)
JS effects
1. Inline scripts
2. Mutation events
3. Custom element hooks
4. iframe.onunload

AFAICT 2.x operations don't trigger JS because they either mutate internal slots, or operate on objects that are locked and can't be spied upon using setters.

I've only investigated this from JS, which makes it impossible to determine whether 2.x effects are synchronous or tightly async, with queues tighter than the JS ones. It shouldn't matter as far as WebCompat or specs are concerned, since these details are not user visible. I may refer to these as "synchronous 2.x" for convenience.

Going forward

Provided that the UAs have different behavior, it is unlikely that user code in production relies on their timelines and this gives us leeway to reorder things.

It would seem ideal to

make synchronous JS tightly async
make all 2.x (non-JS) effects synchronous or tighter than JS, such that 2.x operations appear atomic when seen from JS
agree on a common timeline across UAs
- including for reentrant behavior when tightly async JS causes another tree mutation

edit: added the bit about timeline boundaries

Feb 15 '23 09:02 pygy

Yeah agreed. While we need to define state changes, their relative order is only important to the extent that it is observable through JavaScript.

And I personally think it would be great if JavaScript could be deferred as much as possible to the end, but mutation events are a major gotcha there. As I believe there is still some ambition to remove or change the timing of mutation events, I'm not sure we want to include them as part of this work. That would increase the complexity quite a bit.

Feb 15 '23 10:02 annevk

I've noticed two pre-DOM queues: one for the DOMNodeRemoved mutation event (I've yet to investigate the others, not a priority I agree), and one for iframe.onunload callbacks in Chrome and Safari.

The latter intermixes JS callbacks and state changes (each iframe.contentWindow is set to null just after its iframe.onuload runs).

If they can't be moved post-DOM for Web compat reasons, a possible mitigation would be, when DOM operations are triggered from JS, to register every node involved in a Set, and to prevent as many mutations as possible on these nodes from the pre-DOM hooks by throwing errors. This would move us closer to transactional behavior.

I suppose the list of operations that can be safely prohibited could be determined thanks to telemetry.

Feb 15 '23 11:02 pygy

I believe the intent is to define exactly when during the insertion or removal the side effects happen. For example, if you insert a subtree into the DOM that contains two

It surprised me to read this. It's possible things in the spec have seriously changed for the better in the intervening years, but I always understood HTML/DOM to be pretty explicit about when insertion/removal side-effects happen. For example, DOM defines the insertion steps hook, which HTML overrides. Furthermore, HTML provides a per-element HTML element insertion steps hook that runs during HTML's own "insertion steps" for any element that overrides said hook. HTML also provides more granular hooks for acting on the narrow case when an element is "inserted into a document", and the wider/general case of when an element "becomes connected".

These hooks should pretty clearly define the ordering of insertion/removal side-effects, especially for simple elements that don't do a lot during these hooks. But even for complicated cases like in the OP, or with multiple iframe removals (as @annevk mentioned), I think the spec is at least clear on what it calls for.

Removing multiple iframes

From https://github.com/whatwg/dom/issues/808#issuecomment-1344051386:

The main question for this issue is what's the state of the world if you remove multiple iframe elements at once and if you get to observe the intermediate states.

I think the spec is clear on this, if for no other reason than in the intervening years the navigation and session history rewrite has landed, and follow-ups like https://github.com/whatwg/html/pull/9907 have made things even better. If you have a document with two sibling iframes, and a script that calls remove() on them back-to-back in the same task, you'd get the following flow twice. (In bold should be every place where the spec allows script to run, and thus some observation of the current state.)

ChildNode#remove() → remove algorithm
Removing steps hook
HTML element removing steps for iframes
- Destroy a document and its descendants, in parallel (asynchronously jumping back to the main thread to do things)
- Queue a "remove" mutation record on document.body with the iframe

The spec is equally clear for cases with nested iframes and so on, as calls to the "removing steps" are properly recursive. So spec-wise, I don't think there should be a way to observe the "intermediate state" at all when all removes happen synchronously with respect to each other. However, implementations seem to all disagree with the spec in the same way!

Chrome, Firefox, Safari All browsers fire unload events synchronously "near" the destroy a document and its descendants algorithm (before remove() returns). This is not only inconsistent with HTML, which doesn't do this, but is totally opposite of this note, which specifically says that unload events do not fire in this path. Tagging @domenic just in case he ran into any of this during the recent cleanups there.

This is kinda weird, but I don't think it's "broken", and FWIW all browsers are dead consistent. Here's what they can observe in the unload handler "mid-removal":

window.parent != null (i.e., window.parent.postMessage() still works)
- I can't tell if this is called for by the spec; that is, I can't tell what #dom-parent should return in this state. During child destruction, but in parallel, we clear the active SHE's document, which I think is the only thing that would make this's navigable return null, and thus window.parent return null. But that's in parallel, so I don't think the spec actually does anything to clear the parent immediately. So I think the spec is technically OK here.
window.frameElement == null
- This is covered by HTML. frameElement's nullness is dependent on whether the iframe's content navigable has a container. Yet synchronously during the iframe element removal steps (which call "destroy a child navigable), the container's content navigable is nulled out, thus severing ties between the two navigables. This should make frameElement return null ✅. The only oddity is that unload also happens to fire synchronously somewhere in the middle of "destroy a child navigable" algorithm. All browsers happen to fire unload after the container's content navigable is nulled out, which is why they are all consistent with frameElement == null.

But this unload issue seems like a scoped HTML-navigable-destruction bug, and less about the wider hooks in the overall DOM world that I would expect to block https://github.com/whatwg/html/issues/5484. If @domenic agrees I'll file something on HTML.

Inserted DocumentFragment with two scripts

Consider the example in the OP. This is the flow the spec follows (again, bold is where (queued) scripts will run):

https://dom.spec.whatwg.org/#concept-node-insert
Queue a "remove" mutation record on fragment with all its children (both scripts)
Run insertion steps on each (x2)
- Causes the HTML element insertion steps to run
- Triggers the becomes connected hook for script elements
- UAs must immediately prepare the script element
- Create a classic script
- Immediately execute the script
  - Cannot observe <script> elements that will be inserted after it
Queue an "add" mutation record on document with fragment's children (both scripts)

Only Safari follows the spec, and Chrome and Firefox behave exactly how the OP describes. So there is a clear interop issue, but the standard at least makes it clear which implementation is right vs. wrong. The question is: is the spec + Safari being aligned sufficient to simply call these "Chrome and Firefox bugs," or do we want to adjust the standard? I think the spec + Safari behavior is closer to what we all want, so maybe we can just file some Chrome + Firefox bugs, but I'd love to know what @annevk thinks.

Jan 29 '24 03:01 domfarolino

I think what I dislike about the current model is that insertion is not atomic from the perspective of script. If it works for everyone though we should add the various scenarios mentioned here and related issues to WPT and call it a day.

Jan 30 '24 13:01 annevk

I think what I dislike about the current model is that insertion is not atomic from the perspective of script.

Can you clarify exactly what you mean? Outside of a few implementation bugs mentioned above, it seems the spec is pretty clear that insertion/removal should be atomic from the perspective of script, since I don't think any overridden insertion / removal steps synchronously invoke script, right?

Edit: One obvious exception I guess would be iframe insertion, which very explicitly fires the load event during initial insertion. In general, insertion feels less sketchy to do this during though.

If it works for everyone though we should add the various scenarios mentioned here and related issues to WPT and call it a day.

Regardless, yes I agree, I'll get a PR up for adding these scenarios as WPTs and continue the discussion here if anything interesting arises.

Jan 30 '24 20:01 domfarolino

@domfarolino I mean that the insert operation hasn't completed by the time script executes (there's still another script to insert). From that perspective I like the staging that I think Gecko has. Whereby you complete insertion and then have another loop/queue for side effects.

Jan 31 '24 09:01 annevk

I don't remember if I ever followed up on this (health issues make me not as reliable as I'd like), but I remember finding many observable discrepancies in the way insertion and removal work last year.

Unless it has been homogenized in the mean time, there was room for straightening the timing in a way that seemed Web compatible. I'll try to look for what I had found and re-check anyway.

Jan 31 '24 13:01 pygy

@domfarolino I mean that the insert operation hasn't completed by the time script executes (there's still another script to insert).

I see. So you would like multiple, synchronously back-to-back insertions (of say, several <script> elements) to not invoke script until all insertions are done. I guess there is a question of what "atomic" means here. I can imagine two ways it could work:

Insertion is atomic, at a per-task level
- That is, if a single task calls document.body.append(script1); document.body.append(script2);, literally neither script executes until that task (that did the appending) is done. Then both execute in order.
- This model feels weird. If the developer is explicitly separating the insertion of the two scripts, there is no reason why script1 shouldn't execute after it has been inserted, but before the second append() is called.
Insertion is atomic, at a per-"insertion call" level
- Imagine document.body.append(script1); document.body.append(fragmentWithTwoScripts); (see https://github.com/web-platform-tests/wpt/pull/44308)
- In this model, appending script1 would call the script's insertion steps, and would not execute the script until the stack is cleared of all "insertion calls". So inside the first append() (but after the "insertion" algorithm returns), script1 would immediately run.
- Next, the document fragment with two scripts would get inserted; this puts an "insert call" on the stack. Within that call, we run the insertion steps for each node, but none of these are allowed to synchronously invoke script, because a call to "insert" is on the stack. Finally, the top-most "insert" algorithm returns, and just before append() returns all scripts that were queued from any "insertion steps" hooks are executed in order.

From that perspective I like the staging that I think Gecko has. Whereby you complete insertion and then have another loop/queue for side effects.

I have two comments on this. First, I think Gecko's model is fairly desirable, however I think Chrome's is slightly more consistent. Both are pretty much (2) above. Both implementations execute scripts after all have been appended to the DOM (so that earlier scripts can observe later scripts and remove them from the DOM). However Chromium refuses to execute a script that an earlier script removed from the DOM, which I think is consistent with our general policy of not executing scripts that have changed documents (https://github.com/whatwg/html/pull/5575).

Second, if we think (2) above is desirable, do we think we can actually change the spec + Safari to align on all of this? For <script> elements in particular, it is probably doable since 2/3 implementations pretty much already do some version of this. However generalizing it seems harder, since for at least iframes, ever browser aligns on synchronously firing the load event during the HTML element insertion steps. Changing that to be consistent with the model (2) above is almost certainly a compat issue. Thoughts?

Jan 31 '24 19:01 domfarolino

Thank you for the load event example. However, my results are different from yours. At least from what I'm seeing Chromium and Gecko don't execute the event listener until after both nodes are inserted. Could you share your test?

In any event, what we end up with has to be consistent. You cannot have some script execution be delayed, but still be able to do script execution with some other element. Consistency trumps any desire for atomic insert operations. (Atomicity per task can't really work by the way. Scripts already execute way sooner than that.)

Feb 01 '24 08:02 annevk

However generalizing it seems harder, since for at least iframes, ever browser aligns on synchronously firing the load event during the HTML element insertion steps. Changing that to be consistent with the model (2) above is almost certainly a compat issue. Thoughts?

Thank you for the load event example. However, my results are different from yours. At least from what I'm seeing Chromium and Gecko don't execute the event listener until after both nodes are inserted. Could you share your test?

Yep you are correct, the load event in Blink & Gecko indeed are consistently fired after all DOM mutations are done Thanks for confirming, that makes me feel better about this :)

@noamr has kindly helped shape up https://github.com/web-platform-tests/wpt/pull/15264 into the new https://github.com/web-platform-tests/wpt/pull/44658 for tentative tests around the model proposed here. We're planning on landing that alongside https://github.com/web-platform-tests/wpt/pull/44308, and we'll be simultaneously reviving https://github.com/whatwg/dom/pull/732 (which you also pointed me to) to try and finish this off.

Note that regarding removal, I don't think we an have the same kind of "atomicity" that we're shooting for with insertion. In the insertion case, we can defer some actions until after DOM mutations because the element is supposed to be connected at that point. Removals work differently because the removal steps run before each node actually gets removed from the DOM / disconnected. Moving the steps after all DOM mutations (for removals) would mean executing steps on disconnected nodes, which sounds like a bad idea (and the cause for many security bugs it sounds). So I think that preserving the asymmetry between insertion and removal is OK. Let me know if you think otherwise.

Feb 22 '24 15:02 domfarolino

Seems this was prematurely closed?

Mar 04 '24 20:03 domfarolino

Yeah, don't write "resolve" followed by a link to an issue in your commit message if they don't actually resolve it. (It was initially not a problem as the bot doesn't have access to this repository, but some other people including rniwa do.)

Mar 05 '24 10:03 annevk

dom dom copied to clipboard

Side effects due to tree insertion or removal (script, iframe)

Timeline

Timeline boundaries

Effect typology

Going forward

Removing multiple iframes

Inserted DocumentFragment with two scripts

dom
dom copied to clipboard