Discuss out-of-memory etc. agent cluster killing, especially for shared workers?
What is the issue with the HTML Standard?
Currently, the HTML Standard does not really discuss the fact that operating systems or browser implementations can kill processes, e.g. due to memory pressure. (The closest I can find is the killing scripts section.)
@yoshisatoyanagisawa is especially concerned about the case of implementing shared workers on Android, which has an OS-level OOM-killer. Under memory pressure, it would be very possible for a shared worker to be OOM-killed, while its clients are left alive. This could be confusing for developers.
My perspective is that this is kind of fine. Under memory pressure, I am sure many things get messed up. Your service worker might die. Your shared worker might die. Other Window clients you were communicating with via BroadcastChannel or storage APIs might die. But maybe this is something we should make explicit in the spec.
We could also use this opportunity to ask browsers to kill entire agent clusters at once, rather than killing single agents within an agent cluster. If I recall correctly, this was one of the original motivations behind the design of agent clusters, but it was never specified.
Alternately, if people believe that a shared worker spontaneously dying would be very bad, we could discuss more dramatic models, such as attempting to require that implementations kill all clients at the same time as they kill the shared worker. I am doubtful that this is a good idea though.
I'm especially curious if folks from WebKit and Gecko, which already implement shared workers on mobile devices, have any perspectives here. IIRC @asutherland is one person to tag here.
#2581 does not apply here because presumably we'd destroy the entire agent cluster here. Not just the shared worker, but also any dedicated workers it might own (including those it owns indirectly).
#4901 is relevant here. This is a more tightly scoped variant of that issue.
@youennf and @achristensen07 might have thoughts.
Is this something https://wicg.github.io/page-lifecycle/ should tackle?
> Is this something https://wicg.github.io/page-lifecycle/ should tackle?
I'm not sure, but my initial impression is that specification seems to be about more-graceful unloading of individual agents within a window agent cluster, which is related but in the end pretty different.
I'd very much like to see SharedWorker on Android, but it does feel like allowing SharedWorkers to be killed more frequently than a document that is visible and using the SharedWorker is a potential interop problem.
Can someone describe more about the Android OS constraints? Why is the SharedWorker in a separate process from the same-origin documents that are using it? Can we use a priority inheritance scheme where the SharedWorker process gets its priority set to the same value used by any connected, visible document processes?
FWIW, processes getting killed due to memory pressure is something that happens on iOS as well.
See Don't Remember Panicking Stage 1 Update (keynote slides, pdf slides), to be presented as a Stage 1 update at the TC39 plenary next week. I believe my presentation will address many of these issues. It proposes adding a HostFaultHandler host hook through which browsers and other hosts can express policy about how they handle such faults.
> Can someone describe more about the Android OS constraints? Why is the SharedWorker in a separate process from the same-origin documents that are using it?
I touched on this scenario in https://issues.chromium.org/issues/409441962, but I have not tried it on Android yet, and the behavior may differ there.
However, I understand it can be reproduced in the following way:
- Create a SharedWorker from a window (Window 1).
- Open another window (Window 2) with the same URL as Window 1; it can be allocated to a different renderer process.
- Have Window 2 create a SharedWorker with the same URL and name as the one created by Window 1; the SharedWorkerClient in Window 2 will then connect to the SharedWorker created by Window 1.
In this case, Window 2 is same-origin but in a different renderer process.
> Can we use a priority inheritance scheme where the SharedWorker process gets its priority set to the same value used by any connected, visible document processes?
I think it is technically possible to set a higher priority than the connected documents, to ask the system to avoid killing the SharedWorker before the documents, but I doubt it helps. Note that we expect the SharedWorker to stay alive while connected documents exist, so I believe we should set a priority higher than theirs.
I heard that the Chromium renderer process priority is capped by the browser process priority. Therefore, if Chromium goes to the background, the browser process gets a background priority, and so do the renderer processes. Then, technically, all the renderer processes, including the one hosting the SharedWorker, can be killed regardless of their priority. In this case, the renderer process with the SharedWorker can be killed before the connected documents.
I know @kawasin73 is an expert in this area. I would like to hear his thoughts.
Android controls the kill priority of service processes (i.e. renderer processes) through service binding flags, which give 4 tiers of bucketing (Context.BIND_IMPORTANT > default flag > Context.BIND_NOT_PERCEPTIBLE > Context.BIND_WAIVE_PRIORITY). Priority within the same bucket is (mostly) LRU order of service binding timing changes. Android has LMKD (Low Memory Killer Daemon), which kills processes in kill priority order under memory pressure, on its own timing.
Each service binding flag is translated into an Android internal priority (a.k.a. oom_score_adj):
- Context.BIND_IMPORTANT => PERSISTENT_SERVICE_ADJ (= -700) or 0 or 1
- default flag => VISIBLE_APP_ADJ (= 100)
- Context.BIND_NOT_PERCEPTIBLE => PERCEPTIBLE_LOW_APP_ADJ (= 250)
- Context.BIND_WAIVE_PRIORITY => no priority boosting (= 900 or more)
Chrome on Android uses the 4 tiers of bucketing to communicate the kill order of renderer processes. For example, visible tabs and tabs with a media stream use Context.BIND_IMPORTANT, while renderer processes for visible iframes (i.e. not the main frame) and renderer processes with a foreground service worker use the default flag, which leads to VISIBLE_APP_ADJ. Most background tabs use Context.BIND_WAIVE_PRIORITY. The conditions are implemented in ChildProcessLauncherHelperImpl.setPriority().
As @yoshisatoyanagisawa says, the renderer processes' oom_score_adj is capped by the browser process. When the browser goes to the background and gets the cached priority (= 900 or more), all renderer processes can end up with the same oom_score_adj. Processes with the same oom_score_adj are supposed to be sorted in LRU order, but due to several bugs and other reasons, the LRU order is not 100% reliable from Chrome's point of view.
So basically, Chrome uses the 4 tiers of bucketing to communicate the kill order to the Android OS. Chrome also tries to communicate the kill order between processes in the same bucket, but this is on a best-effort basis and not reliable. Also, when the browser goes to the background, the kill priority is not reliable.
This is unlike Chrome on desktop platforms, where Chrome itself kills processes on memory pressure; Chrome on Android does not kill any processes on memory pressure, only LMKD does. On Android, by design, there is actually no API for the OS to signal memory pressure to applications. (Android had onTrimMemory(), but it is deprecated for memory-pressure use.)
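As a rough illustration of the tiering described above, the following TypeScript sketch models an idealized kill order. The oom_score_adj values are the ones quoted in this comment; the process names, the lastBindingChange timestamps, the tier assigned to the worker host, and the reading of "capped by the browser process" as a simple max() are all assumptions made for illustration, not Chromium or Android code.

```typescript
// Binding tiers named in the thread, with the oom_score_adj values quoted there.
// A higher oom_score_adj means the process is killed earlier under memory pressure.
type BindingTier =
  | "BIND_IMPORTANT"
  | "DEFAULT"
  | "BIND_NOT_PERCEPTIBLE"
  | "BIND_WAIVE_PRIORITY";

const TIER_ADJ: Record<BindingTier, number> = {
  BIND_IMPORTANT: 0, // the thread lists -700, 0, or 1; 0 is picked for illustration
  DEFAULT: 100, // VISIBLE_APP_ADJ
  BIND_NOT_PERCEPTIBLE: 250, // PERCEPTIBLE_LOW_APP_ADJ
  BIND_WAIVE_PRIORITY: 900, // no priority boosting ("900 or more")
};

interface Renderer {
  name: string; // hypothetical label
  tier: BindingTier;
  lastBindingChange: number; // hypothetical logical timestamp; larger = more recent
}

// Assumption: "capped by the browser process" is modeled as a renderer never
// being more important than the browser, i.e. its adj is at least the browser's.
function effectiveAdj(r: Renderer, browserAdj: number): number {
  return Math.max(TIER_ADJ[r.tier], browserAdj);
}

// Idealized kill order: highest adj first, then least recently changed first.
function killOrder(renderers: Renderer[], browserAdj: number): string[] {
  return [...renderers]
    .sort(
      (a, b) =>
        effectiveAdj(b, browserAdj) - effectiveAdj(a, browserAdj) ||
        a.lastBindingChange - b.lastBindingChange,
    )
    .map((r) => r.name);
}

const renderers: Renderer[] = [
  { name: "visible-tab", tier: "BIND_IMPORTANT", lastBindingChange: 3 },
  // Hypothetical: the tier for a SharedWorker host is not stated in the thread.
  { name: "shared-worker-host", tier: "DEFAULT", lastBindingChange: 1 },
  { name: "background-tab", tier: "BIND_WAIVE_PRIORITY", lastBindingChange: 2 },
];

// Browser in the foreground: the tiers decide the order.
const foregroundOrder = killOrder(renderers, 0);
// Browser in the background (cached priority): every renderer lands at 900.
const backgroundOrder = killOrder(renderers, 900);
```

Under this model, backgrounding the browser collapses every renderer into the same bucket, so the worker host can be picked before a tab that depends on it. Real LMKD behavior is less predictable than this, as noted above.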
Am I understanding correctly that this problematic case only happens on Android when the browser is in the background? In other words, none of the connected documents are viewable by the user? Would it be possible to detect that the SharedWorker process was OOM-killed and force-kill or reload any attached docs? It seems plausible that any of the tabs could have been OOM-killed while backgrounded.
I am not confident, but if the only unavoidable case is Chromium being in the background, I feel @wanderview's suggestion is valid. I would like @kawasin73 to verify.
I am not sure if there is a way to detect that the SharedWorker process was killed when coming back from the background, but if there is, it might be possible to enumerate the clients and kill the associated renderers at that point.
> I am not confident, but if the only unavoidable case is Chromium being in the background, I feel @wanderview's suggestion is valid. I would like @kawasin73 to verify.
From what I heard from @nyaxt, Android process management is much harder to predict than I thought. When I wrote the previous comment, I thought a low-priority process would always be killed before a high-priority process. However, I heard this is not guaranteed. Considering our usage, @wanderview's suggestion does not sound acceptable. It might increase the possibility of unexpected renderer kills (i.e. page crashes while an end user expects the page to be available).
Therefore, I guess the feasible option is not to guarantee that a SharedWorker is persistent. This may have been suggested early in this issue.
Reading @domenic's proposal for extended-life SharedWorker made me wonder if we could add a { restartable: true } option to SharedWorker. This would opt in to allowing the browser to restart the SharedWorker if it was killed by the OS, etc. Then maybe restartable SharedWorkers could be supported on Android, while normal SharedWorkers would not be.
Could you please elaborate further on the restartable flag?
Specifically, what happens if a SharedWorker is killed while the restartable flag is disabled?
Additionally, is it considered reasonable to send an error message to the client's error handler when a SharedWorker is killed?
The idea would be that if a window is attached to a SharedWorker that is killed because the OS killed a different process, then the browser would be free to restart the SharedWorker in the window's current process. Yeah, it would probably need some lifecycle events. This is just a brainstorming idea. I haven't really tried to fully design it.
Thanks @wanderview for the detailed explanation. Let me write down the scenario to explain my understanding:
1. Window A in Process 1 creates SharedWorker I.
2. Window B in Process 2 executes new SharedWorker() to connect to SharedWorker I.
3. Window C in Process 3 executes new SharedWorker() to connect to SharedWorker I.
4. Window A is closed, while SharedWorker I keeps running in Process 1.
5. The operating system kills Process 1 for some reason, destroying the SharedWorker I instance.
6. If restartable is enabled, SharedWorker II starts in Process 2, and Window B's SharedWorker client connects to it.
I came up with two questions:
Q1. What happens in Step 6 if restartable is not enabled?
Q2. When restartable is enabled, where does the SharedWorker client in Window C connect?
For Q1, I suppose an error handler would be notified of the SharedWorker instance's death, but no new instance would be created. In this context, restartable: false means the SharedWorker has to be recreated manually.
For Q2, if we follow the SharedWorker concept, making Window C connect to SharedWorker II in Process 2 might make more sense. However, considering the use case explained in https://github.com/whatwg/html/issues/11205, creating yet another SharedWorker in Process 3 might also make sense. Then what we want might not be restartable: true but isolated: true, to create a SharedWorker instance per window or per process. (I have mixed feelings about this isolated SharedWorker concept.)
> Q1. What happens in Step 6 if restartable is not enabled?
I was thinking that on Android, Chromium could throw an exception if restartable: false. Basically, only restartable SharedWorkers would be supported on Android. Non-restartable would be like the status quo today. Not ideal, but it would be incrementally better.
> Q2. When restartable is enabled, where does the SharedWorker client in Window C connect?
I was thinking all windows would reconnect. The browser would have to pick a process that is colocated with one of the windows to restart the SharedWorker. In the example, if B is then killed the restarting process could happen again to move to C.
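To make the brainstormed semantics concrete, here is a small TypeScript model of the scenario above. It is only a sketch: restartable is a proposed option rather than a shipped API, and the policy of restarting in the first surviving client's process, along with every name in the code, is an assumption for illustration.

```typescript
interface Client {
  window: string;
  process: number;
}

interface WorkerState {
  generation: number; // 1 = "SharedWorker I", 2 = "SharedWorker II", ...
  hostProcess: number;
  clients: Client[];
}

// Hypothetical policy: when the host process dies, clients in that process die
// with it, and the worker restarts in the process of the first surviving client.
// Returns null when the worker is gone for good.
function onProcessKilled(
  state: WorkerState,
  killedProcess: number,
  restartable: boolean,
): WorkerState | null {
  const survivors = state.clients.filter((c) => c.process !== killedProcess);
  if (state.hostProcess !== killedProcess) {
    // The worker itself is unaffected; it only loses some clients.
    return { ...state, clients: survivors };
  }
  const [firstSurvivor] = survivors;
  if (!restartable || firstSurvivor === undefined) {
    return null;
  }
  return {
    generation: state.generation + 1,
    hostProcess: firstSurvivor.process,
    clients: survivors, // all surviving windows reconnect to the new instance
  };
}

// State after step 4: Window A is closed, and SharedWorker I runs in Process 1.
const initial: WorkerState = {
  generation: 1,
  hostProcess: 1,
  clients: [
    { window: "B", process: 2 },
    { window: "C", process: 3 },
  ],
};

const afterFirstKill = onProcessKilled(initial, 1, true);
const afterSecondKill = afterFirstKill && onProcessKilled(afterFirstKill, 2, true);
const withoutRestart = onProcessKilled(initial, 1, false);
```

In this model, killing Process 1 yields SharedWorker II in Process 2 with both B and C connected, and killing Process 2 afterwards moves the worker to Process 3 with only C connected. Without restartable, the worker simply stays dead.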
> Q1. What happens in Step 6 if restartable is not enabled?
> I was thinking that on Android, Chromium could throw an exception if restartable: false. Basically, only restartable SharedWorkers would be supported on Android. Non-restartable would be like the status quo today. Not ideal, but it would be incrementally better.
In other words, users should be aware that SharedWorker on Android operates differently than on desktop, and should use it accordingly, is that correct?
I feel that you mean that it is better to aim for SharedWorkers generally not being killed by the OS, rather than creating an API that assumes they will be randomly terminated. Is that understanding correct?
> Q2. When restartable is enabled, where does the SharedWorker client in Window C connect?
> I was thinking all windows would reconnect. The browser would have to pick a process that is colocated with one of the windows to restart the SharedWorker. In the example, if B is then killed the restarting process could happen again to move to C.
OK, so the new instance is shared among the clients even after a restart.
> In other words, users should be aware that SharedWorker on Android operates differently than on desktop, and should use it accordingly, is that correct?
Correct, adding a flag is a way to make sure that only web developers who understand the new behavior are getting SharedWorker. That way we are not breaking backward compatibility. I am guessing devs who are clamoring for SharedWorker would be motivated to make it work in that scenario.
I like the "restartable" flag.
What do we do about the SharedWorker.port? Currently the port is specified to not be able to change:
> The port attribute must return the value it was assigned by the object's constructor.
An added complication is I believe one can transfer/ship that port, although any situation where we create a new port would require "message" listeners to be re-applied by content unless we do some kind of hacky propagation.
I see the following broad options:
- Mint a new MessagePort, swapping it into SharedWorker.port when we fire a "restarted" event on the SharedWorker. Any "close" events on the original MessagePort would be delivered as usual.
- Create a new concept of forcibly disentangling/re-entangling MessagePorts so that the existing MessagePort and its existing message listeners still work.
Forcibly re-entangling seems like it could create significant implementation complexities if it's legal to transfer the SharedWorker.port, but a constraint we could impose on "restartable" is that we mark its port as unshippable (which would be a new concept). If the port can't be transferred, this also makes it less hacky for us to fire events like "restarted" on the SharedWorker instead of the MessagePort.
There are also potential web-compat issues with forcibly re-entangling, depending on how messages that were in flight are handled. Presumably we would want to specify something like the "port message queue" being cleared, such that:
- A "restarted" event will be fired on the SharedWorker.port when it is forcibly re-entangled.
- Any messages sent via the SharedWorker's port prior to receiving the "restarted" event will not be delivered to the newly restarted SharedWorker. This is the most achievable and most important thing, since it avoids a restarted SharedWorker from seeing messages that might otherwise make no sense to it and could break its state machine.
- Any messages sent by the old SharedWorker will not be delivered after the "restarted" event is fired. There will be a lot of variability in which messages might arrive before the "restarted" event, because there will likely be multiple processes and varying IPC minutiae, but this seems like something people can at least reason about.
Clearly this needs a more detailed design, but I was thinking something somewhat simple to start:
- Old ports become non-functional when the SharedWorker is killed
- A lifecycle event fired to all SharedWorker event targets that provides a new port for the restarted SharedWorker
The site would be responsible for redistributing the new port to its component parts. The site layer would probably need to fire some kind of "resync state" logic to ensure missing messages are accounted for, etc.
I think this would give us the flexibility to do something more complicated in the future if we need to. It's just not clear to me if it's worth moving this amount of complexity into the browser layer yet or not.
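A site-layer sketch of what this simpler design might ask of developers is below. No such lifecycle event exists today, so the replacePort entry point, the "resync" message shape, and the lastSeenSeq counter are all hypothetical; PortLike is a minimal stand-in for MessagePort so the sketch runs outside a browser.

```typescript
// Minimal structural stand-in for MessagePort.
interface PortLike {
  postMessage(message: unknown): void;
}

type PortListener = (port: PortLike) => void;

// Site-level hub: components ask the hub for the port instead of caching it,
// so a replacement port can be redistributed in one place.
class PortHub {
  private current: PortLike;
  private listeners: PortListener[] = [];

  constructor(initial: PortLike) {
    this.current = initial;
  }

  // Each subscriber immediately receives the current port.
  subscribe(listener: PortListener): void {
    this.listeners.push(listener);
    listener(this.current);
  }

  // Would be called from the handler of the hypothetical lifecycle event
  // that delivers a new port for the restarted SharedWorker.
  replacePort(next: PortLike, lastSeenSeq: number): void {
    this.current = next;
    for (const listener of this.listeners) {
      listener(next);
    }
    // Hypothetical application-level protocol: ask the new worker to resync
    // from the last message sequence number this page observed.
    next.postMessage({ type: "resync", lastSeenSeq });
  }
}

// Test double that records what was sent through it.
class RecordingPort implements PortLike {
  sent: unknown[] = [];
  postMessage(message: unknown): void {
    this.sent.push(message);
  }
}

const oldPort = new RecordingPort();
const newPort = new RecordingPort();
const hub = new PortHub(oldPort);

const portsSeenByComponent: PortLike[] = [];
hub.subscribe((port) => portsSeenByComponent.push(port));

// Simulate the worker being killed and restarted with a fresh port.
hub.replacePort(newPort, 41);
```

The point of the hub is that only one place in the site has to know about restarts. Whether that burden belongs in site code or in the browser is exactly the open question in this comment.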
I'd like us to take a step back.
Firefox on Android and Safari on iOS have been shipping shared workers which can be killed (without any restarts) for years. Developers have not, to my knowledge, complained about this.
It sounds like the proposal here is that we'd ask those browsers to take a compat hit, and start throwing for shared workers on mobile when the developer doesn't pass { restartable: true }. I'm doubtful this is worth the pain for browser vendors and web developers to adapt to.
I think we should just recognize that shared workers, like service workers, might be killed, especially on mobile. (Like I suggest in the OP.) Chromium on mobile can implement the same semantics that Firefox and Safari have implemented on mobile.
If we find any evidence of developer pain, we can then ask them if a version where the browser auto-restarts their shared worker (instead of them having to do it themselves) would alleviate that pain. We could even ask them to experiment with some of these semantics via a polyfill. And if they report that yes, this is valuable, we could add restartable shared workers as an option, on top of the base semantics.
Otherwise, I worry that we are getting too far out into designing a complicated solution that no web developers have yet asked for.
To clarify, I wasn't suggesting that safari and firefox would start throwing for non-restartable right away. They could wait and see if it became adopted.
But it seems a bit strange to ignore the realities of implementation platforms. If implementations cannot guarantee the things in the spec, shouldn't the spec change to accommodate that reality? Are you suggesting that the spec change to say "browsers may kill SharedWorkers on mobile platforms without restarting them (this is expected to be rare)"?
I didn't think we were discussing requiring browsers to throw for non-restartable SharedWorkers either... that seems like a... 😎 non-(re)starter.
If Chromium on mobile can just implement SharedWorkers and make sure the SharedWorker getting killed dispatches a close event on the MessagePort and we can see if developers can survive using that, that certainly simplifies things. The specification of the close event provides a sufficient if not perfect affordance that the SharedWorker has gone away without any additional spec changes (although content code does need to be listening for the event during the same task the SharedWorker is created).
I'm also fine though if the spec evolves to say that you are allowed (not required) to throw an error instead of starting a SharedWorker if it's non-restartable with some hand-waving about implementation choices.
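The close-event approach mentioned above could look roughly like the following TypeScript sketch. ClosablePort is a minimal stand-in for MessagePort and FakePort is a test double, both introduced here so the example runs outside a browser; in a page, the factory would presumably return the port of a newly constructed SharedWorker.

```typescript
// Minimal stand-in for the part of MessagePort this sketch needs.
interface ClosablePort {
  addEventListener(type: "close", listener: () => void): void;
}

// Creates the port and attaches the "close" listener synchronously, i.e. in the
// same task, so a worker that dies right after creation is not missed.
function connectAndWatch(
  createPort: () => ClosablePort,
  onWorkerGone: () => void,
): ClosablePort {
  const port = createPort();
  port.addEventListener("close", onWorkerGone);
  return port;
}

// Test double that lets us simulate the worker's process being killed.
class FakePort implements ClosablePort {
  private closeListeners: Array<() => void> = [];

  addEventListener(_type: "close", listener: () => void): void {
    this.closeListeners.push(listener);
  }

  simulateWorkerKilled(): void {
    for (const listener of this.closeListeners) {
      listener();
    }
  }
}

const fakePort = new FakePort();
const status = { closeEvents: 0 };

connectAndWatch(
  () => fakePort,
  () => {
    // A real page might recreate the worker or show degraded UI here.
    status.closeEvents += 1;
  },
);

fakePort.simulateWorkerKilled();
```

This only tells the page that the worker went away. Any in-memory state the worker held is lost, so recovery logic would still have to live in site code.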
> Are you suggesting that the spec change to say "browsers may kill SharedWorkers on mobile platforms without restarting them (this is expected to be rare)"?
Yes, that is what the original message says:
> My perspective is that this is kind of fine. Under memory pressure, I am sure many things get messed up. Your service worker might die. Your shared worker might die. Other Window clients you were communicating with via BroadcastChannel or storage APIs might die. But maybe this is something we should make explicit in the spec.
> Firefox on Android and Safari on iOS have been shipping shared workers which can be killed (without any restarts) for years. Developers have not, to my knowledge, complained about this.
Whoa. Not sure where to send the complaint messages, but let me log one regarding this behavior 👋. Safari's tendency to kill shared workers unexpectedly and without warning is why my company's messaging app does not support shared workers on mobile Safari (whereas we do support them in desktop browsers). The comparison of shared workers to service workers in this conversation also doesn't seem entirely appropriate. Service workers don't reliably hold state in memory (given that they frequently shut down). One of the desired use cases for SharedWorkers is so that you can reliably hold shared state in memory.
> Developers have not, to my knowledge, complained
If there's a complaint department for Safari issues that I'm not aware of, point me at it. Because oh. my. god. that browser...