workbox icon indicating copy to clipboard operation
workbox copied to clipboard

Service worker stuck in busy state when SSE request hits the cache

Open andyrichardson opened this issue 3 years ago • 14 comments

Library Affected: workbox-routing

Browser & Platform: all browsers

Issue or Feature Request Description: Bug report:

Consider the following service worker

import { clientsClaim, skipWaiting } from 'workbox-core';
import { registerRoute } from **'workbox-routing';
import { StaleWhileRevalidate } from 'workbox-strategies';

skipWaiting();
clientsClaim();

registerRoute(
  ({ request, url }) => request.url.origin === self.location.origin && /(?<!service-worker)\.js$/.test(url.pathname),
  new StaleWhileRevalidate({
    cacheName: 'app',
  }),
);

When a new service worker exists and installation of the newer worker begins, it would be expected that the use of skipWaiting would activate the new service worker.

This isn't the case however because the registerRoute function instantiates a fetch handler that never calls waitUntil.

The result of this is that the updated service worker will never activate (stuck in installed state) until the user's browser session is closed.

andyrichardson avatar Dec 04 '20 21:12 andyrichardson

Hello @andyrichardson! I am not sure that your diagnosis of what's going on is accurate.

Calling respondWith(responsePromise) on a FetchEvent already extends the lifetime of the service worker until responsePromise settles. C.f. the note in step 4 of the respondWith() description in the spec:

Note: event.respondWith(r) extends the lifetime of the event by default as if event.waitUntil(r) is called.

Calling skipWaiting() does three things in parallel:

  • sets a flag indicating that if the current service worker enters the installed state, it should immediately transition to the activating state.
  • tells the service worker registration associated with the current service worker that if there's a service worker that's already waiting, it should transition to activating.
  • returns a promise that resolves immediately with undefined (not waiting for the other two steps to complete).

If you're trying to debug why a service worker is waiting in the installed state instead of activating, it might be because you're calling skipWaiting() in the incorrect service worker (the old one, not the new one that's actually waiting), or because your code is more complex than the example you included above and you're running afoul of something like https://github.com/GoogleChrome/workbox/issues/2525. Because of that issue, we've stopped wrapping skipWaiting() in an install event handler in Workbox v6, and encourage folks to just call self.skipWaiting() instead.

jeffposnick avatar Dec 05 '20 21:12 jeffposnick

Hey @jeffposnick, thanks for the response.

Calling respondWith(responsePromise) on a FetchEvent already extends the lifetime of the service worker until responsePromise settles.

Totally with you on this - even without checking the spec, I was under a similar assumption that respondWith would remove the need for waitUntil.

I've managed to consistently reproduce this issue on Chrome and Firefox and the one line change in the PR fixed the problem. I'm not totally sure why this would be the case.

I'll spend some time this week creating a reproduction so you can try it out for yourself!

Just to confirm, have you tried creating a service worker which calls registerRoute and then subsequently try and update that service worker? This has consistently caused the new service worker on to get stuck in the installed state on Chrome and Firefox on my end.

andyrichardson avatar Dec 07 '20 10:12 andyrichardson

Before you spend too much more time debugging this, can you try a quick swap from using workbox-core's skipWaiting() to using self.skipWaiting() instead, and see if that resolves things? (You could alternatively update to Workbox v6. Following #2525, the two are equivalent.)

The previous implement of workbox-core's skipWaiting() wrapped the call to self.skipWaiting() in an install event handler, and that ended up introducing edge cases. That's really the only thing that comes to mind as a possible cause of what you're describing.

jeffposnick avatar Dec 07 '20 16:12 jeffposnick

I had tried calling self.skipWaiting:

  • On js load
  • On install event
  • On message from browser (w/ message added to browser)

But no dice!

The current project I'm on uses whatever is bundled with CRA but I'll try a standalone example without CRA and see if this is still a concern.

andyrichardson avatar Dec 08 '20 16:12 andyrichardson

That's very strange.

What happens if you go to Chrome's DevTools and manually click on the "skipWaiting" link that you should see next to the waiting service worker?

77b5eaa130252f7e

jeffposnick avatar Dec 08 '20 17:12 jeffposnick

That's very strange.

What happens if you go to Chrome's DevTools and manually click on the "skipWaiting" link that you should see next to the waiting service worker?

Manually clicking skipWaiting or update does nothing (no error or visual indication of a change). I have to either hit stop or unregister to get the new service worker activated.

Kapture 2020-12-08 at 18 20 15

I got around to creating a simple reproduction (with v5 and v6 package versions) but I'm unable to reproduce the issue in a more basic environment so I'm not sure what the cause is.


If I'm not mistaken, this nesting implies that the service worker has spawned a worker thread - that might be what is keeping the service worker from unloading.

Screen Shot 2020-12-08 at 18 28 39

That spawned worker thread looks to have some relation to buffering

Screen Shot 2020-12-08 at 18 31 18

andyrichardson avatar Dec 08 '20 18:12 andyrichardson

Jeeeeeezzz

Okay I found the cause of the issue.

registerRoute(
  ({ url }) => /^somefeatureflagservice/.test(url.host),
  new StaleWhileRevalidate({
    cacheName: `flags`,
  }),
);

In my case, I was caching a feature flagging network request for offline support.

This request was a ~websocket~ Server-sent event (SSE) request.

I guess that explains why adding waitUntil along with respondWith fixed the issue.

respondWith doesn't flag the event as closed with a streaming connection (multiple responses are possible) while waitUntil(someProm) does (once someProm resolves I'm guessing it assumes a connection close).

andyrichardson avatar Dec 08 '20 18:12 andyrichardson

~Not sure where we go from here - it would be nice to have that waitUntil addition to cover situations where websocket requests are being cached.~

Edit: Both via the cache and network, SSE requests look to get infinitely stuck in the pending state when using StaleWhileRevalidate

Is workbox's cache intended to work with SSE/websocket/streaming connections?

If not, do we want to add in a warning / skip the cache altogether so we don't keep a fetch event open indefinitely?

andyrichardson avatar Dec 08 '20 18:12 andyrichardson

Ah, okay, glad that there's an explanation!

Service workers don't intercept WebSocket requests, but they can be used with the Streams API, and they also explicitly do intercept SSE, as per this thread that @wanderview participated in back in the day. That thread discussed potential edge cases and had an overall consensus that using service workers with SSE was likely to be uncommon.

I'm not ruling out making a change to Workbox if needed, but I'm wondering if there's an underlying issue that would better be addressed by the browser(s). You mentioned both Chrome and Firefox exhibited that behavior—how about Safari?

I'd like to summarize the issue and then pull in some folks for their opinion. Please correct me if I'm wrong about any of this:

  • You have a service worker, using Workbox, whose fetch handler calls event.respondWith(responsePromise) to generate a response for a SSE. (I'm not clear on whether this was a mistake since you didn't realize it was a SSE, or whether you actually meant to do this.)

  • Doing so keeps that service worker alive for an extended period of time. (Presumably until the server closes the connection, though maybe there's a hard time limit imposed by the browser?)

  • If, during the time that the active service worker is being kept alive, a newly installed service worker calls self.skipWaiting(), this has no effect. The new service worker remains in the installed state indefinitely, and things appear to be "stuck."

  • But, if the active service worker had called event.waitUntil(responsePromise) inside its fetch handler, immediately prior to calling event.respondWith(responsePromise), then calling self.skipWaiting() in the newly installed service worker behaves as intended, and things are not "stuck." (This is what https://github.com/GoogleChrome/workbox/pull/2693 added.)

  • The service worker spec states that "event.respondWith(r) extends the lifetime of the event by default as if event.waitUntil(r) is called.", so I find it unexpected that explicitly calling event.waitUntil() changes the behavior.

jeffposnick avatar Dec 08 '20 21:12 jeffposnick

I would expect the service worker to be killed after 5 minutes of being held open by respondWith() processing an SSE response. (Unless another FetchEvent or something occurred since then of course...)

Note, if you open devtools then chrome will not terminate the service worker due to this timeout.

wanderview avatar Dec 08 '20 21:12 wanderview

Thanks, @wanderview. Any thoughts on those final two points, where adding in an explicit event.waitUntil() prior to the event.respondWith() leads to different behavior? Does that sound like a browser bug, or am I misinterpreting the service worker spec?

jeffposnick avatar Dec 08 '20 21:12 jeffposnick

Doesn't make sense from the browser perspective. I don't really know what workbox is doing around where it calls those, though. Maybe its an incorrect association and the real issue was whether devtools had been opened or not, etc.

wanderview avatar Dec 08 '20 21:12 wanderview

Gotcha. If the browser technically shouldn't require that explicit event.waitUntil(), then I'd rather not add it in to Workbox to work around the reported behavior.

@andyrichardson, what if we added additional logging that's enabled in the workbox-routing development builds which checks for fetchEvent.request.headers.get('accept') === 'text/event-stream' and warns when workbox-router attempts to respond to such an event?

I'll wait to hear back from you, but my understanding is that you didn't actually want to have the service worker intercept that SSE, and it's not clear how a StaleWhileRevalidate strategy could properly handle it anyway—though a different workbox-strategy, like NetworkOnly, presumably would be able to handle it.

jeffposnick avatar Dec 08 '20 21:12 jeffposnick

Hey folks! Thanks for the discussion on this!

You have a service worker, using Workbox, whose fetch handler calls event.respondWith(responsePromise) to generate a response for a SSE. (I'm not clear on whether this was a mistake since you didn't realize it was a SSE, or whether you actually meant to do this.)

Yes that's pretty much it

registerRoute(
  ({ url }) => /^somefeatureflagservice/.test(url.host),
  new StaleWhileRevalidate({
    cacheName: `flags`,
  }),
);

what if we added additional logging that's enabled in the workbox-routing development builds which checks for fetchEvent.request.headers.get('accept') === 'text/event-stream' and warns when workbox-router attempts to respond to such an event?

In my case, I wasn't even aware that SSE was present on the app - that request was being made by a third party library. I think the sketchy part here is that once we get into that SEE w/ workbox caching state - it's impossible to push a new service-worker (or app in the case of precaching) and you're at the mercy of when the user exits their browser session.

If it's possible to prevent that locked-in condition whether that be by ignoring text/event-stream requests or by calling waitUntil (which I would guess would cover all streaming protocols) I think that's the safest option - it doesn't look like SEE can ever work with workbox's cache anyhow.

If not, maybe documentation on filtering out text/event-stream be the second best option.

Side note - CRA for better or for worse only enables service workers in production builds. Obviously there's more to web dev than just CRA but it's worth noting that a lot of users could encounter this issue and not be aware until it hits production.

andyrichardson avatar Dec 08 '20 22:12 andyrichardson