feat: focus on safe, compliant, local-first, non-expiring service worker
e.g. remove `serviceWorkerRegistrationTTL` (timebomb/expiry)... a continuation of the original thought behind https://github.com/ipfs/service-worker-gateway/issues/724, with less reliance on verbal communication of the issue and more written documentation of my reasoning.
We chatted about this at yesterday's Helia WG, so I wanted to come back with a more thorough argument (one I've voiced before, but never written down).
So I built some logic tables to map out how service worker updates are actually delivered to the user. The tables below compare service worker update scenarios with and without the registration TTL/timebomb mechanism.
## Service Worker Update Logic Tables
This was generated with the help of an LLM, and then cleaned up and verified. The summary section at the bottom is 99% human-generated.
### Variables
- userOnlineStatus: Online/Offline
- isSwUpdateAvailable: Yes/No (new version exists)
- userNavigationBehavior: Navigates/Stays on same page
- timeSinceLastUpdateCheck: < 24h / ≥ 24h (time since the last update check browsers perform on navigation; see FF and Chromium)
- ttlExpired: Yes/No (TTL has expired, only relevant when TTL is used)
- doesUserGetUpdatedSw: Yes/No (final outcome)
### Table 1: WITHOUT TTL (Pure Browser Updates)
| Scenario | Online | Update Available | Navigation | Time Since Check | Gets Updated SW | Notes |
|---|---|---|---|---|---|---|
| 1 | ✅ | ✅ | ✅ | < 24h | ✅ | Normal update on navigation |
| 2 | ✅ | ✅ | ✅ | ≥ 24h | ✅ | Update check triggered after 24h |
| 3 | ✅ | ✅ | ❌ | < 24h | ❌ | No navigation = no update check |
| 4 | ✅ | ✅ | ❌ | ≥ 24h | ❌ | No navigation = no update check |
| 5 | ✅ | ❌ | ✅ | < 24h | ❌ | No update needed, current SW works |
| 6 | ✅ | ❌ | ✅ | ≥ 24h | ❌ | No update needed, current SW works |
| 7 | ✅ | ❌ | ❌ | < 24h | ❌ | No update needed, current SW works |
| 8 | ✅ | ❌ | ❌ | ≥ 24h | ❌ | No update needed, current SW works |
| 9 | ❌ | ✅ | ✅ | < 24h | ❌ | Offline = no update check possible |
| 10 | ❌ | ✅ | ✅ | ≥ 24h | ❌ | Offline = no update check possible |
| 11 | ❌ | ✅ | ❌ | < 24h | ❌ | Offline + no navigation = no update |
| 12 | ❌ | ✅ | ❌ | ≥ 24h | ❌ | Offline + no navigation = no update |
| 13 | ❌ | ❌ | ✅ | < 24h | ❌ | No update needed, current SW works |
| 14 | ❌ | ❌ | ✅ | ≥ 24h | ❌ | No update needed, current SW works |
| 15 | ❌ | ❌ | ❌ | < 24h | ❌ | No update needed, current SW works |
| 16 | ❌ | ❌ | ❌ | ≥ 24h | ❌ | No update needed, current SW works |
### Table 2: WITH TTL (Browser Updates + TTL Expiration)
| Scenario | Online | Update Available | Navigation | Time Since Check | TTL Expired | Gets Updated SW | Notes |
|---|---|---|---|---|---|---|---|
| 1 | ✅ | ✅ | ✅ | < 24h | ❌ | ✅ | Normal update on navigation |
| 2 | ✅ | ✅ | ✅ | ≥ 24h | ❌ | ✅ | Update check triggered after 24h |
| 3 | ✅ | ✅ | ✅ | < 24h | ✅ | ✅ | TTL expires → SW unregisters → new SW installs |
| 4 | ✅ | ✅ | ✅ | ≥ 24h | ✅ | ✅ | TTL expires → SW unregisters → new SW installs |
| 5 | ✅ | ✅ | ❌ | < 24h | ❌ | ❌ | No navigation = no update check |
| 6 | ✅ | ✅ | ❌ | ≥ 24h | ❌ | ❌ | No navigation = no update check |
| 7 | ✅ | ✅ | ❌ | < 24h | ✅ | ❌ | TTL expires but no navigation to trigger re-registration |
| 8 | ✅ | ✅ | ❌ | ≥ 24h | ✅ | ❌ | TTL expires but no navigation to trigger re-registration |
| 9 | ✅ | ❌ | ✅ | < 24h | ❌ | ❌ | No update needed, current SW works |
| 10 | ✅ | ❌ | ✅ | ≥ 24h | ❌ | ❌ | No update needed, current SW works |
| 11 | ✅ | ❌ | ✅ | < 24h | ✅ | ❌ | TTL expires → SW unregisters → same SW reinstalls |
| 12 | ✅ | ❌ | ✅ | ≥ 24h | ✅ | ❌ | TTL expires → SW unregisters → same SW reinstalls |
| 13 | ✅ | ❌ | ❌ | < 24h | ❌ | ❌ | No update needed, current SW works |
| 14 | ✅ | ❌ | ❌ | ≥ 24h | ❌ | ❌ | No update needed, current SW works |
| 15 | ✅ | ❌ | ❌ | < 24h | ✅ | ❌ | TTL expires → SW unregisters → same SW reinstalls |
| 16 | ✅ | ❌ | ❌ | ≥ 24h | ✅ | ❌ | TTL expires → SW unregisters → same SW reinstalls |
| 17 | ❌ | ✅ | ✅ | < 24h | ❌ | ❌ | Offline = no update check possible |
| 18 | ❌ | ✅ | ✅ | ≥ 24h | ❌ | ❌ | Offline = no update check possible |
| 19 | ❌ | ✅ | ✅ | < 24h | ✅ | ❌ | TTL disabled when offline |
| 20 | ❌ | ✅ | ✅ | ≥ 24h | ✅ | ❌ | TTL disabled when offline |
| 21 | ❌ | ✅ | ❌ | < 24h | ❌ | ❌ | Offline + no navigation = no update |
| 22 | ❌ | ✅ | ❌ | ≥ 24h | ❌ | ❌ | Offline + no navigation = no update |
| 23 | ❌ | ✅ | ❌ | < 24h | ✅ | ❌ | TTL disabled when offline |
| 24 | ❌ | ✅ | ❌ | ≥ 24h | ✅ | ❌ | TTL disabled when offline |
| 25 | ❌ | ❌ | ✅ | < 24h | ❌ | ❌ | No update needed, current SW works |
| 26 | ❌ | ❌ | ✅ | ≥ 24h | ❌ | ❌ | No update needed, current SW works |
| 27 | ❌ | ❌ | ✅ | < 24h | ✅ | ❌ | TTL disabled when offline, but no update needed |
| 28 | ❌ | ❌ | ✅ | ≥ 24h | ✅ | ❌ | TTL disabled when offline, but no update needed |
| 29 | ❌ | ❌ | ❌ | < 24h | ❌ | ❌ | No update needed, current SW works |
| 30 | ❌ | ❌ | ❌ | ≥ 24h | ❌ | ❌ | No update needed, current SW works |
| 31 | ❌ | ❌ | ❌ | < 24h | ✅ | ❌ | TTL disabled when offline, but no update needed |
| 32 | ❌ | ❌ | ❌ | ≥ 24h | ✅ | ❌ | TTL disabled when offline, but no update needed |
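The thesis of the two tables can be checked mechanically. Here is a sketch in plain JavaScript (function names are mine, not the project's code): the "Gets Updated SW" outcome never depends on `ttlExpired` (or `timeSinceLastUpdateCheck`), so both tables reduce to the same three-way AND.

```javascript
// "Gets Updated SW" from Table 1: true only when the user is online,
// an update exists, and a navigation triggers the browser's check.
function getsUpdatedSw ({ online, updateAvailable, navigates }) {
  return online && updateAvailable && navigates
}

// "Gets Updated SW" from Table 2: ttlExpired is accepted but never
// consulted -- expiry can only force a re-install in scenarios where
// the browser's own update check would deliver the new SW anyway.
function getsUpdatedSwWithTtl ({ online, updateAvailable, navigates, ttlExpired }) {
  return getsUpdatedSw({ online, updateAvailable, navigates })
}

// Exhaustive sweep over all 16 combinations: the TTL column never
// changes the update outcome.
for (const online of [true, false]) {
  for (const updateAvailable of [true, false]) {
    for (const navigates of [true, false]) {
      for (const ttlExpired of [true, false]) {
        const a = getsUpdatedSw({ online, updateAvailable, navigates })
        const b = getsUpdatedSwWithTtl({ online, updateAvailable, navigates, ttlExpired })
        if (a !== b) throw new Error('TTL changed the update outcome')
      }
    }
  }
}
```

The sweep is the point: there is no assignment of the inputs where the TTL flips "Gets Updated SW" from ❌ to ✅.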
## Key Differences: TTL vs No TTL
### Scenarios Where TTL Actually Makes a Difference
Looking at the logic tables, the TTL mechanism only provides a benefit in one specific scenario:
| Scenario | Without TTL | With TTL | Actual Difference |
|---|---|---|---|
| User goes offline with old SW → update available while offline → user stays offline indefinitely | ✅ SW remains active, cached content accessible | ❌ SW unregisters when TTL expires, cached content becomes inaccessible | TTL removes access to cached content (this could be badbits content, but probably is not) |
### Scenarios Where TTL Provides No Benefit
All other scenarios show identical outcomes between TTL and no-TTL:
| Scenario | Without TTL | With TTL | Difference |
|---|---|---|---|
| User offline → update available → comes back online | ✅ Gets update on navigation | ✅ Gets update on navigation | None |
| User online with update available | ✅ Gets update on navigation | ✅ Gets update on navigation | None |
| User stays on same page for extended period | ❌ No update (no navigation) | ❌ No update (no navigation) | None |
| No update available | ✅ Current SW works | ✅ Current SW works | None |
### TTL Limitations
- **TTL check only runs during fetch events**: if the service worker is idle, TTL expiration won't be detected until the next navigation
- **TTL disabled when offline**: when `navigator.onLine === false`, the TTL check returns `true` (no expiration)
- **Requires navigation**: even when the TTL expires, the user must navigate to trigger service worker re-registration
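These limitations fall straight out of where the check lives. A simplified sketch of the gate, in my own naming (not the actual implementation):

```javascript
// Simplified sketch of the TTL gate. In the real SW this runs inside
// the fetch handler, which is why an idle worker never notices expiry
// (limitation 1) and why expiry alone can't trigger re-registration
// without a navigation (limitation 3).
function isRegistrationStillValid ({ registeredAtMs, ttlMs, nowMs, onLine }) {
  // Limitation 2: offline users are exempt -- the check short-circuits
  // to "still valid" whenever navigator.onLine reports false.
  if (!onLine) return true
  return nowMs - registeredAtMs < ttlMs
}
```

Note that the function can only ever be *consulted*, never *act on its own*: something must call it, and in a service worker that something is a fetch event.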
## Analysis
Assuming we implement badbits blocking in the service worker gateway:
### The Only Real Benefit of TTL

**Content takedown for offline users**: the TTL mechanism can remove access to cached content (including potentially problematic content) for users who go offline and never come back online. Without TTL, these users would retain access to cached content indefinitely. But the user still has to attempt to load the content again for the unregistration to fire.
### The Real Cost of TTL

**Degraded offline experience**: users who rely on the service worker for offline functionality may lose access to cached content when the TTL expires, even if they're still offline and the content is still valid.
### Key Insight

The `serviceWorkerRegistrationTTL` is a content takedown mechanism that works by unregistering the service worker after a time period, which removes access to all cached content. This is useful for:
- Legal compliance: Removing access to content that has been flagged as problematic
## Summary
### The "bad deployment recovery" argument is false

TTL doesn't help with nasty bugs, because users still won't get updates unless they meet the same browser update requirements (online + navigation). The only way to deploy a non-updatable service worker is by changing the SW path in registration, which we've already solved with the namespaced `ipfs-sw-sw.js` path. (Don't ever change this; there is a test that confirms it exists in the output. We could probably override types for `navigator.serviceWorker.register` as well, and in that type override, link to this issue.)
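One cheap way to make the "never rename it" rule enforceable in code — a hypothetical guard; `SW_PATH` and `assertSwPath` are illustrative names, not the project's actual API:

```javascript
// Hypothetical guard: renaming the SW script orphans every installed
// worker, because browsers re-fetch updates from the registered URL.
// Funneling registration through one asserted constant makes an
// accidental rename fail loudly in tests.
const SW_PATH = '/ipfs-sw-sw.js'

function assertSwPath (path) {
  if (path !== SW_PATH) {
    throw new Error(`refusing to register ${path}; the service worker must stay at ${SW_PATH}`)
  }
  return path
}

// Usage (in the browser, not runnable here):
// await navigator.serviceWorker.register(assertSwPath(SW_PATH))
```

This is complementary to the existing output test: the test catches a rename at build time, the guard catches a stray second registration call at review time.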
### TTL degrades experience for legitimate users

Since we have no way to know if a user has accessed badbits content, TTL will break offline functionality for users accessing legitimate IPFS content.
## More thoughts...
IANAL -- but if we provide the updated code, the above logic table should be sufficient to show that a user keeping access to that content is doing so intentionally, and that if they do choose to do so, TTL will only break them if they explicitly reload the page, which we have little control over. And by breaking users who explicitly choose to keep bad content, we also break the 99% of users who are not doing so.
The below is very related to https://github.com/ipfs/service-worker-gateway/issues/72.
I think it's also worth noting that badbits are already filtered out with the default config (trustless-gateway), though we are still making network requests to peers that may not be blocking badbits.
We should find a solution to block badbits in the service-worker gateway, and then remove TTL. Still, even with a badbits check in the service worker, users can choose to block their service worker from ever updating, and thus deny any future updates to what is blocked. I think this would be acceptable.
Some potential solutions for badbits in the service worker:
- We could implement badbits on the hosting layer so that subdomain and path requests to blocked content always return an error. This allows us to quickly "hotfix" the badbits list before the public badlist gets updated... but users can still get around this if they explicitly want to.
- We could host a badbits server that serves cacheable requests/responses for badbits checks.
- We could implement badbits in the service worker directly.
  - This requires a smart Xor filter + MPHF + fingerprint.
  - The list is ~30MB now; we can't fit that in the service worker.
  - A rough estimate for a combined xor filter + MPHF + 32-bit fingerprint (for 500k entries in the badbits list) would be about 2.5MiB (`S_mib(numKeys) = numKeys * (xorBitsPerKey /* 9 */ + mphfBitsPerKey /* 2.62 */ + fingerprintBitsPerKey /* 32 */) / (8 * 2^20)`). This setup would give us, based on 614M (see non-unique) requests to IPFS gateways daily, roughly an 18.4% probability of at least one false positive in a year (i.e. blocking an item that's not in the badbits list).
    - Bumping to a 40-bit fingerprint (SW size 2.94MiB) would give a 0.0796% probability of at least one FP in a year.
    - Bumping to a 64-bit fingerprint (SW size 4.32MiB) would give a 0.0000000000475% probability of at least one FP in a year.
    - Note that inbrowser.link does not receive this much traffic, and that the badbits list will grow.
- extend badbits check into a consensus service provided by kubo and other nodes... this would be a little overkill I think.
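The size estimate above can be reproduced from the quoted per-key bit costs (9 bits/key for the xor filter, 2.62 for the MPHF; both are the rough figures from the bullet, not measured values):

```javascript
// S_mib(numKeys): total filter size in MiB, given per-key bit costs.
// Defaults are the rough estimates quoted above; only the fingerprint
// width varies between the three scenarios.
function filterSizeMiB (numKeys, fingerprintBits, xorBitsPerKey = 9, mphfBitsPerKey = 2.62) {
  const totalBits = numKeys * (xorBitsPerKey + mphfBitsPerKey + fingerprintBits)
  return totalBits / (8 * 2 ** 20) // bits -> bytes -> MiB
}

// 500k entries, 32-bit fingerprint:
console.log(filterSizeMiB(500_000, 32).toFixed(2)) // prints "2.60"
```

With these inputs the 32-bit case lands at ≈2.6 MiB, the same ballpark as the ~2.5MiB quoted above; the exact MiB figures depend on which per-key constants you assume, so treat all of them as order-of-magnitude estimates.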
Thanks for writing this down.
If I understand the gist of this analysis, the recommendations are:
1. **Remove SW TTL (timebomb)**
   - 👍 makes sense
   - short-term: #841 fixes how the timebomb is executed, should be good for now
   - long-term: removal of the timebomb sgtm
     - in my mind it was never a mechanism for protecting from badbits (if phishing loaded, it's already game over; expiring it does not matter, the user already got tricked) -- it was mainly a way to avoid waiting 24h for users to update if we shipped broken JS (which, to be fair, happened in early iterations)
2. **Implement native badbits support**
   - ⚠️ this one is a can of worms, we need a wider ecosystem plan first. most users will not run a public open gateway.
   - current approach at inbrowser.link applies badbits before returning the service worker installer (reusing what we do on dweb.link already)
   - unsure how feasible native support is:
     - badbits being published as a bloomfilter-thing sgtm, but it depends on extra dev work upstream at the https://badbits.dwebops.pub pipeline + writing a spec for what the filter looks like, similar to https://specs.ipfs.tech/compact-denylist-format/
     - once we have the filter produced, and a client capable of reading it, this would have to be continuous: fetching a 3MiB bloomfilter-thing one time is not enough, we would have to check for a new list every 5 minutes, the same way we do on ipfs.io
       - badbits updates multiple times a day, and that could mean an extra 3MiB being fetched in the background multiple times a day
   - personally, I'm leaning towards badbits being more like a "family filter DNS resolver" (example) -- implemented not on the client, but on the delegated routing endpoint
     - https://delegated-ipfs.dev/routing/v1/ would remain as-is, unfiltered, but the public gateway service worker at inbrowser.link would switch to a dedicated moderated instance at https://filtered.delegated-ipfs.dev/routing/v1/ which skips results for blocked CIDs and domains.
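To make that last idea concrete, here is a sketch of a client pointed at a moderated routing instance. `filtered.delegated-ipfs.dev` is the proposed (not deployed) hostname from above; `/routing/v1/providers/{cid}` is the standard Delegated Routing V1 HTTP API path, and a moderated instance would simply return no providers for blocked CIDs:

```javascript
// Sketch: provider lookup against a (hypothetical) moderated
// delegated routing endpoint. Blocking happens server-side, so the
// client needs zero badbits logic -- blocked CIDs just resolve to an
// empty provider list and nothing gets dialed.
const ROUTING_BASE = 'https://filtered.delegated-ipfs.dev' // proposed, not deployed

async function findProviders (cid, fetchImpl = fetch) {
  const res = await fetchImpl(`${ROUTING_BASE}/routing/v1/providers/${cid}`, {
    headers: { accept: 'application/json' }
  })
  if (!res.ok) return [] // treat errors/blocks as "no providers"
  const body = await res.json()
  return body.Providers ?? []
}
```

The appeal of this shape is exactly the "family filter DNS" analogy: the client stays dumb, and opting in or out of moderation is just a base-URL swap.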
> which, to be fair, happened in early iterations

Yea, it sure did. We didn't have a locked SW name at that time; it had the chunk hash in the SW name, and then we renamed it once more. Now, as long as `ipfs-sw-sw.js` doesn't change names, users will always get the latest version if they are online and attempt to access it.
> we would have to check for new list every 5 minutes

I was not aware it was updated this frequently. That definitely changes things... but I think it would be possible to build a mini badbits CDN for a compressed JS filter, plus a client-side loader of that filter, so folks can implement checks in JS land that do the right thing.
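A minimal sketch of such a loader — the endpoint, header handling, and polling interval are all assumptions for illustration, and the real thing would also decompress and parse the filter:

```javascript
// Hypothetical badbits-CDN client. Revalidates with ETag so the
// every-few-minutes poll usually costs a 304 instead of ~3 MiB.
async function refreshFilter (state, fetchImpl = fetch) {
  const headers = state.etag ? { 'if-none-match': state.etag } : {}
  const res = await fetchImpl(state.url, { headers })
  if (res.status === 304) return state // unchanged: keep current bytes
  state.etag = res.headers.get('etag')
  state.bytes = new Uint8Array(await res.arrayBuffer())
  return state
}

// In the SW this would run on a timer, e.g.:
// setInterval(() => refreshFilter(filterState).catch(() => {}), 5 * 60_000)
```

With frequent updates the cheap path matters: only a changed list pays the full transfer, every other poll is a conditional request.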
> [dns filter] on the delegated routing endpoint

Great idea. That should solve helia/js needs, especially if we're not doing DHT things or connecting to peers directly. This would be a much easier step forward for filtering badbits in the JS ecosystem.