
feat: focus on safe, compliant, local-first, non-expiring service worker

SgtPooki opened this issue 4 months ago · 2 comments

e.g. remove serviceWorkerRegistrationTTL/timebomb/expiry. This is a continuation of the original thought behind https://github.com/ipfs/service-worker-gateway/issues/724, with less reliance on verbal communication of the issue and more written documentation of my reasoning.

We chatted about this at yesterday's Helia WG, so I wanted to come back to this with a more thorough argument (one I've voiced before, but never written down).

So I built some logic tables to map out how service worker updates are actually delivered to users. The tables below compare service worker update scenarios with and without the registration TTL/timebomb mechanism.

Service Worker Update Logic Tables

This was generated with the help of an LLM, and then cleaned up and verified. The summary section at the bottom is 99% human-generated.

Variables

  • userOnlineStatus: Online/Offline
  • isSwUpdateAvailable: Yes/No (new version exists)
  • userNavigationBehavior: Navigates/Stays on same page
  • timeSinceLastUpdateCheck: < 24h / ≥ 24h (time since the periodic update check that browsers perform; see FF and Chromium)
  • ttlExpired: Yes/No (TTL has expired, only relevant when TTL is used)
  • doesUserGetUpdatedSw: Yes/No (final outcome)

Table 1: WITHOUT TTL (Pure Browser Updates)

| Scenario | Online | Update Available | Navigation | Time Since Check | Gets Updated SW | Notes |
|----------|--------|------------------|------------|------------------|-----------------|-------|
| 1 | Online | Yes | Navigates | < 24h | Yes | Normal update on navigation |
| 2 | Online | Yes | Navigates | ≥ 24h | Yes | Update check triggered after 24h |
| 3 | Online | Yes | Stays | < 24h | No | No navigation = no update check |
| 4 | Online | Yes | Stays | ≥ 24h | No | No navigation = no update check |
| 5 | Online | No | Navigates | < 24h | N/A | No update needed, current SW works |
| 6 | Online | No | Navigates | ≥ 24h | N/A | No update needed, current SW works |
| 7 | Online | No | Stays | < 24h | N/A | No update needed, current SW works |
| 8 | Online | No | Stays | ≥ 24h | N/A | No update needed, current SW works |
| 9 | Offline | Yes | Navigates | < 24h | No | Offline = no update check possible |
| 10 | Offline | Yes | Navigates | ≥ 24h | No | Offline = no update check possible |
| 11 | Offline | Yes | Stays | < 24h | No | Offline + no navigation = no update |
| 12 | Offline | Yes | Stays | ≥ 24h | No | Offline + no navigation = no update |
| 13 | Offline | No | Navigates | < 24h | N/A | No update needed, current SW works |
| 14 | Offline | No | Navigates | ≥ 24h | N/A | No update needed, current SW works |
| 15 | Offline | No | Stays | < 24h | N/A | No update needed, current SW works |
| 16 | Offline | No | Stays | ≥ 24h | N/A | No update needed, current SW works |

Table 2: WITH TTL (Browser Updates + TTL Expiration)

| Scenario | Online | Update Available | Navigation | Time Since Check | TTL Expired | Gets Updated SW | Notes |
|----------|--------|------------------|------------|------------------|-------------|-----------------|-------|
| 1 | Online | Yes | Navigates | < 24h | No | Yes | Normal update on navigation |
| 2 | Online | Yes | Navigates | ≥ 24h | No | Yes | Update check triggered after 24h |
| 3 | Online | Yes | Navigates | < 24h | Yes | Yes | TTL expires → SW unregisters → new SW installs |
| 4 | Online | Yes | Navigates | ≥ 24h | Yes | Yes | TTL expires → SW unregisters → new SW installs |
| 5 | Online | Yes | Stays | < 24h | No | No | No navigation = no update check |
| 6 | Online | Yes | Stays | ≥ 24h | No | No | No navigation = no update check |
| 7 | Online | Yes | Stays | < 24h | Yes | No | TTL expires but no navigation to trigger re-registration |
| 8 | Online | Yes | Stays | ≥ 24h | Yes | No | TTL expires but no navigation to trigger re-registration |
| 9 | Online | No | Navigates | < 24h | No | N/A | No update needed, current SW works |
| 10 | Online | No | Navigates | ≥ 24h | No | N/A | No update needed, current SW works |
| 11 | Online | No | Navigates | < 24h | Yes | N/A | TTL expires → SW unregisters → same SW reinstalls |
| 12 | Online | No | Navigates | ≥ 24h | Yes | N/A | TTL expires → SW unregisters → same SW reinstalls |
| 13 | Online | No | Stays | < 24h | No | N/A | No update needed, current SW works |
| 14 | Online | No | Stays | ≥ 24h | No | N/A | No update needed, current SW works |
| 15 | Online | No | Stays | < 24h | Yes | N/A | TTL expires → SW unregisters → same SW reinstalls |
| 16 | Online | No | Stays | ≥ 24h | Yes | N/A | TTL expires → SW unregisters → same SW reinstalls |
| 17 | Offline | Yes | Navigates | < 24h | No | No | Offline = no update check possible |
| 18 | Offline | Yes | Navigates | ≥ 24h | No | No | Offline = no update check possible |
| 19 | Offline | Yes | Navigates | < 24h | Yes | No | TTL disabled when offline |
| 20 | Offline | Yes | Navigates | ≥ 24h | Yes | No | TTL disabled when offline |
| 21 | Offline | Yes | Stays | < 24h | No | No | Offline + no navigation = no update |
| 22 | Offline | Yes | Stays | ≥ 24h | No | No | Offline + no navigation = no update |
| 23 | Offline | Yes | Stays | < 24h | Yes | No | TTL disabled when offline |
| 24 | Offline | Yes | Stays | ≥ 24h | Yes | No | TTL disabled when offline |
| 25 | Offline | No | Navigates | < 24h | No | N/A | No update needed, current SW works |
| 26 | Offline | No | Navigates | ≥ 24h | No | N/A | No update needed, current SW works |
| 27 | Offline | No | Navigates | < 24h | Yes | N/A | TTL disabled when offline, but no update needed |
| 28 | Offline | No | Navigates | ≥ 24h | Yes | N/A | TTL disabled when offline, but no update needed |
| 29 | Offline | No | Stays | < 24h | No | N/A | No update needed, current SW works |
| 30 | Offline | No | Stays | ≥ 24h | No | N/A | No update needed, current SW works |
| 31 | Offline | No | Stays | < 24h | Yes | N/A | TTL disabled when offline, but no update needed |
| 32 | Offline | No | Stays | ≥ 24h | Yes | N/A | TTL disabled when offline, but no update needed |

Key Differences: TTL vs No TTL

Scenarios Where TTL Actually Makes a Difference

Looking at the logic tables, the TTL mechanism only provides a benefit in one specific scenario:

| Scenario | Without TTL | With TTL | Actual Difference |
|----------|-------------|----------|-------------------|
| User goes offline with old SW → update available while offline → user stays offline indefinitely | ✅ SW remains active, cached content accessible | ❌ SW unregisters when TTL expires, cached content becomes inaccessible | TTL removes access to cached content (this could be badbits content, but probably isn't) |

Scenarios Where TTL Provides No Benefit

All other scenarios show identical outcomes between TTL and no-TTL:

| Scenario | Without TTL | With TTL | Difference |
|----------|-------------|----------|------------|
| User offline → update available → comes back online | ✅ Gets update on navigation | ✅ Gets update on navigation | None |
| User online with update available | ✅ Gets update on navigation | ✅ Gets update on navigation | None |
| User stays on same page for extended period | ❌ No update (no navigation) | ❌ No update (no navigation) | None |
| No update available | ✅ Current SW works | ✅ Current SW works | None |

TTL Limitations

  • TTL check only runs during fetch events: if the service worker is idle, TTL expiration won't be detected until the next navigation (see the sketch after this list)
  • TTL disabled when offline: when navigator.onLine === false, the TTL check treats the registration as still valid (no expiration)
  • Requires navigation: even when the TTL expires, the user must navigate to trigger service worker re-registration
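
To make these constraints concrete, here is a minimal sketch of a fetch-time TTL check of this shape. This is illustrative only, not the actual service-worker-gateway code; `REGISTRATION_TTL_MS` and `getInstalledAt()` are hypothetical names.

```ts
// Illustrative sketch only -- not the real implementation.
declare const self: ServiceWorkerGlobalScope
declare function getInstalledAt (): Promise<number> // hypothetical persisted install timestamp

const REGISTRATION_TTL_MS = 24 * 60 * 60 * 1000 // hypothetical TTL value

function isRegistrationValid (installedAt: number): boolean {
  // "TTL disabled when offline": the check short-circuits to "still valid"
  if (!navigator.onLine) {
    return true
  }
  return (Date.now() - installedAt) < REGISTRATION_TTL_MS
}

self.addEventListener('fetch', (event) => {
  // "Only runs during fetch events": an idle SW never notices expiry
  event.respondWith((async () => {
    if (!isRegistrationValid(await getInstalledAt())) {
      // "Requires navigation": unregistering here still needs a later
      // navigation before a fresh SW is registered and takes control
      await self.registration.unregister()
    }
    return fetch(event.request)
  })())
})
```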

Analysis

Assuming we implement badbits blocking in the service worker gateway:

The Only Real Benefit of TTL

Content Takedown for Offline Users: The TTL mechanism can remove access to cached content (including potentially problematic content) for users who go offline and never come back online. Without TTL, these users would retain access to cached content indefinitely. Note, however, that the user needs to explicitly attempt to load the content again for the TTL check to fire.

The Real Cost of TTL

Degraded Offline Experience: Users who rely on the service worker for offline functionality may lose access to cached content when the TTL expires, even if they're still offline and the content is still valid.

Key Insight

The serviceWorkerRegistrationTTL is a content takedown mechanism that works by unregistering the service worker after a time period, which removes access to all cached content. This is useful for:

  1. Legal compliance: Removing access to content that has been flagged as problematic

Summary

The "bad deployment recovery" argument is false

TTL doesn't help with nasty bugs, because users still won't get updates unless they meet the same browser update requirements (online + navigation). The only way to ship a non-updatable service worker is to change the SW path used in registration, which we've already solved with the namespaced ipfs-sw-sw.js path. (Don't ever change this; there is a test that confirms it exists in the output. We could probably also override the types for navigator.serviceWorker.register and, in that type override, link to this issue.)
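
One cheap way to encode that "don't ever change this" invariant in types, beyond the existing test, could be a thin wrapper whose signature only accepts the pinned path. A hypothetical sketch, not existing project code:

```ts
// Hypothetical guard for the pinned SW path -- illustrative only.
// The literal type means registering a renamed script fails to compile.
const SW_PATH = '/ipfs-sw-sw.js' as const

async function registerGatewaySw (path: typeof SW_PATH = SW_PATH): Promise<ServiceWorkerRegistration> {
  // Changing this path would strand existing users on a SW that never updates.
  // See: https://github.com/ipfs/service-worker-gateway/issues/724
  return navigator.serviceWorker.register(path)
}
```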

TTL degrades experience for legitimate users

Since we have no way to know if a user has accessed badbits content, TTL will break offline functionality for users accessing legitimate IPFS content.

More thoughts...

IANAL -- but if we provide the updated code, the above logic tables should be sufficient to prove that a user who keeps access to that content is doing so intentionally. And if they do choose to do so, TTL will only break them if they explicitly reload the page, which we have little control over. By breaking users who explicitly choose to keep bad content, we are also breaking the 99% of users who are not doing so.


The below is very related to https://github.com/ipfs/service-worker-gateway/issues/72.

I think it's also worth noting that badbits are already filtered out with the default config (trustless-gateway), though we are still making network requests to peers that may not be blocking badbits.

We should find some solution to block badbits in the service-worker gateway, and then remove TTL. Still, even with a badbits check in the service worker, users can choose to block their service worker from ever updating, and thereby deny any future updates to what is blocked. I think this would be acceptable.

Some potential solutions for badbits in the service worker:

  • We could implement badbits on the hosting layer so that subdomain and path requests to blocked content always return an error. This would let us quickly "hotfix" the badbits list before the public badlist gets updated, but users could still get around it if they explicitly want to.
  • We could host a badbits server that serves cacheable requests/responses for badbits checks.
  • We could implement badbits in the service worker directly.
    • This requires a smart Xor filter + MPHF + fingerprint.
    • The list is ~30 MB now; we can't fit that in the service worker.
    • A rough estimate for a combined xor filter + MPHF + 32-bit fingerprint (for 500k entries in the badbits list) is about 2.5 MiB: `S_mib(numKeys) = numKeys * (xorBitsPerKey/*9*/ + mphfBitsPerKey/*2.62*/ + fingerprintBitsPerKey/*32*/) / (8 * 2^20)` -- see the worked calculation after this list.
      • This setup would give us, based on 614M (see non-unique) requests to IPFS gateways daily, roughly an 18.4% probability of at least one false positive in a year (i.e. blocking an item that's not in the badbits list).
      • Bumping to a 40-bit fingerprint (SW size 2.94 MiB) would give a 0.0796% probability of at least one FP in a year.
      • Bumping to a 64-bit fingerprint (SW size 4.32 MiB) would give a 0.0000000000475% probability of at least one FP in a year.
    • Note that inbrowser.link does not receive this much traffic, and that the badbits list will grow.
  • We could extend the badbits check into a consensus service provided by Kubo and other nodes... this would be a little overkill, I think.
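
For reference, here is a worked version of the `S_mib` formula above, using only the parameters quoted in it (9 bits/key for the xor filter, 2.62 bits/key for the MPHF, 500k entries). Note it yields ≈2.6 MiB for the 32-bit case, in line with the "about 2.5 MiB" ballpark, and slightly higher figures than the rough numbers quoted for the larger fingerprints:

```ts
// Direct transcription of the size estimate above; all constants come from
// the issue text, not from a real implementation.
function filterSizeMiB (numKeys: number, fingerprintBits: number): number {
  const xorBitsPerKey = 9      // xor filter overhead per key
  const mphfBitsPerKey = 2.62  // minimal perfect hash function overhead per key
  const bitsPerKey = xorBitsPerKey + mphfBitsPerKey + fingerprintBits
  return (numKeys * bitsPerKey) / (8 * 2 ** 20) // bits -> MiB
}

console.log(filterSizeMiB(500_000, 32).toFixed(2)) // ~2.60
console.log(filterSizeMiB(500_000, 40).toFixed(2)) // ~3.08
console.log(filterSizeMiB(500_000, 64).toFixed(2)) // ~4.51
```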

SgtPooki avatar Sep 05 '25 18:09 SgtPooki

Thanks for writing this down.

If I understand the gist of this analysis, the recommendations are:

  1. Remove SW TTL (timebomb)
    • 👍 makes sense
    • short-term: #841 fixes how timebomb is executed, should be good for now
    • long-term: removal of timebomb sgtm
      • in my mind it was never a mechanism for protecting from badbits (if phishing loaded, it's already game over; expiring it does not matter, the user already got tricked) -- it was mainly a way to avoid waiting 24h for users to update if we shipped broken JS (which, to be fair, happened in early iterations)
  2. Implement native badbits support
    • ⚠️ this one is a can of worms; we need a wider ecosystem plan first. Most users will not run a public open gateway.
      • the current approach at inbrowser.link applies badbits before returning the service worker installer (reusing what we already do on dweb.link)
      • unsure how feasible native support is:
        • badbits being published as a bloomfilter-thing sgtm, but it depends on extra dev work upstream in the https://badbits.dwebops.pub pipeline, plus writing a spec for the filter format, similar to https://specs.ipfs.tech/compact-denylist-format/
        • once we have the filter produced, and a client capable of reading it, this would have to be continuous: fetching a 3MiB bloomfilter-thing one time is not enough; we would have to check for a new list every 5 minutes, the same way we do on ipfs.io
          • badbits updates multiple times a day, which could mean an extra 3MiB being fetched in the background multiple times a day
      • personally, I'm leaning towards badbits being more like a "family filter DNS resolver" (example) -- implemented not on the client, but on the delegated routing endpoint
        • https://delegated-ipfs.dev/routing/v1/ would remain as-is, unfiltered, but the public gateway service worker at inbrowser.link would switch to a dedicated moderated instance at https://filtered.delegated-ipfs.dev/routing/v1/ which skips results for blocked CIDs and domains (rough sketch below)
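
A minimal sketch of that endpoint swap, assuming the existing `@helia/delegated-routing-v1-http-api-client` package; `filtered.delegated-ipfs.dev` is the hypothetical moderated instance named above and does not exist yet:

```ts
// Sketch only: same delegated routing client, different endpoint.
import { createDelegatedRoutingV1HttpApiClient } from '@helia/delegated-routing-v1-http-api-client'

const unfiltered = createDelegatedRoutingV1HttpApiClient(new URL('https://delegated-ipfs.dev'))
const filtered = createDelegatedRoutingV1HttpApiClient(new URL('https://filtered.delegated-ipfs.dev'))

// A public gateway deployment (e.g. inbrowser.link) would hand `filtered` to
// its Helia instance, while other deployments keep `unfiltered` or their own.
```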

lidel avatar Sep 08 '25 23:09 lidel

> which, to be fair, happened in early iterations

Yea, it sure did. We didn't have a locked SW name at that time; the SW name contained the chunk hash, and then we renamed it once more. Now, as long as ipfs-sw-sw.js doesn't change names, users will always get the latest version if they're online and attempt to access it.

> we would have to check for new list every 5 minutes

I was not aware it was updated this frequently. That definitely changes things... but I think it would be possible to build a mini badbits CDN for a compressed JS filter, plus a client-side loader for that filter, so folks can implement checks in JS land that do the right thing (rough sketch below).
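
Something like the following, where the URL and binary format are made up purely for illustration:

```ts
// Hypothetical loader for a compressed badbits filter served from a CDN.
let filterBytes: Uint8Array | null = null

async function loadBadbitsFilter (): Promise<Uint8Array> {
  if (filterBytes != null) {
    return filterBytes
  }
  // Short-lived HTTP caching (e.g. Cache-Control: max-age=300) would let
  // clients re-check every ~5 minutes without re-downloading unchanged data.
  const res = await fetch('https://badbits-cdn.example/filter.bin')
  if (!res.ok) {
    throw new Error(`failed to fetch badbits filter: ${res.status}`)
  }
  filterBytes = new Uint8Array(await res.arrayBuffer())
  return filterBytes
}
```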

> [dns filter] on the delegated routing endpoint

Great idea. That should solve Helia/JS needs, especially if we're not doing DHT things or connecting to peers directly. This would be a much easier step forward for filtering badbits in the JS ecosystem.

SgtPooki avatar Sep 15 '25 14:09 SgtPooki