ipns: only use cache for prefetching, then confirm with the actual DHT
I had a report from a user complaining about stale IPNS records.
After trying a small repro on my machine, I found it is fairly easy to get 1h+ sync times, because we cache IPNS records for 1h without ever challenging them.
The IPNS code is surprising: it implements rollback-based resolution. It can accept bad candidates early and start the next recursive resolution; then, when it gets better candidates, it can roll back and reconcile the previous name resolution with the newer, better results (or not, if the better candidates line up with the existing timeline). This allows it to optimistically start with bad candidates and later either validate them, or cancel and restart with better ones.
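For intuition, here is a hand-wavy, self-contained sketch of that rollback loop. Every name in it (`candidate`, `betterThan`, `resolveRecursively`) is invented for illustration, with "better" modeled as a newer sequence number; the real implementation is linked below.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// Invented stand-in: a candidate is "better" when it carries a newer record.
type candidate struct{ sequence uint64 }

func (c candidate) betterThan(o candidate) bool { return c.sequence > o.sequence }

// Stand-in for the recursive resolution step started from a candidate.
func resolveRecursively(ctx context.Context, c candidate) {
	select {
	case <-ctx.Done():
		fmt.Printf("rolled back resolution started from seq %d\n", c.sequence)
	case <-time.After(time.Second):
		fmt.Printf("finished resolution started from seq %d\n", c.sequence)
	}
}

// Each incoming candidate may cancel the in-flight recursive resolution and
// restart it from the better value; worse candidates are simply ignored.
func resolveWithRollback(ctx context.Context, candidates <-chan candidate) {
	cancel := func() {}
	var best candidate
	haveBest := false
	for c := range candidates {
		if haveBest && !c.betterThan(best) {
			continue // lines up with the existing timeline: keep current work
		}
		best, haveBest = c, true
		cancel() // roll back the work started from the worse candidate
		var subCtx context.Context
		subCtx, cancel = context.WithCancel(ctx)
		go resolveRecursively(subCtx, c)
	}
	time.Sleep(2 * time.Second) // demo only: let the last resolution finish
	cancel()
}

func main() {
	ch := make(chan candidate, 2)
	ch <- candidate{sequence: 1} // optimistic early (possibly bad) candidate
	ch <- candidate{sequence: 5} // better candidate arrives: triggers a rollback
	close(ch)
	resolveWithRollback(context.Background(), ch)
}
```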
I think the cache should be used as a first untrusted candidate: when a cache entry is found, we add it as the first candidate in the rollback resolution process, but then we proceed with the rollback resolution as usual.
This would also combine nicely with #397 (although either of the two is a good improvement on its own).
For people interested in looking into fixing it: the whole rollback resolution code lives at https://github.com/ipfs/boxo/blob/76d9292da780aaa8c1904cc73e3e03d87a46557e/namesys/utilities.go#L34
Through interfaces, it then calls back into https://github.com/ipfs/boxo/blob/76d9292da780aaa8c1904cc73e3e03d87a46557e/namesys/namesys.go#L156
Here is where the cache is checked: https://github.com/ipfs/boxo/blob/76d9292da780aaa8c1904cc73e3e03d87a46557e/namesys/namesys.go#L175-L181
See the `return out`: I think the early return should be removed and the code should continue executing below (after writing the cached entry to the channel), along the lines of the sketch below.
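Concretely, a minimal sketch of what that could look like, only approximating the shape of the linked `resolveOnceAsync` (type and helper names are simplified from the real code):

```go
// Sketch only: names approximate the linked namesys code, not a real diff.
func (ns *namesys) resolveOnceAsync(ctx context.Context, name string) <-chan onceResult {
	out := make(chan onceResult, 2) // room for the cached candidate plus the live result

	if cached, ok := ns.cacheGet(name); ok {
		// Serve the cached record as an optimistic first candidate...
		out <- onceResult{value: cached}
		// ...but do NOT `close(out); return out` here (the early return to
		// remove). Fall through so the actual lookup below can confirm the
		// cached record, or let the rollback resolution supersede it.
	}

	go func() {
		defer close(out)
		// ... existing code: query the actual resolver (e.g. the DHT) and
		// forward its results on `out`; the rollback loop in
		// namesys/utilities.go reconciles them with the cached candidate.
	}()
	return out
}
```

With that change, the cache only saves latency on the first candidate instead of suppressing the DHT query for the whole TTL window.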
> it is fairly easy to get 1h+ sync times, because we cache IPNS records for 1h without ever challenging them.
When one publishes an IPNS record with `ipfs name publish`, the implicit default caching hint is `--ttl 1h`, which seems to align with your observation (the "stale" record is cached for ~1h before we check for an update).
What was the TTL of the record you used in your test? Do you still see the ~1h sync problem if you resolve an IPNS name published with `--ttl` set to `1m`?
My mental model is that IPNS caching based on TTL is a feature similar to TTL in DNS records: services should leverage the TTL for caching, avoid work, and only check for updates when the TTL window expires.
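As a toy illustration of that model (the types here are invented for this example, not boxo APIs): within the TTL window a cached value is served without any lookup, and only after it expires does the resolver go back to the network.

```go
package main

import (
	"fmt"
	"time"
)

// Invented for illustration: a cached resolution result plus its TTL.
type cacheEntry struct {
	value    string
	cachedAt time.Time
	ttl      time.Duration
}

// Serve from cache only while the record's TTL window is still open.
func fromCache(entries map[string]cacheEntry, name string) (string, bool) {
	e, ok := entries[name]
	if !ok || time.Since(e.cachedAt) >= e.ttl {
		return "", false // miss or TTL expired: caller must do a fresh lookup
	}
	return e.value, true // still fresh by the publisher's TTL: no network work
}

func main() {
	entries := map[string]cacheEntry{
		// A record published with the default --ttl 1h stays "fresh" for an hour.
		"/ipns/k51...": {value: "/ipfs/bafy...", cachedAt: time.Now(), ttl: time.Hour},
	}
	fmt.Println(fromCache(entries, "/ipns/k51...")) // cache hit until the hour is up
}
```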
> I think the cache should be used as a first untrusted candidate: when a cache entry is found, we add it as the first candidate in the rollback resolution process, but then we proceed with the rollback resolution as usual.
Are you suggesting we should always do a lookup for updated records, even when we have a valid record cached with its TTL still in effect? Or did I miss some nuance around rollback?
cc @hacdias since we made cache changes related to TTL around Kubo 0.24; I believe our intention was to avoid doing lookups for updates if we have a valid result in cache for less than the TTL window.
> Do you still see the ~1h sync problem if you resolve an IPNS name published with `--ttl` set to `1m`?
No. However, this is not controlled by the user in question: they are running a gateway, and their users are complaining that IPNS records don't update as fast as on ipfs.io. My guess is that their users are being load-balanced across different ipfs.io instances and thus not experiencing caching.
> My mental model is that IPNS caching based on TTL is a feature similar to TTL in DNS records: services should leverage the TTL for caching, avoid work, and only check for updates when the TTL window expires.
Then should `ipfs name publish` not have a default `--ttl`, and instead ask the user to provide one? Or maybe the default should be `1m`?
> Are you suggesting we should always do a lookup for updated records, even when we have a valid record cached with its TTL still in effect? Or did I miss some nuance around rollback?
Yes; maybe a 10s–1m hard cache could still be useful, but yes.
> cc @hacdias since we made cache changes related to TTL around Kubo 0.24; I believe our intention was to avoid doing lookups for updates if we have a valid result in cache for less than the TTL window.
Yes, that was the goal.
> Then should `ipfs name publish` not have a default `--ttl`, and instead ask the user to provide one? Or maybe the default should be `1m`?
This has already been discussed: https://github.com/ipfs/specs/pull/371. The idea was to make it similar to what is already expected from DNS. I don't think the default is bad: the publisher of the website should be responsible for setting the best TTL for their use case. If they are changing the website all the time, they should also create records with a very short TTL.
Triage notes (updated by @lidel):

- Caching is useful when used correctly, and should NOT be disabled by default.
- The TTL of an IPNS record is separate from its signature expiration date. Both are something the publisher controls, not the gateway, but only the TTL can be adjusted on the gateway (just like a DNS record's TTL is up to the publisher and not the DNS server, but a resolver can put a "max ceiling" on it as allowed by RFC 2181).
- What eth.limo really wants is to ignore TTL, so to do that they need to either disable IPNS caching, or set a "max cache TTL" similar to `DNS.MaxCacheTTL`.
- Next steps here (see the sketch after this list):
  - First, we should add `IPNS.MaxCacheTTL` and make it work the same way the already existing `DNS.MaxCacheTTL` does (this should be part of boxo/namesys, as an optional feature that puts a ceiling on the TTL used for caching of IPNS records).
  - We could also change the config option `Ipns.ResolveCacheSize` to the Priority type, such that caching can be disabled by setting the JSON value to `false`, but `IPNS.MaxCacheTTL` feels more useful (operators can still adjust it to something low and useful, like 1 minute).
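A sketch of how such a ceiling could behave, under the assumption that `IPNS.MaxCacheTTL` works like the existing `DNS.MaxCacheTTL` (the option and function names here are illustrative, not shipped code):

```go
package main

import (
	"fmt"
	"time"
)

// Cap the publisher-provided record TTL with an operator-configured ceiling
// before caching, mirroring what DNS.MaxCacheTTL does for DNSLink records.
// A nil ceiling means "no limit configured": honor the record's own TTL.
func effectiveCacheTTL(recordTTL time.Duration, maxCacheTTL *time.Duration) time.Duration {
	if maxCacheTTL != nil && recordTTL > *maxCacheTTL {
		return *maxCacheTTL // the gateway operator's ceiling wins
	}
	return recordTTL
}

func main() {
	ceiling := time.Minute
	// A record published with --ttl 1h is cached for at most 1 minute:
	fmt.Println(effectiveCacheTTL(time.Hour, &ceiling)) // 1m0s
	// Without a configured ceiling, the record's own TTL is honored:
	fmt.Println(effectiveCacheTTL(time.Hour, nil)) // 1h0m0s
}
```

Setting the ceiling to something small (e.g. 1 minute) would give operators like eth.limo fast-updating names without disabling caching entirely.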