realm-core
realm-core copied to clipboard
Revisit client reset errors
I believe all fatal errors should be considered client reset errors. Currently, we have a fairly peculiar behavior when a user gets a bad changeset, which leaves their Realm in an unsyncable state, but they can't utilize the recovery mechanisms in the client reset handler to restore app functionality. While there's some value in differentiating developer errors from our own errors, we should not be sacrificing end user experience.
https://github.com/realm/realm-core/blob/377a85d2a385a31ec91be7e5fe8c09d22365df97/src/realm/sync/config.cpp#L138-L152
@ironage can you comment on this?
A lot of these errors dont really exist any more, but I agree with the basic sentiment. Things like BadChangeset should be considered client reset errors it seems like. Otherwise im not sure how a client would get out of the situation
We can certainly revisit the classification of individual errors and our error handling in general. But I am not convinced that everything should cause a client reset. I think it is useful to have a fatal error when it is known that retrying will not help (eg: bad_query, switch_to_pbs, switch_to_flx_sync, wrong_protocol_version, unknown_message, bad_syntax...)
For BadChangeset specifically, reapplying the same changeset will most likely result in the same error. For this reason, attempting recovery is not recommended. However, developers are free to apply their client reset logic manually at any time, including on the event of a fatal sync error. They just have to remove all references to the Realm, delete it (optionally back it up) and attempt to open that location again with the normal settings. But it is not something I'd recommend due to the likelihood of creating an error loop.
I think the "they just have to remove all references to the Realm, delete it..." part is rather optimistic. Typically when a client reset happens, the app is in the middle of something and for a modular application like most modern ones are, it's non-trivial to centrally shut down all components that may or may not be using Realm at this time. The fact that we also have a distinction between client reset errors and "other errors" guides the developer into thinking that they only need to worry about the client reset ones, so it's unlikely that anyone has thought of deep-diving into our error codes and special casing errors that are fatal but not considered a client reset.
At the end of the day, our goal should be to provide a good user experience for end users, even when the developer messes up. As things currently stand, uploading a changeset that the server rejects cuts you off sync until you uninstall the app and install it again (assuming there's no OS-provided data recovery). That's terrible UX and even if we risk creating error loops in certain situations, it'll still help more users than it'll harm.
I agree it is not trivial to handle manual client resets right now. We are both reducing the complexity and the frequency of occurrence with upcoming features. But we cannot entirely eliminate the possibility of a fatal error.
In my opinion, if the server rejects a changeset which the client believes to be valid, reinstalling the app may actually be the best course of action because that allows the app to upgrade which can solve a host of problems.
I agree that we have a lot of room to improve the UX around error handling, but I'm wary of providing client reset as a silver bullet.
How about instead of making everything a client reset error, we provide a new category of sync errors which might possibly be recoverable by triggering a DiscardLocal
client reset. I suggest that BadChangeset
falls into this category. This produces three categories:
- Fatal sync error (eg. bad_query, switch_to_flx_sync, wrong_protocol_version etc). There is no good way to handle this other than to reinstall/upgrade app.
- Recoverable sync error if data loss is accepted (eg. bad_changeset). The developer can choose to handle this by explicitly triggering a
DiscardLocal
client reset. If this error is encountered again within some threshold, it becomes a fatal sync error. - Automatically handled sync errors (eg. all currently designated client reset errors). This is handled according to the config's client reset policy as normal.
Would an acceptable compromise be to allow users to manually trigger a client reset? So if they receive an error message that wouldn't automatically kick off a client reset they can tell the sync client to behave as though they had received a client reset error. Seems like if our answer to all kinds of problems is to just DiscardLocal reset yourself (which I'm not sure how I feel about as a policy decision, but it seems like where we are right now), we should make it easy to do that?
I think this would be wrong. It increases the burden on users to understand every single error code and figure out what to do when they occur and increases the burden on us to explain it to them and then support them when they mess up.
Well I think just making all error codes trigger a client reset is wrong. Some of these errors indicate that real invariants in the sync protocol have been violated and it is unsafe to continue. If you get say "bad_compression" or "bad_message_order" - that means something has gone really wrong with how you're communicating with the server and we need to fix it. Client resets are unlikely to actually resolve any issues there. bad_changeset could go either way - it is indicative of real corruption caused by some underlying bug in our software - client resetting may allow you to make forward progress, but it might not.
If we can't say whether a client reset is the right response to a bad changeset error, how would a user be able to? For some of these the discussion is purely theoretical and I don't think it makes sense to spend too much time on it. E.g. I don't think we've ever seen bad_compression
or bad_message_order
out in the wild.
But my overarching point is that right now our SDKs expose two types of errors:
- Transient sync session errors that the sync session should eventually recover from.
- Client reset errors.
The expectation is that users should not do anything about 1. and initiate the client reset mechanism for 2. (either automatically or manually). If there's a 3rd category of errors where the server tells the client that it'll never allow it to sync, then we're not surfacing that in an actionable way. I think it might be a good idea to brainstorm this during the offsite together with the cloud team as it had recently come up there as well.
Another consideration: in partition based sync, it's highly likely that a client that re-downloads the partition again after encountering a bad changeset error will just end up getting the same error, since it'll just be receiving the same history data each time. In flexible sync, for bootstrapping clients the server synthesizes changesets directly from the state store so they are always guaranteed to be valid, and thus should prevent nearly all cases of "bad changeset". In other words, in flexible sync, handling a bad changeset error by performing a client-reset should get the client back up and running again, but in partition based sync a client is likely to never be able to sync once the bad changeset error appears (until it's repaired, either manually or via a lucky compaction run).
A bad changeset can be triggered by an UPLOAD message in which case a client reset that discards the local realm will recover the client.
I think we're talking about two different bad changeset error pathways then: one in which the server rejects an incoming upload, vs. the client rejecting an incoming download - the case I was referring to above is just the latter.
Yeah, my point was that neither currently results in a client reset.
With the download path, there are realistic cases where a client reset could still recover the client. For example, there may be a bug in Core that randomly ignores CREATE_OBJECT instructions. In this case the client would be erroneously rejecting incoming downloads based on its current state. When the bug is fixed, a client reset and a fresh download of the otherwise valid server history would recover connectivity for that client.
➤ Jonathan Reams commented:
Assigning to [[email protected]] since he'll be taking on the error handling improvement work.
➤ Daniel Tabacaru commented:
The sync server is now sending an action the client follows. As part of that work all errors were mapped to these actions and, for instance, bad_changeset now triggers ClientResetNoRecovery.
@nirinchev A lot has changed since this issue was filled. We now have seven categories of errors based on the action the server sends: https://github.com/realm/realm-core/blob/master/src/realm/sync/protocol.hpp#L215-L221. Some of the errors have a different behavior (see example above). I propose we close the issue unless you object.
Yes, that's likely no longer relevant, let's close.