realm-core Revisit client reset errors

I believe all fatal errors should be considered client reset errors. Currently, we have a fairly peculiar behavior when a user gets a bad changeset, which leaves their Realm in an unsyncable state, but they can't utilize the recovery mechanisms in the client reset handler to restore app functionality. While there's some value in differentiating developer errors from our own errors, we should not be sacrificing end user experience.

https://github.com/realm/realm-core/blob/377a85d2a385a31ec91be7e5fe8c09d22365df97/src/realm/sync/config.cpp#L138-L152

Feb 21 '22 11:02 nirinchev

@ironage can you comment on this?

Feb 22 '22 09:02 jedelbo

A lot of these errors dont really exist any more, but I agree with the basic sentiment. Things like BadChangeset should be considered client reset errors it seems like. Otherwise im not sure how a client would get out of the situation

Feb 22 '22 18:02 tkaye407

We can certainly revisit the classification of individual errors and our error handling in general. But I am not convinced that everything should cause a client reset. I think it is useful to have a fatal error when it is known that retrying will not help (eg: bad_query, switch_to_pbs, switch_to_flx_sync, wrong_protocol_version, unknown_message, bad_syntax...)

For BadChangeset specifically, reapplying the same changeset will most likely result in the same error. For this reason, attempting recovery is not recommended. However, developers are free to apply their client reset logic manually at any time, including on the event of a fatal sync error. They just have to remove all references to the Realm, delete it (optionally back it up) and attempt to open that location again with the normal settings. But it is not something I'd recommend due to the likelihood of creating an error loop.

Feb 22 '22 20:02 ironage

I think the "they just have to remove all references to the Realm, delete it..." part is rather optimistic. Typically when a client reset happens, the app is in the middle of something and for a modular application like most modern ones are, it's non-trivial to centrally shut down all components that may or may not be using Realm at this time. The fact that we also have a distinction between client reset errors and "other errors" guides the developer into thinking that they only need to worry about the client reset ones, so it's unlikely that anyone has thought of deep-diving into our error codes and special casing errors that are fatal but not considered a client reset.

At the end of the day, our goal should be to provide a good user experience for end users, even when the developer messes up. As things currently stand, uploading a changeset that the server rejects cuts you off sync until you uninstall the app and install it again (assuming there's no OS-provided data recovery). That's terrible UX and even if we risk creating error loops in certain situations, it'll still help more users than it'll harm.

Feb 22 '22 21:02 nirinchev

I agree it is not trivial to handle manual client resets right now. We are both reducing the complexity and the frequency of occurrence with upcoming features. But we cannot entirely eliminate the possibility of a fatal error.

In my opinion, if the server rejects a changeset which the client believes to be valid, reinstalling the app may actually be the best course of action because that allows the app to upgrade which can solve a host of problems.

I agree that we have a lot of room to improve the UX around error handling, but I'm wary of providing client reset as a silver bullet.

How about instead of making everything a client reset error, we provide a new category of sync errors which might possibly be recoverable by triggering a DiscardLocal client reset. I suggest that BadChangeset falls into this category. This produces three categories:

Fatal sync error (eg. bad_query, switch_to_flx_sync, wrong_protocol_version etc). There is no good way to handle this other than to reinstall/upgrade app.
Recoverable sync error if data loss is accepted (eg. bad_changeset). The developer can choose to handle this by explicitly triggering a DiscardLocal client reset. If this error is encountered again within some threshold, it becomes a fatal sync error.
Automatically handled sync errors (eg. all currently designated client reset errors). This is handled according to the config's client reset policy as normal.

Feb 22 '22 23:02 ironage

Would an acceptable compromise be to allow users to manually trigger a client reset? So if they receive an error message that wouldn't automatically kick off a client reset they can tell the sync client to behave as though they had received a client reset error. Seems like if our answer to all kinds of problems is to just DiscardLocal reset yourself (which I'm not sure how I feel about as a policy decision, but it seems like where we are right now), we should make it easy to do that?

Mar 09 '22 16:03 jbreams

I think this would be wrong. It increases the burden on users to understand every single error code and figure out what to do when they occur and increases the burden on us to explain it to them and then support them when they mess up.

Mar 09 '22 16:03 nirinchev

Well I think just making all error codes trigger a client reset is wrong. Some of these errors indicate that real invariants in the sync protocol have been violated and it is unsafe to continue. If you get say "bad_compression" or "bad_message_order" - that means something has gone really wrong with how you're communicating with the server and we need to fix it. Client resets are unlikely to actually resolve any issues there. bad_changeset could go either way - it is indicative of real corruption caused by some underlying bug in our software - client resetting may allow you to make forward progress, but it might not.

Mar 09 '22 16:03 jbreams

If we can't say whether a client reset is the right response to a bad changeset error, how would a user be able to? For some of these the discussion is purely theoretical and I don't think it makes sense to spend too much time on it. E.g. I don't think we've ever seen bad_compression or bad_message_order out in the wild.

But my overarching point is that right now our SDKs expose two types of errors:

Transient sync session errors that the sync session should eventually recover from.
Client reset errors.

The expectation is that users should not do anything about 1. and initiate the client reset mechanism for 2. (either automatically or manually). If there's a 3rd category of errors where the server tells the client that it'll never allow it to sync, then we're not surfacing that in an actionable way. I think it might be a good idea to brainstorm this during the offsite together with the cloud team as it had recently come up there as well.

Mar 09 '22 17:03 nirinchev

Another consideration: in partition based sync, it's highly likely that a client that re-downloads the partition again after encountering a bad changeset error will just end up getting the same error, since it'll just be receiving the same history data each time. In flexible sync, for bootstrapping clients the server synthesizes changesets directly from the state store so they are always guaranteed to be valid, and thus should prevent nearly all cases of "bad changeset". In other words, in flexible sync, handling a bad changeset error by performing a client-reset should get the client back up and running again, but in partition based sync a client is likely to never be able to sync once the bad changeset error appears (until it's repaired, either manually or via a lucky compaction run).

Mar 10 '22 13:03 mpobrien

A bad changeset can be triggered by an UPLOAD message in which case a client reset that discards the local realm will recover the client.

Mar 10 '22 13:03 nirinchev

I think we're talking about two different bad changeset error pathways then: one in which the server rejects an incoming upload, vs. the client rejecting an incoming download - the case I was referring to above is just the latter.

Mar 10 '22 13:03 mpobrien

Yeah, my point was that neither currently results in a client reset.

With the download path, there are realistic cases where a client reset could still recover the client. For example, there may be a bug in Core that randomly ignores CREATE_OBJECT instructions. In this case the client would be erroneously rejecting incoming downloads based on its current state. When the bug is fixed, a client reset and a fresh download of the otherwise valid server history would recover connectivity for that client.

Mar 10 '22 13:03 nirinchev

➤ Jonathan Reams commented:

Assigning to [[email protected]] since he'll be taking on the error handling improvement work.

Jun 13 '22 14:06 sync-by-unito[bot]

➤ Daniel Tabacaru commented:

The sync server is now sending an action the client follows. As part of that work all errors were mapped to these actions and, for instance, bad_changeset now triggers ClientResetNoRecovery.

Aug 04 '22 15:08 sync-by-unito[bot]

@nirinchev A lot has changed since this issue was filled. We now have seven categories of errors based on the action the server sends: https://github.com/realm/realm-core/blob/master/src/realm/sync/protocol.hpp#L215-L221. Some of the errors have a different behavior (see example above). I propose we close the issue unless you object.

Sep 27 '22 17:09 danieltabacaru

Yes, that's likely no longer relevant, let's close.

Sep 27 '22 23:09 nirinchev

realm-core realm-core copied to clipboard

Revisit client reset errors

realm-core
realm-core copied to clipboard