Realm Sync: Weakness With Shared Relationships
Summary:
The first rule of Realm Sync ("deletes always win") can produce data loss when combined with a fairly ubiquitous model design pattern. This is not a bug, but rather a design decision in Realm that I think could be improved. I'm unsure if Core is the right place to file this; there does not seem to be a repo for Atlas Device Sync.
Context:
Consider these two objects (in Swift, here):
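    import RealmSwift

    // Each clip points at a shared Filepath instead of storing its own copy of the string.
    final class AudioClip: Object
    {
        @Persisted(primaryKey: true) var _id: UUID
        @Persisted var path: Filepath?
    }

    // One Filepath is shared by many AudioClips; the inverse relationship lists them.
    final class Filepath: Object
    {
        @Persisted(primaryKey: true) var _id: UUID
        @Persisted var path: String = ""
        @Persisted(originProperty: "path") var associatedAudioClips: LinkingObjects<AudioClip>
    }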
Suppose we have millions of AudioClip objects in our database but only a few thousand Filepath objects. Instead of duplicating the same long, constant String (a file path) millions of times, we store that String only once and "re-use" it across many AudioClips.
(I do understand that Mongo's advice is "prefer embedding". But in this case, we'd be duplicating the exact same String millions of times, so the size savings are significant.)
The Issue:
When our app deletes an AudioClip, it would like to make sure no orphaned Filepath objects remain. So the app might inspect AudioClip.path and if associatedAudioClips has only one entry, the app can delete the Filepath object that's no longer used by any other AudioClip.
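For concreteness, the cleanup described above might look roughly like the sketch below. The helper name deleteClipAndOrphanedPath is hypothetical, and the check only sees references that exist on this device at the time of the write, which is exactly where the trouble starts.

```swift
import RealmSwift

// Same models as in the issue, repeated so the sketch stands alone.
final class AudioClip: Object {
    @Persisted(primaryKey: true) var _id: UUID
    @Persisted var path: Filepath?
}

final class Filepath: Object {
    @Persisted(primaryKey: true) var _id: UUID
    @Persisted var path: String = ""
    @Persisted(originProperty: "path") var associatedAudioClips: LinkingObjects<AudioClip>
}

/// Hypothetical helper: deletes `clip` and, if it was the only clip on THIS
/// device referencing its Filepath, deletes that Filepath as well.
func deleteClipAndOrphanedPath(_ clip: AudioClip, in realm: Realm) throws {
    try realm.write {
        let filepath = clip.path
        // Local view only: clips created on other, unsynced devices are invisible here.
        let isLastLocalReference = filepath?.associatedAudioClips.count == 1
        realm.delete(clip)
        if let filepath = filepath, isLastLocalReference {
            realm.delete(filepath)
        }
    }
}
```

The delete of the Filepath is the operation that later "wins" on the server, regardless of what other devices did in the meantime.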
But, suppose the app is offline when that delete occurs. During that offline period, another user on a different device adds a new AudioClip item and sets path to reference the Filepath object that has been deleted in the offline session.
When the offline user reconnects, the first rule of the conflict resolution algorithm kicks in and the Filepath is deleted, leaving the new AudioClip item that the second user added with nil for its path property—unexpected corruption.
The worst part: There is never a "safe" time to delete a Filepath object. The only way to do so with 100% certainty that no database corruption will occur is to disable sync, perform the deletes, then force a client-reset on all users.
A Fix?
"Deletes always win" works well when the conflict is: "One person changed a property of X and another person deleted X."
But in the case of relationships, sync needs a recycle bin. When a delete occurs, the object should not be actually vaporized until the expiration of the client-reset period configured in Atlas. Because, at any time during that window, another user may arrive at the sync server with a request to create a relationship to that deleted object. When the second user created that relationship, he had no way of knowing the destination object was "doomed". On his device (until the first user syncs up), the relationship is assigned and valid and, critically:
- The first user would not have performed the delete if he had knowledge of both changes.
- The second user would have created a different Filepath object, so that the path property is not nil, if he had known about both changes.
The sync server knows about both changes. And resolving the conflict by allowing the assignment after the delete still converges on a consistent end state for both users. And that end state is a better one than the current end state, where path on the newly-added AudioClip object is unexpectedly nil.
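To illustrate the convergence claim, here is a toy model in plain Swift. It is not Realm's actual merge algorithm; Op, Policy, State, and merge are all hypothetical names, and the model only covers the two operations in this scenario. Under either policy, both arrival orders reach the same end state, but only the "revive" policy keeps the new clip's link.

```swift
// Toy model of the two conflicting operations (NOT Realm's real sync protocol).
enum Op {
    case deleteFilepath(String)
    case linkClip(clip: String, filepath: String)
}

enum Policy { case deletesWin, reviveOnLink }

struct State: Equatable {
    var filepaths: Set<String>          // live Filepath ids
    var tombstones: Set<String> = []    // deleted ids the server still remembers
    var links: [String: String] = [:]   // clip id -> Filepath id
    var orphans: Set<String> = []       // clips whose path ended up nil
}

func merge(_ ops: [Op], into start: State, policy: Policy) -> State {
    var s = start
    for op in ops {
        switch op {
        case .deleteFilepath(let id):
            let referencing = s.links.filter { $0.value == id }.map { $0.key }
            // Proposed rule: a delete loses to a link the deleter could not have seen.
            if policy == .reviveOnLink && !referencing.isEmpty { continue }
            s.filepaths.remove(id)
            s.tombstones.insert(id)
            for clip in referencing {
                s.links[clip] = nil
                s.orphans.insert(clip)
            }
        case .linkClip(let clip, let id):
            if s.filepaths.contains(id) {
                s.links[clip] = id
            } else if policy == .reviveOnLink && s.tombstones.contains(id) {
                // Proposed rule: a link to a tombstoned object brings it back.
                s.tombstones.remove(id)
                s.filepaths.insert(id)
                s.links[clip] = id
            } else {
                s.orphans.insert(clip)   // current behavior: path becomes nil
            }
        }
    }
    return s
}

// Usage: the same two operations, arriving at the server in either order.
let start = State(filepaths: ["fp1"])
let deleteOp = Op.deleteFilepath("fp1")
let linkOp = Op.linkClip(clip: "clipB", filepath: "fp1")
let deleteFirst = merge([deleteOp, linkOp], into: start, policy: .reviveOnLink)
let linkFirst = merge([linkOp, deleteOp], into: start, policy: .reviveOnLink)
print(deleteFirst == linkFirst)
```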
Alternative
Look, I get it: sync is basically the hardest problem there is. And the Four Laws probably exist as they do for a reason. But this shared-reference pattern is very common and right now there's a giant pit waiting to snare people.
At the very least, if nothing can be done to make sync handle this better, this page should be updated with an example/warning. The current example of "delete vs. modify" is a very trivial one.
➤ PM Bot commented:
Jira ticket: RCORE-2117
An alternative design would be to "soft-delete" these objects. That is, add an optional date field called deletedOn and set it to the timestamp of the deletion. Clients then never create an AudioClip that links to a Filepath marked for deletion (i.e. they'd have to create a new Filepath if the only one available has deletedOn set). Finally, you can run a scheduled trigger that finds all Filepaths that have been marked for deletion for longer than some period (e.g. 90 days) and deletes them, after first migrating any AudioClips that may still point to them to a new Filepath.
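On the client side, the soft-delete idea could be sketched as follows. This is only an assumption about how an app might apply it: the Filepath model is trimmed to the relevant fields, and liveFilepath(for:in:) is a hypothetical helper, not a Realm API. The scheduled trigger that culls old soft-deleted objects would live in Atlas and is not shown.

```swift
import Foundation
import RealmSwift

final class Filepath: Object {
    @Persisted(primaryKey: true) var _id: UUID
    @Persisted var path: String = ""
    @Persisted var deletedOn: Date?   // nil = live, non-nil = soft-deleted at that time
}

/// Hypothetical helper; must be called inside a write transaction.
/// Returns a Filepath for `path` that is not marked for deletion, creating a
/// fresh one if the only matches are soft-deleted (or there are none).
func liveFilepath(for path: String, in realm: Realm) -> Filepath {
    let candidates = realm.objects(Filepath.self)
        .filter("path == %@ AND deletedOn == nil", path)
    if let existing = candidates.first {
        return existing
    }
    let fresh = Filepath()
    fresh.path = path
    realm.add(fresh)
    return fresh
}

/// "Deleting" just stamps the object; nothing is removed from the realm.
func softDelete(_ filepath: Filepath, in realm: Realm) throws {
    try realm.write { filepath.deletedOn = Date() }
}
```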
@nirinchev Clever! But the same problem persists: the user who is linking an AudioClip to a Filepath has no way to know if that latter object is already doomed. deletedOn could very well be nil, but is it nil because the object hasn't been deleted or simply because the user who deleted it hasn't synced up his change yet? Can't know. So it's still possible to create a relationship that will be vaporized behind my back.
Hence the need for a "final safety check" in the Trigger that culls soft-deleted objects after 90 days: have to check for any lingering references and then reassign them.
I'm sure a change to the "deletes win" rule would introduce other edge cases--even if the change applied only to relationships between objects. But again: the design pattern in this example is pretty common and the current dangers in how sync handles it aren't very obvious.
@bdkjones How do you distinguish between the object being deleted because no one is referencing it and the user deleting it for other reasons (as with a regular link)? In both cases, the user actually explicitly deletes the object, so it's hard to know the intent. We could annotate the type and use that info when doing conflict resolution.
@danieltabacaru Consider this scenario where we have Realm's usual example (an app with Owner and Dog and the usual relationship between them).
The Problem:
- There are two users of our app: Alice and Bob. We're a dog daycare center.
- At noon, Alice uses the app. She notes that one of our customers, Steve, had a dog that died. She removes that dog from the app. Since Steve has no other dogs, she decides to delete Steve from the app as well to keep our customer list fresh.
- At 11AM, Bob has no Internet connection and is using the app in the "offline" mode. He adds a new dog to the app and assigns Steve as its owner. Bob has information that Alice does not: Steve got a new dog.
- When Bob reconnects at 3PM, the owner Steve is vaporized and the new dog he added to the Realm is now orphaned: it has a nil owner and will probably just exist that way forever. (When someone sees that Steve is missing, they'll add a new Owner and a new Dog rather than ask, "Huh, I wonder if Steve's dog is already in the database and just not assigned to any owner at all?" Our app's UI isn't going to show "unassigned" dogs because those aren't supposed to exist!)
The Solution:
Clearly, the above sequence is not ideal. We end up with database corruption, orphaned objects, nil relationships that should never be nil, etc. And it's impossible to guard against as the developer: when Bob's app is adding the new Dog to Steve, everything is 100% valid and we have no way of knowing that, an hour in the future, Alice is going to delete Steve entirely and this new Dog is going to be orphaned.
The general fix is:
- When an object is deleted, it is retained for the duration of the client-reset period.
- If, during that period, a new reference to the deleted object is created, the object is "undeleted" and the relationship is fulfilled.
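The two rules above can be sketched as a small "recycle bin" in plain Swift. This is purely illustrative: RecycleBin and its methods are hypothetical names, the retention parameter stands in for the client-reset period, and nothing here reflects how the sync server is actually implemented.

```swift
import Foundation

/// Hypothetical server-side recycle bin for deleted objects.
struct RecycleBin {
    let retention: TimeInterval                 // stand-in for the client-reset period
    private(set) var live: Set<UUID>
    private(set) var tombstones: [UUID: Date] = [:]

    init(retention: TimeInterval, live: Set<UUID>) {
        self.retention = retention
        self.live = live
    }

    /// Rule 1: a delete moves the object into the bin instead of destroying it.
    mutating func delete(_ id: UUID, at now: Date) {
        guard live.remove(id) != nil else { return }
        tombstones[id] = now
    }

    /// Rule 2: a new reference to a binned object "undeletes" it.
    /// Returns true if the relationship can be fulfilled.
    mutating func reference(_ id: UUID, at now: Date) -> Bool {
        purgeExpired(at: now)
        if live.contains(id) { return true }
        guard tombstones.removeValue(forKey: id) != nil else { return false }
        live.insert(id)
        return true
    }

    /// Objects whose retention window has passed are gone for good.
    mutating func purgeExpired(at now: Date) {
        tombstones = tombstones.filter { now.timeIntervalSince($0.value) < retention }
    }
}

// Usage: Alice deletes Steve at noon; Bob's reference arrives three hours later.
let steve = UUID()
let noon = Date(timeIntervalSince1970: 0)
var bin = RecycleBin(retention: 24 * 3600, live: [steve])
bin.delete(steve, at: noon)
let fulfilled = bin.reference(steve, at: noon.addingTimeInterval(3 * 3600))
print(fulfilled, bin.live.contains(steve))
```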
Why:
Think about the ideal outcome. Here, Alice would not have deleted the Owner object if she had known what Bob knew. But consider the worst case, where the delete really should "win" and Bob's changes/additions are irrelevant: we end up needing to delete the Owner one more time after Bob's changes sync up.
That's a better outcome because it doesn't leave any orphaned, "invisible" objects in the realm.
Understanding the "intent" behind a delete isn't necessary. The priority should be resolving sync conflicts in a way that does not result in sudden, unanticipated orphaned objects and nil relationships.
@danieltabacaru Also worth pointing out that this problem really only manifests when multiple users are accessing the same database. Realm's original design seems to have been more focused on a single user accessing his own data. While it's still possible to hit this edge case under that condition, it's much less likely than it is when multiple users are all acting on a single database.
@bdkjones The example is clear and I see where you're coming from.
I suppose there is always the alternative to not delete the owners.
"But consider the worst case, where the delete really should "win" and Bob's changes/additions are irrelevant: we end up needing to delete the Owner one more time after Bob's changes sync up."
That's if Alice remembers that Steve was supposed to be deleted for good (and actually wasn't) and then does it again. Also, if ten Bobs assign a dog to Steve, then Alice may need to delete Steve ten times. It sounds like a footgun we don't want to support.
@danieltabacaru the alternative footgun is better? (10 Dog objects that are invisible and orphaned in the database?) Local work that is suddenly vaporized behind the user's back simply because his Internet connection lapsed for a few minutes? Would you accept a version of Microsoft Word that randomly erased your last paragraph?
In reality, this edge case is fairly narrow. It requires a user to be offline and unaware of a delete. If users are all online and sync is functioning properly, deletes propagate to other users' local realms virtually instantly, so it's not as if this change would suddenly make deletes no longer work.
@nirinchev suggested a manual approach to handle the problem above. Why put that work on developers instead of making it a built-in feature of Realm?
@danieltabacaru (As an alternative, I have previously requested an option that forbids writes when sync is offline. There appears to be no interest in that, however, because Realm is "offline first." It's also not bulletproof because sync can appear online but fail.)
The workaround I proposed is impossible to build into the database without knowledge of the business use case. Reference counting relationships is an inherently hard problem to solve in a general-purpose manner in a distributed system - e.g. since relationships are not a first-party feature of Mongo, we'd need to validate every write to the database against a list of tombstones to see if we need to revive anything. Similarly, those tombstones would need to be synchronized locally and somehow merged based on rules that are hard/impossible to encode in the database - e.g. in your audioclip example, we may have a user that revives a filepath after coming back online, but another user may have created the same filepath due to the original having been deleted. Now we'd need to merge those two together, or you'd end up with what appear to be duplicated filepaths. Since the id seems to be randomly generated, we'd need to use the path property, but there's nothing in the schema that would tell us that.
I want to be clear here, we've likely totaled years of discussions on how to do cascading deletes in a distributed system and have so far been unable to come up with an intuitive, easy to explain, and general-purpose approach here. We haven't shipped support for it not because we don't want to cover the case, but rather because we don't believe what we could offer would do a better job than whatever our users can implement with their specific business case in mind.
@nirinchev that's fair! Is there a set of APIs Realm could offer that would help developers here?
For example, could Realm post a notification or call a specific (optional) handler when the sync service propagates a delete that leaves other objects in an "invalid" state? Perhaps something like the live-collection change handlers: "Hey, the owner 'Steve' got deleted and here's this Dog that used to reference Steve. Decide what you want to do with it, if anything."
Just solving the "silently changes behind your back with zero warning" part of the problem would be a huge help. If you tell me about the conflict, I can implement my own logic to resolve it. As it stands, I can't do that because these destroyed relationships appear silently, without warning. I'd basically have to periodically scrub the entire database looking for corruption, like ZFS does.
"Perhaps something like the live-collection change handlers: "Hey, the owner 'Steve' got deleted and here's this Dog that used to reference Steve. Decide what you want to do with it, if anything.""
Can't you already do that by listening for changes to the Owner class and, whenever there is a deletion, querying for the Dogs that referenced it and deciding what to do?
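A rough sketch of that suggestion in Realm Swift, under some assumptions: the Owner and Dog models are the usual example shapes, and orphanedDogs and watchForOrphans are hypothetical helpers. One caveat is that a deleted Owner can no longer be read by the time the notification fires, so the handler cannot tell which owner a dog used to point at; it can only find dogs whose owner is now nil.

```swift
import RealmSwift

final class Owner: Object {
    @Persisted(primaryKey: true) var _id: UUID
    @Persisted var name: String = ""
}

final class Dog: Object {
    @Persisted(primaryKey: true) var _id: UUID
    @Persisted var name: String = ""
    @Persisted var owner: Owner?
}

/// Dogs whose owner link is nil, for example because the owner was deleted.
func orphanedDogs(in realm: Realm) -> Results<Dog> {
    realm.objects(Dog.self).filter("owner == nil")
}

/// Hypothetical helper: calls `onOrphans` whenever an Owner deletion (local or
/// synced) leaves at least one Dog without an owner. Keep the returned token
/// alive for as long as the observation should run.
func watchForOrphans(in realm: Realm,
                     onOrphans: @escaping ([Dog]) -> Void) -> NotificationToken {
    realm.objects(Owner.self).observe { change in
        switch change {
        case .update(_, let deletions, _, _):
            guard !deletions.isEmpty else { return }
            let orphans = Array(orphanedDogs(in: realm))
            if !orphans.isEmpty { onOrphans(orphans) }
        default:
            break   // initial load or error: nothing to resolve here
        }
    }
}
```

This addresses the "no warning" part only for apps where a nil owner is never a valid state; if some dogs are legitimately unowned, the query would need an extra marker to tell the two cases apart.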
@bdkjones Is there anything else we can help you with here?