Inconsistency in backup in active-active replica cluster
I did this
I have two replicas in my current cluster. One is marked as primary and, in case of inconsistencies, the secondary syncs automatically with it. Both are configured in mutual-pull mode.
I had a problem with the database and had to restore it. I downscaled to one replica, restored the data there, and then created a new replica to sync with it.
I expected the following
Everything worked fine.
This happened instead
Replication was broken and the replica could not sync with my primary node because there were some inconsistencies in the database.
I ran the kanidmd database verify command and I got three error lines like these:
ERROR 🚨 [error]: Err(RefintNotUpheld(294))
ERROR 🚨 [error]: Err(RefintNotUpheld(294))
ERROR 🚨 [error]: Err(RefintNotUpheld(221))
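For context, this is roughly how I invoked it against the container image (the volume name and config path are from my deployment, so treat them as examples):

docker run --rm -i -t -v kanidmd:/data kanidm/server:1.7.4 /sbin/kanidmd database verify -c /data/server.toml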
One point to note is that I believe this command previously worked without any errors.
After checking the entries in the SQLite database, I saw that two of them came from one person and the other one was the idm_all_accounts dynamic group.
I finally fixed it in an ugly way. idm_all_accounts had one UUID that was not present in the database. I checked with kanidm raw search, kanidm person get, kanidm service-account get... It was not easy to delete. I couldn't remove it with kanidm raw modify, probably because of my lack of knowledge.
Also, investigating the person entry that produced the errors, I didn't see anything special there. The group UUIDs it referenced existed in the database and so on.
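For reference, this is roughly how I was poking at the entries (the raw filter syntax here is from memory, and the placeholders are mine):

kanidm raw search '{"eq": ["name", "idm_all_accounts"]}' --name idm_admin
kanidm person get <affected-person> --name idm_admin
kanidm raw search '{"eq": ["uuid", "<the-missing-uuid>"]}' --name idm_admin   # returned nothing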
Finally, the ugly workaround was to restore again from the backup, but first removing the failing UUID from the dynmember attribute of idm_all_accounts and removing all the OAuth2 sessions and tokens from the corrupted person entry.
After the restore with those manual modifications, everything worked and I was able to run my second replica and sync it with the cluster.
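Roughly, the sequence was as follows (the paths, file names and exact kanidmd subcommands here are from memory / my deployment, so take this as a sketch rather than canonical):

# 1. scale down to a single node and stop kanidmd
# 2. edit the backup json: drop the dangling uuid from idm_all_accounts' dynmember
#    and strip the oauth2 sessions/tokens from the affected person entry
# 3. restore the edited backup and verify it
kanidmd database restore -c /data/server.toml /backup/kanidm.backup.json
kanidmd database verify -c /data/server.toml
# 4. start the node again and re-create the second replica so it refreshes from this one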
Kanidm version details
- Output of kanidm(d) version: 1.7.4
- Are you running it in a container? If so, which image/tag?: 1.7.4
- If not a container, how'd you install it:
- Operating System / Version (On Unix please post the output of uname -a):
Any other comments
A bit off-topic but related: I would be delighted to work on improving any aspect related to replication. We have identified the lack of readiness probe, the consistency of backups could be an issue, a convenient way to perform vacuum in HA environments, online restores...
I would need some guidelines on how to implement them, but here I am.
There are two separate problems here.
Replication was broken and the replica could not sync with my primary node because there were some inconsistencies in the database.
Without seeing what was in the RUV it's hard to determine what the fault was. Normally the error tells you about what happened and if a node is advanced/lagged and that can help.
I ran the kanidmd database verify command and I got three error lines like these:
ERROR 🚨 [error]: Err(RefintNotUpheld(294))
ERROR 🚨 [error]: Err(RefintNotUpheld(294))
ERROR 🚨 [error]: Err(RefintNotUpheld(221))
This is due to an error in refint verify incorrectly including tombstones/conflict entries. In this case, you can safely ignore this error.
One point to note is that I believe this command previously worked without any errors.
After checking the entries in the SQLite database, I saw that two of them came from one person and the other one was the idm_all_accounts dynamic group.
I finally fixed it in an ugly way. idm_all_accounts had one UUID that was not present in the database. I checked with kanidm raw search, kanidm person get, kanidm service-account get... It was not easy to delete. I couldn't remove it with kanidm raw modify, probably because of my lack of knowledge.
These would have been conflict entries - they can be ignored.
Any other comments
A bit off-topic but related: I would be delighted to work on improving any aspect related to replication. We have identified the lack of readiness probe,
This one is hard to add, and will require a lot of in depth work to add to the status check.
the consistency of backups could be an issue,
The current backup json files already include all replication metadata required for a restore. It was designed so that if you have two nodes A and B, and you restore B, then even though you restored from an older backup, B will automatically "catch up" to A.
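i.e. the flow is roughly this (paths here are illustrative):

# restore node B from an older backup while it is stopped
kanidmd database restore -c /data/server.toml /backup/kanidm.backup.json
# start node B again - because the backup keeps the replication metadata,
# B asks A for the changes it is missing and converges automatically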
a convenient way to perform vacuum in HA environments,
The "vacuum" command today is only there to perform a vacuum of the underlying storage engine (which in turn influences page size of the DB). You almost never need to run this command, so I think this isn't important to add.
online restores...
Probably won't add online restores - restores are for a disaster, not something we want to be doing all the time ....
I would need some guidelines on how to implement them, but here I am.
I'll have to send you some info about how replication works to help then :)
A bit rough, but this explains replication and how it works https://github.com/kanidm/kanidm/blob/master/book/src/developers/designs/replication_design_and_notes.md
This is due to an error in refint verify incorrectly including tombstones/conflict entries. In this case, you can safely ignore this error.
Actually, looks like it's not an error in the verify code. I'd need to see the affected entries to work out more about why this is happening, as we should be trimming references even on conflicted entries.
Without seeing what was in the RUV it's hard to determine what the fault was. Normally the error tells you about what happened and if a node is advanced/lagged and that can help.
That's easy, let's check the logs:
1762516107057 2025-11-07T11:48:27.057Z d56b12f0-6df0-426d-8e2b-bd2fe0c9ce28 INFO request [ 24.4µs | 100.00% ] method: GET | uri: /status | version: HTTP/1.1
1762516107057 2025-11-07T11:48:27.057Z d56b12f0-6df0-426d-8e2b-bd2fe0c9ce28 INFO ┕━ ｉ [info]: | connection_addr: 10.0.4.252:56912 | client_ip_addr: 10.0.4.252
1762516107726 2025-11-07T11:48:27.726Z 9710f0bb-ff92-44cd-8df1-c6e356ef17a2 INFO ｉ IMMEDIATE ｉ repl_run_consumer > consumer_apply_refresh_v1 > reindex > System reindex: started - this may take a long time!
1762516107749 2025-11-07T11:48:27.749Z 9710f0bb-ff92-44cd-8df1-c6e356ef17a2 INFO ｉ IMMEDIATE ｉ repl_run_consumer > consumer_apply_refresh_v1 > reindex > System reindex: started - this may take a long time!
1762516107784 2025-11-07T11:48:27.784Z 9710f0bb-ff92-44cd-8df1-c6e356ef17a2 INFO repl_run_consumer [ 114ms | 38.12% / 100.00% ] sock_addrs: [10.0.1.78:8444] | automatic_refresh: true | eventid: "1096f3ea-c6a7-45ad-94b8-dcb0818b3d03" | server_name: kanidm-default-0.kanidm
1762516107784 2025-11-07T11:48:27.784Z 9710f0bb-ff92-44cd-8df1-c6e356ef17a2 ERROR ┝━ 🚨 [error]: Unable to proceed with consumer incremental - the supplier has indicated that our domain_uuid's are not equivalent. This can occur when adding a new consumer to an existing topology.
1762516107784 2025-11-07T11:48:27.784Z 9710f0bb-ff92-44cd-8df1-c6e356ef17a2 ERROR ┝━ 🚨 [error]: This server's content must be refreshed to proceed. If you have configured automatic refresh, this will occur shortly.
1762516107784 2025-11-07T11:48:27.784Z 9710f0bb-ff92-44cd-8df1-c6e356ef17a2 WARN ┝━ 🚧 [warn]: Consumer is out of date and must be refreshed. This will happen *now*.
1762516107784 2025-11-07T11:48:27.784Z 9710f0bb-ff92-44cd-8df1-c6e356ef17a2 INFO ┝━ consumer_apply_refresh_v1 [ 70.5ms | 40.71% / 61.88% ]
1762516107784 2025-11-07T11:48:27.784Z 9710f0bb-ff92-44cd-8df1-c6e356ef17a2 INFO ┊ ┝━ reindex [ 7.40ms | 6.50% ]
1762516107784 2025-11-07T11:48:27.784Z 9710f0bb-ff92-44cd-8df1-c6e356ef17a2 INFO ┊ ┊ ┝━ ｉ [info]: System reindex: started - this may take a long time!
1762516107784 2025-11-07T11:48:27.784Z 9710f0bb-ff92-44cd-8df1-c6e356ef17a2 INFO ┊ ┊ ┝━ ｉ [info]: Reindexed 0 entries
1762516107784 2025-11-07T11:48:27.784Z 9710f0bb-ff92-44cd-8df1-c6e356ef17a2 INFO ┊ ┊ ┝━ ｉ [info]: Optimising Indexes: started
1762516107784 2025-11-07T11:48:27.784Z 9710f0bb-ff92-44cd-8df1-c6e356ef17a2 INFO ┊ ┊ ┝━ ｉ [info]: Optimising Indexes: complete ✅
1762516107784 2025-11-07T11:48:27.784Z 9710f0bb-ff92-44cd-8df1-c6e356ef17a2 INFO ┊ ┊ ┝━ ｉ [info]: Calculating Index Optimisation Slopes: started
1762516107784 2025-11-07T11:48:27.784Z 9710f0bb-ff92-44cd-8df1-c6e356ef17a2 INFO ┊ ┊ ┝━ ｉ [info]: Calculating Index Optimisation Slopes: complete ✅
1762516107784 2025-11-07T11:48:27.784Z 9710f0bb-ff92-44cd-8df1-c6e356ef17a2 INFO ┊ ┊ ┕━ ｉ [info]: System reindex: complete 🎉
1762516107784 2025-11-07T11:48:27.784Z 9710f0bb-ff92-44cd-8df1-c6e356ef17a2 INFO ┊ ┝━ reindex [ 16.7ms | 14.68% ]
1762516107784 2025-11-07T11:48:27.784Z 9710f0bb-ff92-44cd-8df1-c6e356ef17a2 INFO ┊ ┊ ┝━ ｉ [info]: System reindex: started - this may take a long time!
1762516107784 2025-11-07T11:48:27.784Z 9710f0bb-ff92-44cd-8df1-c6e356ef17a2 INFO ┊ ┊ ┝━ ｉ [info]: Reindexed 0 entries
1762516107784 2025-11-07T11:48:27.784Z 9710f0bb-ff92-44cd-8df1-c6e356ef17a2 INFO ┊ ┊ ┝━ ｉ [info]: Optimising Indexes: started
1762516107784 2025-11-07T11:48:27.784Z 9710f0bb-ff92-44cd-8df1-c6e356ef17a2 INFO ┊ ┊ ┝━ ｉ [info]: Optimising Indexes: complete ✅
1762516107784 2025-11-07T11:48:27.784Z 9710f0bb-ff92-44cd-8df1-c6e356ef17a2 INFO ┊ ┊ ┝━ ｉ [info]: Calculating Index Optimisation Slopes: started
1762516107784 2025-11-07T11:48:27.784Z 9710f0bb-ff92-44cd-8df1-c6e356ef17a2 INFO ┊ ┊ ┝━ ｉ [info]: Calculating Index Optimisation Slopes: complete ✅
1762516107784 2025-11-07T11:48:27.784Z 9710f0bb-ff92-44cd-8df1-c6e356ef17a2 INFO ┊ ┊ ┕━ ｉ [info]: System reindex: complete 🎉
1762516107784 2025-11-07T11:48:27.784Z 9710f0bb-ff92-44cd-8df1-c6e356ef17a2 WARN ┊ ┝━ 🚧 [warn]: Using domain uuid from the database idm.grigri.cloud - was idm.grigri.cloud in memory | event_tag_id: 2
1762516107784 2025-11-07T11:48:27.784Z 9710f0bb-ff92-44cd-8df1-c6e356ef17a2 ERROR ┊ ┝━ 🚨 [error]: some uuids that were referenced in this operation do not exist.
1762516107784 2025-11-07T11:48:27.784Z 9710f0bb-ff92-44cd-8df1-c6e356ef17a2 ERROR ┊ ┝━ 🚨 [error]: | missing: b2646e11-d7a9-483f-93b2-3d36dda941c7
1762516107784 2025-11-07T11:48:27.784Z 9710f0bb-ff92-44cd-8df1-c6e356ef17a2 ERROR ┊ ┝━ 🚨 [error]: Refresh operation failed (post_repl_refresh plugin), Plugin(ReferentialIntegrity("Uuid referenced not found in database")) | event_tag_id: 1
1762516107784 2025-11-07T11:48:27.784Z 9710f0bb-ff92-44cd-8df1-c6e356ef17a2 ERROR ┊ ┕━ 🚨 [error]: Failed to refresh schema entries | err: Plugin(ReferentialIntegrity("Uuid referenced not found in database"))
1762516107784 2025-11-07T11:48:27.784Z 9710f0bb-ff92-44cd-8df1-c6e356ef17a2 ERROR ┕━ 🚨 [error]: consumer was not able to apply refresh. | err: Plugin(ReferentialIntegrity("Uuid referenced not found in database"))
This is due to an error in refint verify incorrectly including tombstones/conflict entries. In this case, you can safely ignore this error.
Ok, so if I understand correctly, that is the bug. The check is incorrect and that affects both replication and the kanidmd database verify command.
These would have been conflict entries - they can be ignored.
I had to clean them to fix the problem. What other alternatives are there?
This one is hard to add, and will require a lot of in depth work to add to the status check.
Agreed. Indeed, thinking deeply about that topic, I don't know if there is much room for improvement without adding some kind of heartbeat mechanism between replicas.
The current backup json files already include all replication metadata required for a restore. It was designed so that if you have two nodes A and B, and you restore B, then even though you restored from an older backup, B will automatically "catch up" to A.
Got it, it is a bug with the database checks. In fact, my data was completely fine and unaffected.
The "vacuum" command today is only there to perform a vacuum of the underlying storage engine (which in turn influences page size of the DB). You almost never need to run this command, so I think this isn't important to add.
Ah, OK. Good to know. It was not clear to me how important it was.
Probably won't add online restores - restores are for a disaster, not something we want to be doing all the time ....
I'm not so sure here. I was thinking that an IdP's most important property is availability, and we have eventual consistency, so restoring a previous backup in place (imagine that you deleted some entries that affect just a group of users) would make that kind of operation easier in a complex deployment.
I'll have to send you some info about how replication works to help then :)
I know that this is a complex topic, and I'm not sure about the best architecture; I haven't thought about it too much. Something like incorporating a coordinator/arbiter/witness could elevate the design to the next level in terms of resilience and ease of deployment. The current architecture is solid, but I'm missing some functionality to operate it in a standard way in high-availability environments.
And I'm not talking about a coordinator that just renews the certificates. That's easy to implement; I mean something more on the side of proxying requests to the right instances when instances are lagging or there is a split brain. There are complex scenarios that are not fully covered by the current functionality.
Probably, this is not a priority right now but it is a topic where I could contribute.
That's easy, let's check the logs:
1762516107784 2025-11-07T11:48:27.784Z 9710f0bb-ff92-44cd-8df1-c6e356ef17a2 ERROR ┊ ┝━ 🚨 [error]: some uuids that were referenced in this operation do not exist.
1762516107784 2025-11-07T11:48:27.784Z 9710f0bb-ff92-44cd-8df1-c6e356ef17a2 ERROR ┊ ┝━ 🚨 [error]: | missing: b2646e11-d7a9-483f-93b2-3d36dda941c7
1762516107784 2025-11-07T11:48:27.784Z 9710f0bb-ff92-44cd-8df1-c6e356ef17a2 ERROR ┕━ 🚨 [error]: consumer was not able to apply refresh. | err: Plugin(ReferentialIntegrity("Uuid referenced not found in database"))
So this is telling you that there is a data issue in the incoming data. So on the source server, you need to investigate that to determine why there was a refint error there. I'm worried something is "off" here, as refint is really strict in how it works, so the fact an incoming server has a potential data issue is a big concern to me.
This is due to an error in refint verify incorrectly including tombstones/conflict entries. In this case, you can safely ignore this error.
Ok, so if I understand correctly, that is the bug. The check is incorrect and that affects both replication and the kanidmd database verify command.
These would have been conflict entries - they can be ignored.
I had to clean them to fix the problem. What other alternatives are there?
Leave them be, they are automatically removed in future. Alternately there is a hidden "db quarantine" command that can isolate entries, but it can have uhhh ... side effects. I would normally advise that you just leave conflict entries alone as they are automatically removed in time.
EDIT: In time meaning after 7 days by default.
This one is hard to add, and will require a lot of in depth work to add to the status check.
Agreed. Indeed, thinking deeply about that topic, I don't know if there is much room for improvement without adding some kind of heartbeat mechanism between replicas.
There already is a heartbeat between them; it's more about reporting that status up to the status handler in a way that is actionable.
Probably won't add online restores - restores are for a disaster, not something we want to be doing all the time ....
I'm not so sure here. I was thinking that an IdP's most important property is availability, and we have eventual consistency, so restoring a previous backup in place (imagine that you deleted some entries that affect just a group of users) would make that kind of operation easier in a complex deployment.
You know Kanidm has a recycle bin right? So you can restore entries that you deleted :)
Plus you have another issue.
Let's say you had two nodes A and B. You delete these important entries on server A. Then replication between A -> B occurs, propagating that delete.
You realise and now think "oh no" and restore A from backup. What happens? B will replicate the delete of those entries back to A.
So in this case, you would be restoring A and then having to reset every server in your topology to restore the data. That's a hugely invasive process and not one that would be uhhh fun to undertake.
I think the recyclebin may be a bit easier here.
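For reference, the recycle bin is driven from the cli - roughly like this (from memory, check kanidm recycle-bin --help for the exact arguments):

kanidm recycle-bin list --name idm_admin
kanidm recycle-bin get <id> --name idm_admin
kanidm recycle-bin revive <id> --name idm_admin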
I'll have to send you some info about how replication works to help then :)
I know that this is a complex topic, and I'm not sure about the best architecture; I haven't thought about it too much. Something like incorporating a coordinator/arbiter/witness could elevate the design to the next level in terms of resilience and ease of deployment. The current architecture is solid, but I'm missing some functionality to operate it in a standard way in high-availability environments.
Yes, that's been a goal for a while, but it's been shelved for now just due to other priorities.
And I'm not talking about a coordinator that just renews the certificates. That's easy to implement; I mean something more on the side of proxying requests to the right instances when instances are lagging or there is a split brain. There are complex scenarios that are not fully covered by the current functionality.
Probably, this is not a priority right now but it is a topic where I could contribute.
I think understanding how replication in Kanidm works first will be important, as Kanidm doesn't really get split brain the same way as other systems do. You also don't need to do anything to proxy requests; there is already support in the replication mechanism so that indirect nodes can replicate to each other via connected ones, e.g. A <-> B <-> C will be consistent and A and C will get all of each other's changes via B.
The concept of "lag" and "advanced" on nodes indicates when you have a node that's been disconnected for more than 7 days. If nodes are disconnected for a few hours, they automatically catch up. Heck you can stop a node for 5 days, turn it back on and it'll come back no problem. So again, you don't need to proxy anything or do anything, you just need to restore connectivity so that your node can see at least one other node. Then everything will work.
So I think perhaps understanding more about how replication in Kanidm works will help you, because the problems you are thinking of are chronic in other systems, but not in Kanidm. Our replication is far closer to Active Directory or 389-ds - we are not like pgsql or mysql in that way.
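As a sketch, a chained A <-> B <-> C topology is just each node listing its neighbour(s) as partners in server.toml - something like this on node B (hostnames and certs here are placeholders):

[replication]
origin = "repl://kanidm-b.example.com:8444"
bindaddress = "[::]:8444"

[replication."repl://kanidm-a.example.com:8444"]
type = "mutual-pull"
partner_cert = "<cert of node A>"

[replication."repl://kanidm-c.example.com:8444"]
type = "mutual-pull"
partner_cert = "<cert of node C>"

A and C never talk to each other directly, yet they still converge via B.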
So this is telling you that there is a data issue in the incoming data. So on the source server, you need to investigate that to determine why there was a refint error there. I'm worried something is "off" here, as refint is really strict in how it works, so the fact an incoming server has a potential data issue is a big concern to me.
I understand that the refint error is the RefintNotUpheld errors that we saw in the database verification. So, my question here is why I ended up with those invalid references and why they are a problem. If, as you said, it is just a verification problem because tombstones are being taken into account, then we could fix that and everything should work.
Leave them be, they are automatically removed in future. Alternately there is a hidden "db quarantine" command that can isolate entries, but it can have uhhh ... side effects. I would normally advise that you just leave conflict entries alone as they are automatically removed in time. EDIT: In time meaning after 7 days by default.
If I did that, I assume that my replication will stay broken until it is fixed, and taking into account that I have a load balancer on top of the replicas, it will mean that half of the clients will find an empty database. I don't fully understand your point, because the key for me is that everything should work as expected: I expect that replication works and that the upgrade checks pass so I'm able to upgrade the cluster.
You know Kanidm has a recycle bin right? So you can restore entries that you deleted :)
Right, I've never used it, but I will have to investigate it further, because in my case the problem was that I deleted all the groups and then recreated them with wrong values.
About the replication topic, we could probably move the discussion to Matrix to keep this issue focused. Now I'm understanding better the implications of the RUV and eventual consistency.
So this is telling you that there is a data issue in the incoming data. So on the source server, you need to investigate that to determine why there was a refint error there. I'm worried something is "off" here, as refint is really strict in how it works, so the fact an incoming server has a potential data issue is a big concern to me.
I understand that the refint error is the RefintNotUpheld errors that we saw in the database verification. So, my question here is why I ended up with those invalid references and why they are a problem. If, as you said, it is just a verification problem because tombstones are being taken into account, then we could fix that and everything should work.
Refint over replication might be one of the most complex topics in the server. I reviewed the case and it's actually a flaw in the incoming data - you can tell from the log: it's telling you "hey, this isn't right, the incoming data has a consistency flaw that should be impossible". That's why it stops.
So we need to look at the supplier you are refreshing from. That way we can work out why refint is unhappy on that side. We need to see the other side to understand what entries are in that state.
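Something like this on the supplier side should show whether anything still references the missing uuid and whether the uuid itself resolves (filter syntax from memory, uuid taken from your log; swap member for dynmember or whichever reference attribute is involved):

kanidm raw search '{"eq": ["member", "b2646e11-d7a9-483f-93b2-3d36dda941c7"]}' --name idm_admin
kanidm raw search '{"eq": ["uuid", "b2646e11-d7a9-483f-93b2-3d36dda941c7"]}' --name idm_admin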
Leave them be, they are automatically removed in future. Alternately there is a hidden "db quarantine" command that can isolate entries, but it can have uhhh ... side effects. I would normally advise that you just leave conflict entries alone as they are automatically removed in time. EDIT: In time meaning after 7 days by default.
If I did that, I assume that my replication will stay broken until it is fixed, and taking into account that I have a load balancer on top of the replicas, it will mean that half of the clients will find an empty database. I don't fully understand your point, because the key for me is that everything should work as expected: I expect that replication works and that the upgrade checks pass so I'm able to upgrade the cluster.
No, that's not the case. Conflict entries are created on ... well, conflicts. A conflict entry is a trade-off that causes the entry to be "deleted", but in a way that you could recover it if you so choose. During normal operation, conflict entries, if they are created, do not harm or impact replication - in fact they are required so replication can keep working!!!
As two simple (and the word simple here is doing overtime in that sentence ...) examples. Imagine on server A you create an entry with UUID=X. Then on server B you create an entry with the same UUID=X. Now replication happens. Which one should the server keep as truth? UUIDs are the key replication identifier, and we can't keep both.
In this case, we take the one from Server A because it was created first. Then the version from B becomes a conflict entry.
Now what about two people who both change their usernames to the same thing? We require unique usernames in Kanidm, so what now? Well, due to reasons that are somewhat complex, we don't know who had the username first and who it should by rights belong to. As a result we put both entries into the conflict state. Especially since this could be an attempt at name poaching and account theft.
Both of these are exceptionally rare cases, but that's the hard part of replication: these near-impossible corner cases are the most important thing to handle.
You know Kanidm has a recycle bin right? So you can restore entries that you deleted :)
Right, I've never used it, but I will have to investigate it further, because in my case the problem was that I deleted all the groups and then recreated them with wrong values.
The recycle bin is "best effort", and one thing that can be affected in that process is group memberships. It is, however, normally very good at restoring memberships where possible, but that isn't always true.
Now for reference, conflict entries are put into the recycle bin with an extra marker. So you can restore conflicted entries out of the recyclebin in these cases.
So yeah, recycle bin is likely better than "restore from backup" due to the reasons I mentioned.
About the replication topic, we could probably move the discussion to Matrix to keep this issue focused. Now I'm understanding better the implications of the RUV and eventual consistency.
It's a big topic :) seriously you could probably get a phd for it.
Let's try to focus on the problem again from scratch:
- Run a cluster with 2 replicas in active-active mode and use both of them.
- Do regular online backups with the backup schedule option (roughly the config sketched below).
- Restore one of those backups.
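For clarity, by "backup schedule option" I mean the standard online backup block in server.toml (the values here are just my example):

[online_backup]
path = "/backup/"
schedule = "00 22 * * *"
versions = 7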
I tried again with the backup from yesterday and it is even worse: Kanidm is not able to start:
00000000-0000-0000-0000-000000000000 INFO ｉ [info]: Starting kanidm with configuration: address: 0.0.0.0:8443, domain: idm.grigri.cloud, ldap address: disabled, origin: https://idm.grigri.cloud/ admin bind p
00000000-0000-0000-0000-000000000000 ERROR 🚨 [error]: from_dbentry failed | e: InvalidValueState
00000000-0000-0000-0000-000000000000 ERROR 🚨 [error]: get_identry failed | event_tag_id: 1 | e: CorruptedEntry(171)
00000000-0000-0000-0000-000000000000 ERROR 🚨 [error]: Failed to reload ruv | event_tag_id: 1 | e: CorruptedEntry(171)
00000000-0000-0000-0000-000000000000 ERROR 🚨 [error]: Failed to setup BE -> CorruptedEntry(171)
00000000-0000-0000-0000-000000000000 ERROR 🚨 [error]: Failed to start server core!
Logging pipeline completed shutdown
This is informing you that the backup you restored from is corrupted in some manner.
You may find it far easier to just set up the new server and refresh it from the other replica.
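i.e. rather than restoring from the corrupted file, point the new (empty) node at the surviving replica and let it pull a full refresh - roughly like this in the new node's server.toml (hostnames and cert are placeholders):

[replication]
origin = "repl://kanidm-default-1.kanidm:8444"
bindaddress = "[::]:8444"

[replication."repl://kanidm-default-0.kanidm:8444"]
type = "mutual-pull"
partner_cert = "<cert of the surviving node>"
# allow this empty node to be wiped and refreshed from the partner
automatic_refresh = true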