State of the consistency checker
Preface
First of all: sorry for this lengthy report. I have some information to share and some questions, and my current knowledge is just not sufficient to be more precise or to create proper bug reports. I nevertheless hope that this report is helpful.
I have been playing with the consistency checker lately, as proposed in https://github.com/clyso/chorus/pull/87#issuecomment-2931561201
I have already reported #89 and #90, and I stumbled upon some more issues.
I realized that I am not 100% sure about the current state of the consistency checker. I neither know what is expected to work, nor if I use the consistency checker in the intended way. So I decided to just report my observations instead of creating bug reports without proper reproducers and without being sure that the issue is really a bug and not a mistake or misconception on my side.
My observations are only semi-reproducible: I have observed all of these issues several times, but I have not yet created reliable reproducers.
All of my tests currently run on a standalone chorus instance within k8s. I have also observed most of the issues on a local instance, and I am quite sure that k8s is not the problem. I am currently running the code from commit 6400d01. I have pull requests #83, #84, #85, #86, and #87 applied, but none of the patches touches code that looks relevant to the consistency check, and I have observed most of the behavior in the upstream version as well.
I am using a cloudian-s3 product as the source and a ceph cluster with radosgw as the destination. The exact versions are unknown to me, but I don't think the storage products are relevant here.
No errors were reported in the chorus log.
Observations
False negative check results
- empty bucket "chorustest05" in storage "source" with chorus replication to "chorustest05dst" in storage "destination"
- upload 128 small objects via the chorus proxy
- add a second replication to "chorustest05dst02" in "destination"
root@s3float-01:~/data# chorctl -a keel.mgmt.k8s.s3float:443 --config /root/blatter/chorctl.yaml repl
NAME PROGRESS SIZE OBJECTS EVENTS PAUSED AGE HAS_SWITCH
user1:chorustest05:source->destination:chorustest05dst02 [##########] 100.0 % 722 B/722 B 128/128 0/0 false 2m false
user1:chorustest05:source->destination:chorustest05dst [ ] 0.0 % 0 B/0 B 0/0 128/128 false 51m false
- start consistency checks (all three pairwise combinations plus one check of all three locations together)
root@s3float-01:~/data# chorctl -a keel.mgmt.k8s.s3float:443 --config /root/blatter/chorctl.yaml consistency check source:chorustest05 destination:chorustest05dst -u user1
Consistency check has been created.
root@s3float-01:~/data# chorctl -a keel.mgmt.k8s.s3float:443 --config /root/blatter/chorctl.yaml consistency check source:chorustest05 destination:chorustest05dst02 -u user1
Consistency check has been created.
root@s3float-01:~/data# chorctl -a keel.mgmt.k8s.s3float:443 --config /root/blatter/chorctl.yaml consistency check source:chorustest05 destination:chorustest05dst destination:chorustest05dst02 -u user1
Consistency check has been created.
root@s3float-01:~/data# chorctl -a keel.mgmt.k8s.s3float:443 --config /root/blatter/chorctl.yaml consistency check destination:chorustest05dst destination:chorustest05dst02 -u user1
Consistency check has been created.
- check consistency report
root@s3float-01:~/data# chorctl -a keel.mgmt.k8s.s3float:443 --config /root/blatter/chorctl.yaml consistency report source:chorustest05 destination:chorustest05dst -u user1
ID:\t
READY:\ttrue
QUEUED:\t2
COMPLETED:\t2
CONSISTENT:\ttrue
root@s3float-01:~/data# chorctl -a keel.mgmt.k8s.s3float:443 --config /root/blatter/chorctl.yaml consistency report source:chorustest05 destination:chorustest05dst02 -u user1
ID:\t
READY:\ttrue
QUEUED:\t2
COMPLETED:\t2
CONSISTENT:\ttrue
root@s3float-01:~/data# chorctl -a keel.mgmt.k8s.s3float:443 --config /root/blatter/chorctl.yaml consistency report destination:chorustest05dst destination:chorustest05dst02 -u user1
ID:\t
READY:\ttrue
QUEUED:\t2
COMPLETED:\t2
CONSISTENT:\tfalse
PATH ETAG destination destination
object0000000 69ae5be021087993b62a1ffbdfefd31e ✓ ✓
object0000001 13ba32e623e0b1e706797140a1a662b3 ✓ ✓
object0000002 26e41e2047e361fd54e4d19ec39375eb ✓ ✓
object0000003 e7af1d53d57dfb0780021d0b16b2c57f ✓ ✓
...
Even though chorustest05 is consistent with both chorustest05dst and chorustest05dst02, the latter two buckets are reported as not being consistent with each other.
Unfiltered report
In my previous tests, the consistency report included just the report header if the check was positive, and an additional list of inconsistencies if the check was negative. I assume that this is the intended behavior.
Sometimes, as in the example above, the report includes a list of all objects, including the matching ones. I have shortened the output, but the report included all 128 consistent objects.
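For illustration, this is roughly the filtering I would expect in the report, written as a small Go sketch with made-up types and names (this is not the actual chorus code):

package consistency

// Hypothetical report entry: one checked object and, per storage,
// whether it is present with a matching ETag.
type reportEntry struct {
	Path    string
	ETag    string
	Present map[string]bool // storage name -> object present with matching ETag
}

// onlyInconsistent keeps the entries that are missing or mismatched in at
// least one of the checked storages; fully matching rows are dropped.
func onlyInconsistent(entries []reportEntry, storages []string) []reportEntry {
	var out []reportEntry
	for _, e := range entries {
		for _, s := range storages {
			if !e.Present[s] {
				out = append(out, e)
				break
			}
		}
	}
	return out
}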
This is not specific to the false-negative reports. It can also be observed when a real inconsistency is provoked. Here I intentionally removed one object from the destinations, purged the results, and re-ran the check:
root@s3float-01:~# chorctl -a keel.mgmt.k8s.s3float:443 --config /root/blatter/chorctl.yaml consistency report source:chorustest05 destination:chorustest05dst destination:chorustest05dst02 -u user1 | head -n15
ID:\t
READY:\ttrue
QUEUED:\t3
COMPLETED:\t3
CONSISTENT:\tfalse
PATH ETAG destination destination source
object0000000 69ae5be021087993b62a1ffbdfefd31e ✓ ✓ ✓
object0000001 13ba32e623e0b1e706797140a1a662b3 ✓ ✓ ✓
object0000002 26e41e2047e361fd54e4d19ec39375eb X X ✓
object0000003 e7af1d53d57dfb0780021d0b16b2c57f ✓ ✓ ✓
object0000004 760f584193478f3ecb3ebc529e0f4835 ✓ ✓ ✓
object0000005 8efac87256f663b7440a331203553c52 ✓ ✓ ✓
object0000006 34628cdef4f0ca3f6cd6aefdf117cb55 ✓ ✓ ✓
object0000007 a8e63bec37ea3991d95c1bdd875418fe ✓ ✓ ✓
...
Information hiding
After provoking an inconsistency in one of the two destinations, the 3-way check reports the buckets as inconsistent, but hides the information about what exactly is wrong: "object0000002" is missing in one of the destinations.
Clean up:
root@s3float-01:~# chorctl -a keel.mgmt.k8s.s3float:443 --config /root/blatter/chorctl.yaml consistency purge source:chorustest05 destination:chorustest05dst destination:chorustest05dst02 -u user1 | head -n15
Consistency check source:chorustest05 scheduled for deletion.
Deletion will be completed asynchronously.
Consistency check destination:chorustest05dst scheduled for deletion.
Deletion will be completed asynchronously.
Consistency check destination:chorustest05dst02 scheduled for deletion.
Deletion will be completed asynchronously.
root@s3float-01:~# chorctl -a keel.mgmt.k8s.s3float:443 --config /root/blatter/chorctl.yaml consistency purge source:chorustest05 destination:chorustest05dst -u user1 | head -n15
Consistency check source:chorustest05 scheduled for deletion.
Deletion will be completed asynchronously.
Consistency check destination:chorustest05dst scheduled for deletion.
Deletion will be completed asynchronously.
root@s3float-01:~# chorctl -a keel.mgmt.k8s.s3float:443 --config /root/blatter/chorctl.yaml consistency purge source:chorustest05 destination:chorustest05dst02 -u user1 | head -n15
Consistency check source:chorustest05 scheduled for deletion.
Deletion will be completed asynchronously.
Consistency check destination:chorustest05dst02 scheduled for deletion.
Deletion will be completed asynchronously.
Re-check:
root@s3float-01:~# chorctl -a keel.mgmt.k8s.s3float:443 --config /root/blatter/chorctl.yaml consistency check source:chorustest05 destination:chorustest05dst destination:chorustest05dst02 -u user1 | head -n15
Consistency check has been created.
root@s3float-01:~# chorctl -a keel.mgmt.k8s.s3float:443 --config /root/blatter/chorctl.yaml consistency check source:chorustest05 destination:chorustest05dst02 -u user1 | head -n15
Consistency check has been created.
root@s3float-01:~# chorctl -a keel.mgmt.k8s.s3float:443 --config /root/blatter/chorctl.yaml consistency check source:chorustest05 destination:chorustest05dst -u user1 | head -n15
Consistency check has been created.
Check Reports:
root@s3float-01:~# chorctl -a keel.mgmt.k8s.s3float:443 --config /root/blatter/chorctl.yaml consistency report source:chorustest05 destination:chorustest05dst -u user1 | head -n15
ID:\t
READY:\ttrue
QUEUED:\t2
COMPLETED:\t2
CONSISTENT:\ttrue
root@s3float-01:~# chorctl -a keel.mgmt.k8s.s3float:443 --config /root/blatter/chorctl.yaml consistency report source:chorustest05 destination:chorustest05dst02 -u user1 | head -n15
ID:\t
READY:\ttrue
QUEUED:\t2
COMPLETED:\t2
CONSISTENT:\tfalse
PATH ETAG destination source
object0000002 26e41e2047e361fd54e4d19ec39375eb X ✓
root@s3float-01:~# chorctl -a keel.mgmt.k8s.s3float:443 --config /root/blatter/chorctl.yaml consistency report source:chorustest05 destination:chorustest05dst destination:chorustest05dst02 -u user1 | head -n15
ID:\t
READY:\ttrue
QUEUED:\t3
COMPLETED:\t3
CONSISTENT:\tfalse
PATH ETAG destination destination source
object0000000 69ae5be021087993b62a1ffbdfefd31e ✓ ✓ ✓
object0000001 13ba32e623e0b1e706797140a1a662b3 ✓ ✓ ✓
object0000002 26e41e2047e361fd54e4d19ec39375eb ✓ ✓ ✓
object0000003 e7af1d53d57dfb0780021d0b16b2c57f ✓ ✓ ✓
object0000004 760f584193478f3ecb3ebc529e0f4835 ✓ ✓ ✓
object0000005 8efac87256f663b7440a331203553c52 ✓ ✓ ✓
object0000006 34628cdef4f0ca3f6cd6aefdf117cb55 ✓ ✓ ✓
object0000007 a8e63bec37ea3991d95c1bdd875418fe ✓ ✓ ✓
...
The relevant information (details about the inconsistency) is missing in the 3-way report.
False positives
Even worse, the consistency check sometimes reports false positives if a check is re-run without purging. I have never observed this in a flat bucket, but it happens when an inconsistency is provoked at a leaf of a directory tree in a not-too-small bucket. (I don't know exactly when the misbehavior starts; my current test bucket stores about 12 GiB in roughly 1600 objects, distributed over 240 directories.)
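For context, my test data is laid out roughly like the following sketch (illustrative only, written with the AWS SDK for Go v2; this is not my actual generator, and the bucket name, key pattern, counts, and object sizes are just approximations of the tree04/03/file_new_120-style layout):

package main

import (
	"bytes"
	"context"
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

func main() {
	ctx := context.Background()
	// Credentials and endpoint come from the environment here; in my tests
	// the uploads went through the chorus proxy instead.
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}
	client := s3.NewFromConfig(cfg)

	// Two directory levels with a handful of files per leaf directory,
	// i.e. keys like tree04/03/file_new_005.
	for top := 0; top < 16; top++ {
		for sub := 0; sub < 15; sub++ { // 16*15 = 240 leaf directories
			for file := 0; file < 7; file++ { // ~1700 objects in total
				key := fmt.Sprintf("tree%02d/%02d/file_new_%03d", top, sub, file)
				body := bytes.Repeat([]byte("x"), 8<<20) // ~8 MiB each, ~12 GiB in total
				_, err := client.PutObject(ctx, &s3.PutObjectInput{
					Bucket: aws.String("chorustest01"),
					Key:    aws.String(key),
					Body:   bytes.NewReader(body),
				})
				if err != nil {
					log.Fatal(err)
				}
			}
		}
	}
}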
Start the replication and run a consistency check; the report results in consistent: true:
root@s3float-01:~/data# chorctl -a anarki.mgmt.k8s.s3float:443 --config /root/blatter/chorctl.yaml repl
NAME PROGRESS SIZE OBJECTS EVENTS PAUSED AGE HAS_SWITCH
user1:chorustest01:source->destination:chorustest01dst [##########] 100.0 % 8.8 GiB/8.8 GiB 755/755 0/0 false 23h55m false
root@s3float-01:~/data# chorctl -a anarki.mgmt.k8s.s3float:443 --config /root/blatter/chorctl.yaml consistency report source:chorustest01 destination:chorustest01dst -u user1
ID:\t
READY:\tfalse
QUEUED:\t0
COMPLETED:\t0
CONSISTENT:\tfalse
PATH ETAG
root@s3float-01:~/data# chorctl -a anarki.mgmt.k8s.s3float:443 --config /root/blatter/chorctl.yaml consistency check source:chorustest01 destination:chorustest01dst -u user1
Consistency check has been created.
root@s3float-01:~/data# chorctl -a anarki.mgmt.k8s.s3float:443 --config /root/blatter/chorctl.yaml consistency report source:chorustest01 destination:chorustest01dst -u user1 | head
ID:\t
READY:\ttrue
QUEUED:\t256
COMPLETED:\t256
CONSISTENT:\ttrue
Delete one object from the destination and re-run the check:
root@s3float-01:~/data# rclone delete destination:chorustest01dst/tree04/03/file_new_120
root@s3float-01:~/data# chorctl -a anarki.mgmt.k8s.s3float:443 --config /root/blatter/chorctl.yaml consistency check source:chorustest01 destination:chorustest01dst -u user1
Consistency check has been created.
root@s3float-01:~/data# chorctl -a anarki.mgmt.k8s.s3float:443 --config /root/blatter/chorctl.yaml consistency report source:chorustest01 destination:chorustest01dst -u user1 | head
ID:\t
READY:\ttrue
QUEUED:\t362
COMPLETED:\t362
CONSISTENT:\ttrue
The check reports both buckets as consistent.
Another observation is that only 106 additional items were queued and processed in the second run (362 in total, compared to 256 in the first run). (What does this number represent in the first place? The number of recursed directories? I could not figure it out.)
rclone check reports the inconsistency as expected:
root@s3float-01:~/data# rclone check source:chorustest01 destination:chorustest01dst
2025/06/12 11:17:06 ERROR : tree04/03/file_new_120: file not in S3 bucket chorustest01dst
2025/06/12 11:17:06 NOTICE: S3 bucket chorustest01dst: 1 files missing
2025/06/12 11:17:06 NOTICE: S3 bucket chorustest01dst: 1 differences found
2025/06/12 11:17:06 NOTICE: S3 bucket chorustest01dst: 1 errors while checking
2025/06/12 11:17:06 NOTICE: S3 bucket chorustest01dst: 754 matching files
2025/06/12 11:17:06 NOTICE: Failed to check: 1 differences found
I think that a proper consistency check should be able to handle re-running a test without purging the results. In an ideal world, the old check would be preserved and it would be possible to list and read all results which have not yet been purged.
Even if re-checking without purging is not the intended usage, I think that the API should warn or even fail if a user starts a conflicting check. Accepting the re-check and then reporting a false positive is not an option in my opinion.
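To make that expectation concrete, this is the kind of guard I have in mind, as a rough Go sketch with hypothetical names (the real chorus API and storage layer certainly look different):

package consistency

import (
	"context"
	"errors"
	"fmt"
)

// Hypothetical persistence layer for consistency check results.
type reportStore interface {
	ReportExists(ctx context.Context, id string) (bool, error)
	ScheduleCheck(ctx context.Context, id string) error
}

var errCheckExists = errors.New("consistency check for these locations already exists")

// startCheck refuses to start a new check while results for the same set of
// locations are still stored, instead of silently mixing the new run with
// the stale state.
func startCheck(ctx context.Context, id string, store reportStore) error {
	exists, err := store.ReportExists(ctx, id)
	if err != nil {
		return err
	}
	if exists {
		return fmt.Errorf("%w: purge it first (or let the API discard it explicitly)", errCheckExists)
	}
	return store.ScheduleCheck(ctx, id)
}

Automatically discarding the old results instead of returning an error would also be fine for me; the only thing that is not acceptable is the false positive.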
Questions
What is the exact meaning of the queued and completed values in the report?
Do you agree that my observations indicate that something is wrong with the consistency checker, or am I using the checker completely wrong?
If I am using it wrong: Can I find some documentation about the intended usage of the consistency checker?
If I am not using it wrong: Would you consider picking my fixed chorctl check from #87?
I am aware of the substantial resource usage of rclone check, but if enough memory and CPU are available, it does its job. I don't want to be harsh, but from my current point of view, the new consistency check API is not a real alternative.
Do you agree that my observations indicate that something is wrong with the consistency checker, or am I using the checker completely wrong?
The consistency checker was introduced to the project not long ago. We haven't received any feedback on it before, it has some TODOs, and, indeed, it may have some flaws, especially in its CLI. At the moment, input validation on its side is pretty minimal.
What is the exact meaning of the queued and completed values in the report?
That is the number of bucket traversal jobs. You can check how they are scheduled in consistency_check_handlers.go.
The IncrementConsistencyCheckScheduledCounter call happens before each new job is put into the queue.
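Roughly, the pattern looks like this (a simplified, standalone sketch: IncrementConsistencyCheckScheduledCounter is the only name taken from the real code, and even its signature is guessed here; everything else is made up for illustration):

package consistency

import "context"

type traversalJob struct {
	CheckID string
	Storage string
	Prefix  string
}

type queue interface {
	Enqueue(ctx context.Context, job traversalJob) error
}

type counters interface {
	// Real call name from consistency_check_handlers.go; signature guessed.
	IncrementConsistencyCheckScheduledCounter(ctx context.Context, checkID string) error
	// Hypothetical completion counterpart.
	IncrementConsistencyCheckCompletedCounter(ctx context.Context, checkID string) error
}

// The QUEUED value in the report grows here, right before a bucket
// traversal job is put into the queue.
func scheduleTraversal(ctx context.Context, job traversalJob, q queue, c counters) error {
	if err := c.IncrementConsistencyCheckScheduledCounter(ctx, job.CheckID); err != nil {
		return err
	}
	return q.Enqueue(ctx, job)
}

// COMPLETED presumably grows when a traversal job finishes, and the report
// presumably becomes READY once all queued jobs have completed.
func finishTraversal(ctx context.Context, checkID string, c counters) error {
	return c.IncrementConsistencyCheckCompletedCounter(ctx, checkID)
}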
If I am using it wrong: Can I find some documentation about the intended usage of the consistency checker?
It was not assumed that one would need to run multiple checks against the same bucket at once. Other than that, you are using it the way it is meant to be used.
I think that a proper consistency check should be able to handle re-running a test without purging the results. In an ideal world, the old check would be preserved and it would be possible to list and read all results which have not yet been purged.
It can be implemented, and it is relatively easy to do. To achieve this, we should consider using generated IDs for consistency jobs instead of composite IDs. @arttor wdyt?
In fact, one of the things that may provoke the errors in your case is the use of composite IDs, since the ID is formed from the concatenated list of locations to validate. To quickly check this, you can try replacing the colon delimiters here with another symbol, e.g. an underscore.
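For illustration, this is the kind of ambiguity a colon-joined composite ID can produce (hypothetical code, not copied from chorus):

package main

import (
	"fmt"
	"strings"
)

// Hypothetical: the check ID is built by concatenating the "storage:bucket"
// locations, reusing ":" as the outer delimiter.
func compositeID(locations []string) string {
	return strings.Join(locations, ":")
}

func main() {
	a := compositeID([]string{"source:chorustest05", "destination:chorustest05dst"})
	b := compositeID([]string{"source:chorustest05", "destination:chorustest05dst", "destination:chorustest05dst02"})
	fmt.Println(a) // source:chorustest05:destination:chorustest05dst
	fmt.Println(b) // source:chorustest05:destination:chorustest05dst:destination:chorustest05dst02
	// a is a plain string prefix of b, and the location boundaries cannot be
	// recovered unambiguously from the joined ID. If such IDs end up as key
	// prefixes in redis, entries of one check could be picked up by another
	// (not verified). A different outer delimiter, e.g. "_", keeps the two
	// levels apart.
}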
My observations are only semi-reproducible: I have observed all of these issues several times, but I have not yet created reliable reproducers.
If there is anything else that can help to look into the matter, please let me know.
To achieve this, we should consider using generated IDs for consistency jobs instead of composite IDs. @arttor wdyt?
We can use generated IDs if it is more convenient. I personally find composite IDs better for three reasons:
- there is not much value in a check history if it is not linked to a particular bucket state, but a bucket always has only its current state
- using a random ID in CLI commands is less convenient than using bucket/storage names to get the results
- Redis memory may be limited, so it is better not to store historical data there
But we can change to generated IDs if the listed issues are not really important/valid. I have also used the consistency check feature only a couple of times.
It was not assumed that one would need to run multiple checks against the same bucket at once.
I think this feature is not really important, but it might be handy to see the results of previous checks. The important thing for me is: if the API accepts a check request while old results are present, the resulting report should be correct. IMO, it is totally fine to either reject the second check request or to automatically discard the previous check, but reporting a false positive is bad.
Side note: it would be cool to invalidate a check, or at least indicate that the bucket was changed, after the consistency check was started. Unfortunately, S3 does not return Last-Modified in the HeadBucket response. But we could invalidate checks if the bucket was changed through the chorus proxy. I don't know if it is useful, but it could be implemented in chorus.