ceph-csi
Confusing "operation with the given volume ID" errors
Describe the bug
Ceph-CSI provisioner containers often output errors like below. To my understanding, this does not constitute a real error, as it only means that another Ceph-CSI process is handling the request. It often confuses Rook users as here https://github.com/rook/rook/issues/7202#issuecomment-780099804. Can these errors be reduced to Info messages?
E0216 20:15:38.416004 1 utils.go:136] ID: 1096 Req-ID: pvc-7f8ea9b2-dbe2-4cee-b6fa-0f5f71705444 GRPC error: rpc error: code = Aborted desc = an operation with the given Volume ID pvc-7f8ea9b2-dbe2-4cee-b6fa-0f5f71705444 already exists
E0216 20:16:40.427998 1 controllerserver.go:239] ID: 1098 Req-ID: pvc-1ac6f22b-2e19-4ead-8330-31bee8c5af16 an operation with the given Volume ID pvc-1ac6f22b-2e19-4ead-8330-31bee8c5af16 already exists
E0216 20:16:40.428033 1 utils.go:136] ID: 1098 Req-ID: pvc-1ac6f22b-2e19-4ead-8330-31bee8c5af16 GRPC error: rpc error: code = Aborted desc = an operation with the given Volume ID pvc-1ac6f22b-2e19-4ead-8330-31bee8c5af16 already exists
E0216 20:17:41.897098 1 controllerserver.go:239] ID: 1100 Req-ID: pvc-0d025ac3-1daa-47f0-a0da-aed85f501b2c an operation with the given Volume ID pvc-0d025ac3-1daa-47f0-a0da-aed85f501b2c already exists
E0216 20:17:41.897134 1 utils.go:136] ID: 1100 Req-ID: pvc-0d025ac3-1daa-47f0-a0da-aed85f501b2c GRPC error: rpc error: code = Aborted desc = an operation with the given Volume ID pvc-0d025ac3-1daa-47f0-a0da-aed85f501b2c already exists
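The "already exists" error in the logs above is the standard CSI way of rejecting a request while an earlier request for the same volume is still in flight. A minimal sketch of that pattern is below; the type and method names here are illustrative, not the actual ceph-csi API, and the real driver wraps the rejection in a gRPC `Aborted` status rather than printing it.

```go
package main

import (
	"fmt"
	"sync"
)

// VolumeLocks tracks volume IDs that have an operation in flight.
// This mirrors the idea behind ceph-csi's internal per-volume lock
// (names are illustrative, not the actual ceph-csi API).
type VolumeLocks struct {
	mu    sync.Mutex
	locks map[string]struct{}
}

func NewVolumeLocks() *VolumeLocks {
	return &VolumeLocks{locks: make(map[string]struct{})}
}

// TryAcquire returns false if an operation for volumeID is already running.
func (v *VolumeLocks) TryAcquire(volumeID string) bool {
	v.mu.Lock()
	defer v.mu.Unlock()
	if _, busy := v.locks[volumeID]; busy {
		return false
	}
	v.locks[volumeID] = struct{}{}
	return true
}

// Release frees the lock so the next retry for volumeID can proceed.
func (v *VolumeLocks) Release(volumeID string) {
	v.mu.Lock()
	defer v.mu.Unlock()
	delete(v.locks, volumeID)
}

func main() {
	locks := NewVolumeLocks()
	id := "pvc-7f8ea9b2-dbe2-4cee-b6fa-0f5f71705444"

	if locks.TryAcquire(id) {
		fmt.Println("first request: proceeding with CreateVolume")
	}
	// A retried request for the same PVC arrives before the first finishes:
	if !locks.TryAcquire(id) {
		// In the real driver this becomes a gRPC Aborted error, as in the logs:
		fmt.Printf("an operation with the given Volume ID %s already exists\n", id)
	}
	locks.Release(id)
}
```

Because the external provisioner sidecar retries on a timer, a single slow or hung first request produces this message on every subsequent retry, which is why the log fills up with them.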
@BlaineEXE I am diplomatic here. On one hand it is correct that this is an informational message and could be logged as Info. On the other hand, it is a request reiteration from a controller in the layer above (such as the sidecar), not really a retry from the CSI driver layer. So I would like to get general opinions (@ShyamsundarR @nixpanic @Madhu-1) here before we make a move.
I agree that it is a confusing message. However, it usually is caused by an earlier failure that prevented the volume from getting created. We may be able to include more details about the original request, so that users have something they can lookup to troubleshoot a little more.
Unfortunately, I have seen occasions where this error is reported because the initial CreateVolume request never returned. In those cases, the error message is actually correct. (This can happen if the RBD pool is configured incorrectly, e.g. with too few PGs.)
I wonder if we can detect actual hangs, and check whether the goroutine is still active or not. If it is not, we should clean up the journal so that the next attempt to create the volume can succeed.
> I agree that it is a confusing message. However, it usually is caused by an earlier failure that prevented the volume from getting created. We may be able to include more details about the original request, so that users have something they can lookup to troubleshoot a little more.
I think this would help quite a bit if it's possible to include more information about where the root of the issue is. It's difficult when users merely report this error line, because it doesn't actually contain any information relevant to finding the source of the issue. Even just more text explaining that this could be a side-effect of another issue would be good and where (if possible) users should look for a different error.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.
We still get questions about this issue in Rook. Is it possible to get more information reported in these messages that can help narrow down the source of these issues when they occur?
@Yuggupta27 @Rakshith-R PTAL
Sure, we can definitely narrow down the returned error, which will help reduce both the noise and the confusion. Will look into this :+1:
Hello, folks. We faced the same problem debugging the rbd plugin. Executing the rbd command from the provisioner pod with the given credentials had worked, but we were still seeing the "already exists" error in the CSI provisioner logs.
Frankly, we did not manage to find the real cause of the issue, because the provisioner suddenly started working normally again. I am writing just to bump the thread one more time; it is still hard to debug such errors.
Maybe a good start would be for the provisioner to "ping" the monitors in a way that is visible to the user?
At the moment a network misconfiguration is extremely easy to miss (the first CreateVolume gRPC request times out) and is followed by a ton of noise (retries being aborted, as described in this issue).
It will be very difficult to get to the source of the problem, as we don't keep track of the exact step in progress for each RPC call.
Removing keepalive to get some attention if possible
Still think that even just a warning saying "mons are unreachable [ip1, ip2, ip3]" in the logs (and maybe even in k8s resource events?) would go a long way toward making this less confusing.
> Still think that even just a warning saying "mons are unreachable [ip1, ip2, ip3]" in logs (and maybe even k8s resource events?) would go a long way into making this less confusing
@Tristan971 it's not always a mon reachability problem; there can be many other reasons as well. If we don't receive any response to the Ceph commands CSI ran, we will see this error message too.
This issue has been automatically closed due to inactivity. Please re-open if this still requires investigation.