ceph-csi

Confusing "operation with the given volume ID" errors

Open BlaineEXE opened this issue 3 years ago • 9 comments

Describe the bug

Ceph-CSI provisioner containers often output errors like the ones below. To my understanding, this is not a real error; it only means that another Ceph-CSI process is already handling the request. It often confuses Rook users, as in https://github.com/rook/rook/issues/7202#issuecomment-780099804. Can these errors be reduced to Info messages?

E0216 20:15:38.416004       1 utils.go:136] ID: 1096 Req-ID: pvc-7f8ea9b2-dbe2-4cee-b6fa-0f5f71705444 GRPC error: rpc error: code = Aborted desc = an operation with the given Volume ID pvc-7f8ea9b2-dbe2-4cee-b6fa-0f5f71705444 already exists
E0216 20:16:40.427998       1 controllerserver.go:239] ID: 1098 Req-ID: pvc-1ac6f22b-2e19-4ead-8330-31bee8c5af16 an operation with the given Volume ID pvc-1ac6f22b-2e19-4ead-8330-31bee8c5af16 already exists
E0216 20:16:40.428033       1 utils.go:136] ID: 1098 Req-ID: pvc-1ac6f22b-2e19-4ead-8330-31bee8c5af16 GRPC error: rpc error: code = Aborted desc = an operation with the given Volume ID pvc-1ac6f22b-2e19-4ead-8330-31bee8c5af16 already exists
E0216 20:17:41.897098       1 controllerserver.go:239] ID: 1100 Req-ID: pvc-0d025ac3-1daa-47f0-a0da-aed85f501b2c an operation with the given Volume ID pvc-0d025ac3-1daa-47f0-a0da-aed85f501b2c already exists
E0216 20:17:41.897134       1 utils.go:136] ID: 1100 Req-ID: pvc-0d025ac3-1daa-47f0-a0da-aed85f501b2c GRPC error: rpc error: code = Aborted desc = an operation with the given Volume ID pvc-0d025ac3-1daa-47f0-a0da-aed85f501b2c already exists
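
For context, the message comes from a per-volume in-flight lock: while one CreateVolume call for a volume ID is still running, any retried call for the same ID is rejected with a gRPC Aborted error. The sketch below only illustrates that pattern; it is not Ceph-CSI's actual code, and the names (VolumeLocks, TryAcquire, createVolume) are chosen just for this example.

```go
// Simplified sketch of the per-volume in-flight lock that produces the
// "operation with the given Volume ID ... already exists" Aborted error.
package inflight

import (
	"sync"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// VolumeLocks tracks volume IDs that have an operation currently in flight.
type VolumeLocks struct {
	mu    sync.Mutex
	locks map[string]struct{}
}

func NewVolumeLocks() *VolumeLocks {
	return &VolumeLocks{locks: make(map[string]struct{})}
}

// TryAcquire returns false if an operation for volumeID is already running.
func (vl *VolumeLocks) TryAcquire(volumeID string) bool {
	vl.mu.Lock()
	defer vl.mu.Unlock()
	if _, busy := vl.locks[volumeID]; busy {
		return false
	}
	vl.locks[volumeID] = struct{}{}
	return true
}

func (vl *VolumeLocks) Release(volumeID string) {
	vl.mu.Lock()
	defer vl.mu.Unlock()
	delete(vl.locks, volumeID)
}

// createVolume shows where the Aborted error originates: the retried call
// is rejected while the first call for the same volume ID is still running.
func createVolume(vl *VolumeLocks, volumeID string) error {
	if !vl.TryAcquire(volumeID) {
		// This is the error the provisioner sidecar logs on every retry.
		return status.Errorf(codes.Aborted,
			"an operation with the given Volume ID %s already exists", volumeID)
	}
	defer vl.Release(volumeID)
	// ... actual provisioning against the Ceph cluster happens here ...
	return nil
}
```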

BlaineEXE avatar Feb 18 '21 20:02 BlaineEXE

@BlaineEXE I am diplomatic here. On one hand, it is correct that this is an informational message and could be logged as Info. From another angle, it is a request reiteration from a controller in the layer above (such as a sidecar), not really a retry from the CSI driver layer. So I would like to get a general opinion (@ShyamsundarR @nixpanic @Madhu-1) here before we make a move.

humblec avatar Feb 19 '21 05:02 humblec

I agree that it is a confusing message. However, it is usually caused by an earlier failure that prevented the volume from getting created. We may be able to include more details about the original request, so that users have something they can look up to troubleshoot a little more.

Unfortunately, I have seen occasions where this error is reported because the initial CreateVolume request never returned. In those cases, the error message is actually correct. (This can happen if the RBD pool is configured incorrectly, for example with too few PGs.)

I wonder if we can detect actual hangs and see whether the go-routine is still active or not. If it is not, we should clean up the journal so that the next attempt to create the volume can succeed.
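
As a rough illustration of that idea, the driver could record when each volume operation started and log a warning once one has been running for too long. This is only a hypothetical sketch; OperationTracker and WarnStuck do not exist in Ceph-CSI today, and a real implementation would still have to decide whether journal cleanup is safe.

```go
// Hypothetical sketch: track the start time of each in-flight volume
// operation so a stuck CreateVolume can be reported explicitly instead of
// only showing up as Aborted retries.
package inflight

import (
	"log"
	"sync"
	"time"
)

type opInfo struct {
	name    string // e.g. "CreateVolume"
	started time.Time
}

type OperationTracker struct {
	mu  sync.Mutex
	ops map[string]opInfo // keyed by volume ID
}

func NewOperationTracker() *OperationTracker {
	return &OperationTracker{ops: make(map[string]opInfo)}
}

func (t *OperationTracker) Start(volumeID, name string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.ops[volumeID] = opInfo{name: name, started: time.Now()}
}

func (t *OperationTracker) Finish(volumeID string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	delete(t.ops, volumeID)
}

// WarnStuck logs any operation that has been running longer than maxAge.
func (t *OperationTracker) WarnStuck(maxAge time.Duration) {
	t.mu.Lock()
	defer t.mu.Unlock()
	for id, op := range t.ops {
		if age := time.Since(op.started); age > maxAge {
			log.Printf("warning: %s for volume %s has been running for %s; "+
				"the original request may be hung", op.name, id, age.Round(time.Second))
		}
	}
}
```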

nixpanic avatar Feb 19 '21 14:02 nixpanic

> I agree that it is a confusing message. However, it is usually caused by an earlier failure that prevented the volume from getting created. We may be able to include more details about the original request, so that users have something they can look up to troubleshoot a little more.

I think it would help quite a bit if it's possible to include more information about where the root of the issue is. It's difficult when users merely report this error line, because it doesn't actually contain any information relevant to finding the source of the problem. Even just more text explaining that this could be a side effect of another issue, and where (if possible) users should look for the underlying error, would help.
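
One hypothetical way to do that, assuming the driver tracked the name and start time of the in-flight operation as in the sketch above, would be to fold that context into the Aborted error itself. abortedWithContext below is an invented helper, not part of Ceph-CSI.

```go
package inflight

import (
	"time"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// abortedWithContext builds a rejection that points users at the original
// request instead of only repeating the volume ID.
func abortedWithContext(volumeID, opName string, started time.Time) error {
	return status.Errorf(codes.Aborted,
		"an operation with the given Volume ID %s already exists "+
			"(%s in progress for %s); check the earlier log entries for this "+
			"volume ID to find the underlying failure",
		volumeID, opName, time.Since(started).Round(time.Second))
}
```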

BlaineEXE avatar Feb 19 '21 18:02 BlaineEXE

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Jul 21 '21 04:07 stale[bot]

We still get questions about this issue in Rook. Is it possible to get more information reported in these messages that can help narrow down the source of these issues when they occur?

BlaineEXE avatar Aug 19 '21 15:08 BlaineEXE

@Yuggupta27 @Rakshith-R PTAL

Madhu-1 avatar Aug 20 '21 06:08 Madhu-1

Sure, we can definitely narrow down the returned error, which will help in reducing the noise and the confusion. Will look into this :+1:

Yuggupta27 avatar Aug 20 '21 06:08 Yuggupta27

Hello, folks. We faced the same problem debugging the rbd plugin. Executing the rbd command from the provisioner pod with the given credentials had worked, but we were still seeing the "already exists" error in the CSI provisioner logs.

Frankly, we did not manage to find the real cause of the issue, because the provisioner suddenly started working normally again. I am writing just to bump the thread one more time. It is still hard to debug such errors.

nabokihms avatar Nov 15 '21 20:11 nabokihms

Maybe a good start would be for the provisioner to "ping" the monitors in a way that is visible to the user?

At the moment a network misconfiguration is extremely easy to miss (the first CreateVolume gRPC request simply times out) and is then followed by a ton of noise (retries being aborted, as described in this issue).
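
As a rough sketch of that idea, the provisioner could attempt a short TCP dial to each monitor endpoint and log the ones it cannot reach before provisioning starts. warnUnreachableMons below is an invented helper, not existing Ceph-CSI code, and the monitor addresses would have to come from the driver's cluster configuration.

```go
// Hypothetical sketch: surface unreachable monitors as an explicit warning
// instead of a hung CreateVolume followed by Aborted retries.
package inflight

import (
	"log"
	"net"
	"time"
)

// warnUnreachableMons reports which monitor endpoints cannot be reached
// within the given timeout.
func warnUnreachableMons(monAddrs []string, timeout time.Duration) {
	var unreachable []string
	for _, addr := range monAddrs {
		conn, err := net.DialTimeout("tcp", addr, timeout)
		if err != nil {
			unreachable = append(unreachable, addr)
			continue
		}
		conn.Close()
	}
	if len(unreachable) > 0 {
		log.Printf("warning: monitors unreachable: %v", unreachable)
	}
}
```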

Tristan971 avatar Jul 28 '22 10:07 Tristan971

It will be very difficult to get to the source of the problem, as we don't keep track of which exact step is in progress for each RPC call.

Removing the keepalive label so this gets some attention, if possible.

Madhu-1 avatar Nov 03 '23 09:11 Madhu-1

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Dec 03 '23 21:12 github-actions[bot]

I still think that even just a warning saying « mons are unreachable [ip1, ip2, ip3] » in the logs (and maybe even in k8s resource events?) would go a long way toward making this less confusing.

Tristan971 avatar Dec 04 '23 08:12 Tristan971

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Jan 03 '24 21:01 github-actions[bot]

> I still think that even just a warning saying « mons are unreachable [ip1, ip2, ip3] » in the logs (and maybe even in k8s resource events?) would go a long way toward making this less confusing.

@Tristan971 it's not always a mon reachability problem; there could be many other reasons as well. If we don't receive any response to the ceph commands CSI ran, we will see this error message too.

Madhu-1 avatar Jan 11 '24 12:01 Madhu-1

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Feb 11 '24 21:02 github-actions[bot]

This issue has been automatically closed due to inactivity. Please re-open if this still requires investigation.

github-actions[bot] avatar Feb 18 '24 21:02 github-actions[bot]