external-attacher Emit events on detach errors

I think kubectl describe or the new kubectl get events provide a great way to quickly see any issues with a cluster. The external-attacher errors are IMO somewhat hidden in the VA status, and it would be nice to emit events on PVCs to increase visibility. I would be happy to provide a patch if that is ok.

Jun 21 '22 13:06 avorima

I think volume attachment errors are available in kubectl describe pod (events on pods, not on PVCs).

Jun 22 '22 11:06 jsafrane

The errors on the pod are not very detailed AFAIK, just some generic "failed" message. The attacher would be able to publish the actual CSI errors, for example.

Jun 22 '22 11:06 avorima

In Pod events I can see:

Warning FailedAttachVolume 0s attachdetach-controller AttachVolume.Attach failed for volume "pvc-510f4944-e913-4af8-b20b-3e96ee7be428" : rpc error: code = Internal desc = unknown Attach error: failed when waiting for zonal op: operation operation-1655453091238-5e1a0357213a9-e777d0de-01bc9636 failed (RESOURCE_IN_USE_BY_ANOTHER_RESOURCE): The disk resource 'projects/xxx/zones/us-east1-d/disks/pvc-510f4944-e913-4af8-b20b-3e96ee7be428' is already being used by 'projects/xxx/zones/us-east1-d/instances/jsafrane-1-tlsxz-master-2'

Where the part starting with rpc error: is actually the error from VolumeAttachment.Status.AttachError

Jun 22 '22 12:06 jsafrane

Let me see if I can reproduce my issue. The errors that I was looking for were detach errors and IIRC they were only present on the VA.

Jun 22 '22 13:06 avorima

Indeed, detach errors are different: the pod that the events would be linked to no longer exists. I am not sure if it's correct to link them to a PVC, because in-line volumes do not have a PVC.

Jun 22 '22 13:06 jsafrane

Ah ok, then my issue title was wrong. It's only about detach errors. Wouldn't it be alright if the event linking only happens when it's not an in-line volume?

Jun 22 '22 13:06 avorima

I find it odd that attach errors are sent to pods while detach errors would go to PVCs... I wonder if it's possible / common to send events to deleted Pods. The only way to get such an event would be kubectl get events.

Jun 22 '22 15:06 jsafrane

I played around with it yesterday and it certainly is possible to create events for objects that don't exist. It would provide some sort of symmetry when using kubectl get events --field-selector involvedObject.name for example.
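As an illustration of why this works, here is a minimal Go sketch, assuming simplified stand-in types rather than the real client-go API. An event carries its involved object as a plain reference (kind, namespace, name), so selecting by involvedObject.name only compares stored strings and does not require the referenced object to exist. The names ObjectRef, Event and filterByInvolvedName, as well as the sample reasons and messages, are hypothetical.

```go
package main

import "fmt"

// ObjectRef is a hypothetical, simplified stand-in for the object
// reference an event carries. It is plain data, not a live lookup.
type ObjectRef struct {
	Kind      string
	Namespace string
	Name      string
}

// Event is a hypothetical, simplified stand-in for a Kubernetes event.
type Event struct {
	InvolvedObject ObjectRef
	Reason         string
	Message        string
}

// filterByInvolvedName mimics what a field selector on
// involvedObject.name does: it compares the stored name only and
// never checks whether the referenced object still exists.
func filterByInvolvedName(events []Event, name string) []Event {
	var out []Event
	for _, e := range events {
		if e.InvolvedObject.Name == name {
			out = append(out, e)
		}
	}
	return out
}

func main() {
	events := []Event{
		// The pod "web-0" is assumed to be deleted already; the event
		// still refers to it by name.
		{ObjectRef{"Pod", "default", "web-0"}, "ExampleDetachFailed", "detach failed (sample message)"},
		{ObjectRef{"Pod", "default", "web-1"}, "ExampleAttachFailed", "attach failed (sample message)"},
	}
	for _, e := range filterByInvolvedName(events, "web-0") {
		fmt.Printf("%s/%s %s: %s\n", e.InvolvedObject.Kind, e.InvolvedObject.Name, e.Reason, e.Message)
	}
}
```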

Jun 23 '22 10:06 avorima

Alright, then the real fix should be in kubernetes/kubernetes.

The A/D controller sends events on volume attach errors here and here, and it does not send anything on detach errors. I am not sure there are any ScheduledPods at that time though. Still, it should give you an idea where to start.

Jun 23 '22 12:06 jsafrane

Wouldn't that be kind of far from the error source? This detach operation executor for CSI basically just deletes the VA, so it doesn't directly see the error that the CSI driver produces. I guess the error in the VA status could somehow be turned into an event, but would also affect other types of volumes which I can't test or verify.
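A rough Go sketch of that idea follows, using simplified stand-in types and not the actual external-attacher or A/D controller code. It assumes the detach error is read from the VA status, that in-line volumes (no PersistentVolume name on the VA) are skipped as discussed above, and that the PVC is found through a caller-supplied PV-to-claim map. The function detachEvent, the pvToClaim map and the reason string "FailedDetachVolume" are hypothetical names chosen for illustration.

```go
package main

import "fmt"

// VolumeError is a simplified stand-in for the error stored in a
// VolumeAttachment status.
type VolumeError struct {
	Message string
}

// VolumeAttachment is a hypothetical, flattened stand-in for the real
// object: PersistentVolumeName is nil for in-line volumes.
type VolumeAttachment struct {
	Name                 string
	NodeName             string
	PersistentVolumeName *string
	DetachError          *VolumeError
}

// Event is a hypothetical, simplified event.
type Event struct {
	Type         string
	Reason       string
	Message      string
	InvolvedKind string
	InvolvedName string // "namespace/name" of the PVC in this sketch
}

// detachEvent turns a detach error on a VA into a warning event on the
// PVC. It returns false when there is no detach error, when the volume
// is in-line, or when no claim is known for the PV.
func detachEvent(va VolumeAttachment, pvToClaim map[string]string) (Event, bool) {
	if va.DetachError == nil || va.PersistentVolumeName == nil {
		return Event{}, false
	}
	claim, ok := pvToClaim[*va.PersistentVolumeName]
	if !ok {
		return Event{}, false
	}
	return Event{
		Type:   "Warning",
		Reason: "FailedDetachVolume", // assumed reason string
		Message: fmt.Sprintf("detach of volume %q from node %q failed: %s",
			*va.PersistentVolumeName, va.NodeName, va.DetachError.Message),
		InvolvedKind: "PersistentVolumeClaim",
		InvolvedName: claim,
	}, true
}

func main() {
	pv := "pv-example"
	va := VolumeAttachment{
		Name:                 "va-example",
		NodeName:             "node-a",
		PersistentVolumeName: &pv,
		DetachError:          &VolumeError{Message: "sample CSI detach error"},
	}
	if ev, ok := detachEvent(va, map[string]string{"pv-example": "default/data"}); ok {
		fmt.Printf("%s %s %s/%s: %s\n", ev.Type, ev.Reason, ev.InvolvedKind, ev.InvolvedName, ev.Message)
	}
}
```

Keeping the in-line check inside the helper means callers do not need to know which volume kinds have a PVC; whether such a helper belongs in the attacher or in kubernetes/kubernetes is exactly the open question in this thread.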

In the end this is just to make these errors more visible for users who may not be as aware of VAs. So I could also look into turning this into a documentation enhancement.

Jun 23 '22 14:06 avorima

Wouldn't that be kind of far from the error source?

It is a generic place where all attach errors are reported, and thus detach errors should be there too. While we're migrating most volume plugins to CSI, there are still some that are in-tree and may benefit from the improved error reporting.

I guess the error in the VA status could somehow be turned into an event, but would also affect other types of volumes which I can't test or verify.

That is the goal: a common place for all detach errors. We can help with in-tree error verification.

And yes, you can just improve our docs! You're more than welcome.

Jun 24 '22 11:06 jsafrane

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle stale
Mark this issue or PR as rotten with /lifecycle rotten
Close this issue or PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

Sep 22 '22 12:09 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle rotten
Close this issue or PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

Oct 22 '22 13:10 k8s-triage-robot

@avorima are you working on this ?

Nov 10 '22 14:11 humblec

@humblec Sorry, I kind of lost track of this. I think we came to the conclusion that the issue is somewhat misplaced here. Work to improve things should probably be done in k/k and k/website. Issues in those repos can still link back here for documentation purposes, but I think it can be closed.

Nov 14 '22 14:11 avorima

@humblec Sorry, I kind of lost track of this. I think we came to the conclusion that the issue is somewhat misplaced here. Work to improve things should probably be done in k/k and k/website. Issues in those repos can still link back here for documentation purposes, but I think it can be closed.

No worries, and thanks for revisiting this issue @avorima. If we haven't opened the trackers in k/k or k/website, I can take that up and close this issue for now.

Nov 14 '22 15:11 humblec

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Reopen this issue with /reopen
Mark this issue as fresh with /remove-lifecycle rotten
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Dec 14 '22 16:12 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied

After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied

After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Reopen this issue with /reopen

Mark this issue as fresh with /remove-lifecycle rotten

Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Dec 14 '22 16:12 k8s-ci-robot
