
[observability] create an alert when file descriptors are exhausted


Is your feature request related to a problem? Please describe

Create an alert when file descriptors are exhausted.

Describe the behaviour you'd like

Have an alert fire when the available file descriptors are exhausted.

Describe alternatives you've considered

None

Additional context

https://gitpod.slack.com/archives/C04245JPHKL/p1663083593170859

jenting avatar Sep 14 '22 02:09 jenting

I tried different views of the metric process_open_fds to find one that would let us send the alert as early as possible.


However, in the view with the sum of process_open_fds, the value at 09/13 05:30 UTC is still low.

We could consider sending an alert when the sum of process_open_fds exceeds 3,000,000, but that means the alert would not fire until after 09/13 10:00 UTC, which I think is too late.
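
For illustration, the cluster-wide rule discussed above would look roughly like this; the alert name and `for` duration are placeholders, not agreed values:

```yaml
groups:
  - name: file-descriptors
    rules:
      # Sketch only: fires once the cluster-wide sum of open fds
      # crosses 3,000,000 -- which, per the graph above, is too late.
      - alert: ClusterOpenFDsHigh
        expr: sum(process_open_fds) > 3000000
        for: 5m
        labels:
          severity: warning
```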


Any suggestions for making the alert fire as early as possible? @kylos101

jenting avatar Sep 16 '22 06:09 jenting

@jenting can you share the query you found most promising?

kylos101 avatar Sep 19 '22 05:09 kylos101

:wave: hey @jenting, have you tried an alert like this? https://www.robustperception.io/alerting-on-approaching-open-file-limits/
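
If I read the article right, the pattern is to alert on the ratio of open fds to each process's own limit, rather than on an absolute count. A minimal sketch of that pattern (the threshold and alert name are my own choices, not the article's exact rule):

```yaml
groups:
  - name: process-file-descriptors
    rules:
      # Alert when any scraped process has used >80% of its fd limit.
      - alert: ProcessNearFDLimit
        expr: process_open_fds / process_max_fds > 0.8
        for: 10m
```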

kylos101 avatar Sep 21 '22 01:09 kylos101

Hey @jenting , I recommend trying to write an alert that is node- or workspace-based, rather than cluster-based.

kylos101 avatar Sep 21 '22 01:09 kylos101

For the node-based metric node_filefd_allocated, see the Grafana query.

If we write a node-based alert whose criterion is current file descriptors / total file descriptors, we can see that the current count is far below the total (see the Grafana query).

Therefore, we can't use node_filefd_allocated{cluster="<cluster-name>"} / node_filefd_maximum{cluster="<cluster-name>"} as the alert rule; at our current usage it would never have fired for this incident.
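
For reference, a quick way to eyeball the per-node headroom in Grafana is the ratio below; `topk` just surfaces the busiest nodes, and the `cluster` label value is a placeholder:

```promql
# Five nodes with the highest fd usage ratio in the cluster
topk(5, node_filefd_allocated{cluster="<cluster-name>"} / node_filefd_maximum{cluster="<cluster-name>"})
```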

jenting avatar Sep 22 '22 05:09 jenting

@jenting I recommend handing this off to @utam0k, as he is on-call this week, to see if he can finish it.

@utam0k perhaps you could look later this week?

kylos101 avatar Oct 05 '22 04:10 kylos101

@jenting This incident was caused by ws-manager with PVC after all, right? So I don't think the alert really needs to fire until just before the node's file descriptors are depleted, i.e. at around 80% usage. What do you think?

utam0k avatar Oct 14 '22 08:10 utam0k

This incident was caused by ws-manager with PVC after all, right? So I don't think the alert really needs to fire until just before the node's file descriptors are depleted, i.e. at around 80% usage. What do you think?

I agree with you. We could set the threshold at 80%. I checked our overall fd usage, and we are far below 80% (if I remember correctly, under 10%).
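
A minimal sketch of such a rule, assuming the node-level metrics discussed above (the alert name, `for` duration, and severity label are placeholders):

```yaml
groups:
  - name: node-file-descriptors
    rules:
      # Fire only when a node is close to fd depletion (~80% used),
      # per the threshold discussed above.
      - alert: NodeFileDescriptorsNearLimit
        expr: node_filefd_allocated / node_filefd_maximum > 0.8
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }} has used more than 80% of its file descriptors"
```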

jenting avatar Oct 14 '22 08:10 jenting

The supervisor's fd problem was likely a side effect, not the root cause.

utam0k avatar Oct 14 '22 08:10 utam0k

The supervisor's fd problem was likely a side effect, not the root cause.

Yes, I think ws-manager failed to handle any pod events, so every component that interacts with ws-manager might be impacted, for example, a component that does not handle its connection to ws-manager correctly.

jenting avatar Oct 14 '22 08:10 jenting

@utam0k does this mean we no longer need an alert? If so, what else is needed before we close this issue? To recap, the intent of this issue was to create an alert.

kylos101 avatar Oct 14 '22 14:10 kylos101

@utam0k if we no longer need an alert, can you please close this issue as not planned?

@jenting is there a separate issue that needs to be created to solve "ws-manager failed to handle any pod event"? If yes, can you share whether this is related to PVC or general? I ask to limit scope, so we can focus on closing this issue (either by creating an alert, or closing this because we don't need an alert and creating a separate issue to track if needed).

kylos101 avatar Oct 14 '22 21:10 kylos101

@jenting is there a separate issue that needs to be created to solve "ws-manager failed to handle any pod event"? If yes, can you share whether this is related to PVC or general?

No, we don't need to create a new issue to solve "ws-manager failed to handle any pod event".

Let's link to the culprit issue #13007 and close this one.

jenting avatar Oct 17 '22 23:10 jenting

Okay, thanks! I will close this issue as won't fix.

kylos101 avatar Oct 18 '22 02:10 kylos101

Thanks a lot @jenting and @kylos101

utam0k avatar Oct 18 '22 05:10 utam0k