[observability] create an alert when file descriptors exhausted
Is your feature request related to a problem? Please describe
Create an alert when file descriptors are exhausted.
Describe the behaviour you'd like
Have an alert fire when file descriptors are exhausted.
Describe alternatives you've considered
None
Additional context
https://gitpod.slack.com/archives/C04245JPHKL/p1663083593170859
I tried different views of the metric process_open_fds, to send the alerts as soon as possible.
However, in the view with the sum of process_open_fds, the value at 09/13 05:30 UTC is still low.
We could consider sending an alert if the sum of process_open_fds is over 3,000,000, but that means the alert would only fire after 09/13 10:00 UTC, which I think is too late.
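For reference, the cluster-wide threshold described above could be sketched as a Prometheus alerting rule like the following (the 3,000,000 threshold is from this discussion; the rule name and cluster label value are illustrative assumptions):

```yaml
groups:
  - name: cluster-file-descriptors
    rules:
      - alert: ClusterOpenFDSumHigh
        # Sum of open file descriptors across all processes in the cluster
        expr: sum(process_open_fds{cluster="<cluster-name>"}) > 3000000
        labels:
          severity: warning
        annotations:
          summary: "Cluster-wide open file descriptor count exceeds 3,000,000"
```

As noted above, though, this threshold fires hours after the problem starts, which is the core objection.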
Any suggestions for making the alert fire as soon as possible? @kylos101
@jenting can you share the query you found most promising?
:wave: hey @jenting , have you tried an alert like this? https://www.robustperception.io/alerting-on-approaching-open-file-limits/
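The linked article alerts on each process approaching its own limit, using the process_open_fds and process_max_fds metrics exposed by Prometheus client libraries. A minimal sketch of that style of rule (the rule name, 80% threshold, and `for` duration here are illustrative, not from this thread):

```yaml
groups:
  - name: process-file-descriptors
    rules:
      - alert: ProcessNearFDLimit
        # Fire when any process has used more than 80% of its fd limit
        expr: process_open_fds / process_max_fds > 0.8
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.job }} on {{ $labels.instance }} is near its file descriptor limit"
```

Because the ratio is per process, this can catch a single leaking process long before cluster-wide sums move.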
Hey @jenting , I recommend trying to write an alert that is node or workspace based, rather than cluster based.
For the node-based metric node_filefd_allocated, see the Grafana query.
If we write an alert that is node-based, with the criterion current file descriptors / total file descriptors, we can see that the current file descriptor count is far below the total (see the Grafana query). Therefore, we can't use node_filefd_allocated{cluster="<cluster-name>"}/node_filefd_maximum{cluster="<cluster-name>"} as the alert rule.
@jenting I recommend handing this off to @utam0k , as he is on-call this week, to see if he can finish.
@utam0k perhaps you could look later this week?
@jenting This incident was caused by ws-manager with PVC after all, right? So I don't think the alert really needs to fire until just before the node's file descriptors are depleted, at about 80% usage. What do you think?
I agree with you. We could set the threshold at 80%. I checked our overall fd usage, and we are far below 80% (if I remember correctly, the fd usage is under 10%).
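A node-based rule with the 80% threshold agreed above could be sketched as follows (the rule name, `for` duration, and severity label are assumptions, not from this thread):

```yaml
groups:
  - name: node-file-descriptors
    rules:
      - alert: NodeFileDescriptorsNearLimit
        # node_filefd_allocated / node_filefd_maximum come from the node exporter
        expr: |
          node_filefd_allocated{cluster="<cluster-name>"}
            / node_filefd_maximum{cluster="<cluster-name>"} > 0.8
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Node {{ $labels.instance }} has used over 80% of its file descriptors"
```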
The problem with the fd of the supervisor should have been a side issue and not the root cause.
Yes, I think ws-manager failed to handle any pod events, so all components that interact with ws-manager could be impacted. For example, a component might not handle its connection to ws-manager correctly.
@utam0k does this mean we no longer need an alert? If so, what else is needed before we close this issue? To recap, the intent of this issue was to create an alert.
@utam0k if we no longer need an alert, please close this issue as not planned?
@jenting is there a separate issue that needs to be created to solve "ws-manager failed to handle any pod event"? If yes, can you share whether this is related to PVC or is general? I ask to limit scope, so we can focus on closing this issue (either by creating an alert, or closing it because we don't need an alert) and making a separate issue to track the rest if needed.
No, we don't need to create a new issue to solve ws-manager failed to handle any pod event.
Let's link to the culprit issue #13007 and close this one.
Okay, thanks! I will close this issue as won't fix.
Thanks a lot @jenting and @kylos101