WMCore
Frequent AgentDrainMode Alerts
Impact of the bug
WMAgent
Describe the bug
WMAgent has been frequently sending email alerts with the following message:
Agent had a drain status transition to AgentDrainMode = False.
When the disk usage for /data exceeds 85%, the agents automatically set AgentDrainMode to true. This places the agent in drain mode, preventing it from accepting new workflows, and an email alert is sent.
As workflows in the agent clear out and get archived, the disk usage falls and the agent sets AgentDrainMode back to false, so the cycle repeats.
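To make the cycle concrete, here is a minimal sketch of the threshold check described above. The function names and the sample readings are hypothetical and not taken from WMCore; only the 85% threshold comes from the description. It also assumes every flip of AgentDrainMode sends one email alert.

```python
DISK_THRESHOLD_PERCENT = 85  # threshold quoted above for /data


def wants_agent_drain(disk_usage_percent, threshold=DISK_THRESHOLD_PERCENT):
    """Return True when disk usage exceeds the threshold (hypothetical helper)."""
    return disk_usage_percent > threshold


def count_transitions(usage_samples, threshold=DISK_THRESHOLD_PERCENT):
    """Count how often AgentDrainMode would flip for a series of usage readings."""
    agent_drain = False
    transitions = 0
    for usage in usage_samples:
        new_state = wants_agent_drain(usage, threshold)
        if new_state != agent_drain:
            transitions += 1  # assumption: each flip sends one email alert
            agent_drain = new_state
    return transitions


# Made-up readings hovering around the threshold
samples = [80, 86, 84, 87, 83, 88]
print(count_transitions(samples))  # prints 5
```

With readings that hover around the threshold, almost every polling cycle flips the flag, which matches the alert frequency reported here.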
How to reproduce it
Steps to reproduce the behavior: described above.
Expected behavior
Based on Alan's suggestion in the email thread, if the agent is in UserDrainMode, we enable AgentDrainMode (if disk usage is above the threshold) and never switch it back to False.
Thank you for creating this issue, Ahmed!
Yes, you described it well and the solution looks good to me. Rephrasing the solution in different words, here is my suggestion/pseudo-code:
    if we want to enable AgentDrainMode and UserDrainMode=True:
        we set AgentDrainMode=True and fire up an alarm
    elif we want to disable AgentDrainMode and UserDrainMode=True:
        log that the agent no longer needs to be in AgentDrainMode, but don't switch it back
    elif we want to enable AgentDrainMode and UserDrainMode=False:
        we set AgentDrainMode=True and fire up an alarm
    elif we want to disable AgentDrainMode and UserDrainMode=False:
        we set AgentDrainMode=False
BTW, this implementation is around these lines: https://github.com/dmwm/WMCore/blob/master/src/python/WMComponent/AgentStatusWatcher/DrainStatusPoller.py#L90
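For illustration, the four cases in the pseudo-code could be sketched in Python roughly as follows. This is not the code in DrainStatusPoller.py: the function name, its parameters, and the `send_alert` callable are hypothetical stand-ins. The sketch also assumes the alarm should fire only when AgentDrainMode actually changes from False to True, which the pseudo-code does not spell out. Since the two "enable" cases behave identically, they collapse into one branch.

```python
import logging

logger = logging.getLogger(__name__)


def next_agent_drain_mode(want_drain, user_drain_mode, current, send_alert):
    """Return the new AgentDrainMode value (hypothetical helper).

    want_drain: True if the disk check asks to enable AgentDrainMode.
    user_drain_mode: current value of UserDrainMode.
    current: current value of AgentDrainMode.
    send_alert: callable taking a message; stands in for the alert mechanism.
    """
    if want_drain:
        # Enable regardless of UserDrainMode
        if not current:
            # Assumption: alarm only on an actual False -> True transition
            send_alert("Agent had a drain status transition to AgentDrainMode = True.")
        return True
    if user_drain_mode:
        # Disable requested while UserDrainMode=True: log it, keep the state
        if current:
            logger.info("AgentDrainMode no longer needed, but kept because UserDrainMode=True")
        return current
    # Disable requested and UserDrainMode=False: switch back
    return False


# Usage: with UserDrainMode=True the flag latches, so only one alert is sent
alerts = []
state = False
for want in [True, False, True]:
    state = next_agent_drain_mode(want, True, state, alerts.append)
```

After this loop `state` stays True and `alerts` holds a single message, so the repeated transitions back to False no longer occur while UserDrainMode is set.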