
Frequent AgentDrainMode Alerts

Open · hassan11196 opened this issue 10 months ago • 1 comment

Impact of the bug
WMAgent

Describe the bug
WMAgent has been frequently sending email alerts with the following message: Agent had a drain status transition to AgentDrainMode = False.

When the disk usage for /data exceeds 85%, the agent automatically sets AgentDrainMode to true. This places the agent in drain mode, preventing it from accepting new workflows, and an email alert is sent. As workflows in the agent clear out and get archived, the disk usage falls and the agent sets AgentDrainMode back to false, which sends another alert. The cycle then repeats.
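The cycle described above can be illustrated with a small simulation. This is only a sketch and not WMCore code: the function name `simulate_alerts`, the sample disk usage values, and the assumption that every transition in either direction produces one alert are all illustrative. Only the 85% threshold and the alert wording come from this issue.

```python
# Hypothetical simulation of the alert flapping; not taken from WMCore.
DISK_THRESHOLD_PERCENT = 85  # threshold quoted in this issue


def simulate_alerts(disk_usage_samples, threshold=DISK_THRESHOLD_PERCENT):
    """Return the alert messages produced when AgentDrainMode simply
    follows the disk usage (assumed current behavior)."""
    agent_drain_mode = False
    alerts = []
    for usage in disk_usage_samples:
        wants_drain = usage > threshold
        if wants_drain != agent_drain_mode:
            # Assumption: every transition, in either direction, sends an alert.
            agent_drain_mode = wants_drain
            alerts.append(
                "Agent had a drain status transition to AgentDrainMode = %s"
                % agent_drain_mode
            )
    return alerts


# Made-up disk usage percentages hovering around the threshold.
samples = [80, 86, 84, 87, 83]
alerts = simulate_alerts(samples)
for message in alerts:
    print(message)
```

With usage hovering around the threshold, almost every poll flips the flag, which is the alert noise reported here.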

How to reproduce it
Steps to reproduce the behavior: Described above.

Expected behavior
Based on Alan's suggestion in the email thread, if the agent is in UserDrainMode, we enable AgentDrainMode (if disk usage is above the threshold) and never switch it back to False.

hassan11196, Apr 26 '24 17:04

Thank you for creating this issue, Ahmed!

Yes, you described it well and the solution looks good to me. Rephrasing the solution in different words, here is my suggestion/pseudo-code:

if we want to enable AgentDrainMode and UserDrainMode=True:
    we set AgentDrainMode=True and fire up an alarm
elif we want to disable AgentDrainMode and UserDrainMode=True:
    log that the agent no longer needs to be in AgentDrainMode, but don't switch it back
elif we want to enable AgentDrainMode and UserDrainMode=False:
    we set AgentDrainMode=True and fire up an alarm
elif we want to disable AgentDrainMode and UserDrainMode=False:
    we set AgentDrainMode=False
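The four cases above could be sketched in Python roughly as follows. This is a minimal illustration, not the actual DrainStatusPoller code: the function name `resolve_agent_drain_mode`, its parameters, and the returned action strings are hypothetical, and it assumes the caller invokes it only when a change of AgentDrainMode has been requested.

```python
# Hypothetical sketch of the proposed decision logic; not taken from WMCore.
def resolve_agent_drain_mode(wants_drain, user_drain_mode, current_agent_drain_mode):
    """Return (new AgentDrainMode, action), where action is one of
    "alarm", "log" or "none" (illustrative labels)."""
    if wants_drain:
        # Cases 1 and 3: enabling behaves the same whatever UserDrainMode is.
        return True, "alarm"
    if user_drain_mode:
        # Case 2: only log that drain is no longer needed; keep the current value.
        return current_agent_drain_mode, "log"
    # Case 4: no user drain requested, so it is safe to switch back.
    return False, "none"


# Example: disk usage dropped, but the operator put the agent in UserDrainMode.
new_mode, action = resolve_agent_drain_mode(
    wants_drain=False, user_drain_mode=True, current_agent_drain_mode=True
)
print(new_mode, action)
```

Since the two "enable" branches of the pseudo-code do the same thing, the sketch folds them into one. Only the "disable" path depends on UserDrainMode.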

BTW, this implementation is around these lines: https://github.com/dmwm/WMCore/blob/master/src/python/WMComponent/AgentStatusWatcher/DrainStatusPoller.py#L90

amaltaro, Apr 26 '24 17:04