WMCore
MSRuleCleaner not processing workflows and not archiving workflows
Impact of the bug Inform which systems get affected by this bug. Which agent(s)? Which central service(s)?
Describe the bug MSRuleCleaner has been building up a large backlog for more than a week now. It turns out that most of the service cycles fail with a 502 Bad Gateway returned by NGINX. We also run into occasional problems with the CMSWEB frontends, which only makes the situation worse for MSRuleCleaner.
How to reproduce it Steps to reproduce the behavior:
Expected behavior MSRuleCleaner processes workflows and archives them appropriately
Additional context and error message Here is a full error message:
2024-07-11 20:15:56,801:ERROR:MSRuleCleaner: ('Unknown exception while fetching requests from ReqMgr2. Error: %s', "url=https://cmsweb.cern.ch:8443/reqmgr2/data/request?status=announced&detail=True, code=502, reason=Bad Gateway, headers={'Date': 'Thu, 11 Jul 2024 20:15:56 GMT', 'Server': 'Apache', 'Content-Type': 'text/html', 'Content-Length': '150', 'CMS-Server-Time': 'D=54890905 t=1720728901909478'}, result=b'<html>\\r\\n<head><title>502 Bad Gateway</title></head>\\r\\n<body>\\r\\n<center><h1>502 Bad Gateway</h1></center>\\r\\n<hr><center>nginx</center>\\r\\n</body>\\r\\n</html>\\r\\n'")
Dumping the cURL output to a JSON file produces a very large file (~220 MB). However, it is only that large because we have been unable to consume any requests for more than 7 days.
There are currently no limits configured in NGINX, so it could be that the frontend pod decided to terminate this large request.
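To check the payload size without writing the full dump to disk, the response can be streamed and counted in chunks. The following is only a minimal sketch and not how MSRuleCleaner fetches requests: the helper names `count_bytes` and `measure_response_size`, the certificate and key paths, and the timeout are all assumptions, and the CERN CA chain may need to be added to the SSL context for certificate verification to pass.

```python
import io
import ssl
import urllib.request

# URL taken from the error message above.
URL = "https://cmsweb.cern.ch:8443/reqmgr2/data/request?status=announced&detail=True"


def count_bytes(stream, chunk_size=1024 * 1024):
    """Read a file-like object chunk by chunk and return the total byte count."""
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        total += len(chunk)
    return total


def measure_response_size(url, cert_file, key_file, timeout=600):
    """Fetch url with a client certificate; return (HTTP status, body size in bytes).

    Hypothetical helper: cert_file and key_file are paths to the user's
    certificate and key. An error status such as 502 makes urlopen raise
    urllib.error.HTTPError instead of returning.
    """
    context = ssl.create_default_context()
    context.load_cert_chain(certfile=cert_file, keyfile=key_file)
    request = urllib.request.Request(url, headers={"Accept": "application/json"})
    with urllib.request.urlopen(request, context=context, timeout=timeout) as response:
        return response.status, count_bytes(response)


# Offline check of the counting logic with an in-memory stream.
print(count_bytes(io.BytesIO(b"x" * 2500), chunk_size=1000))
```

Calling `measure_response_size(URL, "/path/to/usercert.pem", "/path/to/userkey.pem")` (placeholder paths) would then report the size of a successful response, or raise on a 502, which helps tell a size-related termination apart from a generic frontend failure.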