MSRuleCleaner not processing workflows and not archiving workflows

Open d-ylee opened this issue 7 months ago • 9 comments

Impact of the bug Inform which systems get affected by this bug. Which agent(s)? Which central service(s)?

Describe the bug MSRuleCleaner has been building up a large backlog for more than a week now. It turns out that most of the service cycles are failing at NGINX with 502 Bad Gateway. We also seem to have intermittent problems with the CMSWEB frontends, which only makes the situation worse for MSRuleCleaner.

How to reproduce it Steps to reproduce the behavior:

Expected behavior MSRuleCleaner processes workflows and archives them appropriately.

Additional context and error message Here is a full error message:

2024-07-11 20:15:56,801:ERROR:MSRuleCleaner: ('Unknown exception while fetching requests from ReqMgr2. Error: %s', "url=https://cmsweb.cern.ch:8443/reqmgr2/data/request?status=announced&detail=True, code=502, reason=Bad Gateway, headers={'Date': 'Thu, 11 Jul 2024 20:15:56 GMT', 'Server': 'Apache', 'Content-Type': 'text/html', 'Content-Length': '150', 'CMS-Server-Time': 'D=54890905 t=1720728901909478'}, result=b'<html>\\r\\n<head><title>502 Bad Gateway</title></head>\\r\\n<body>\\r\\n<center><h1>502 Bad Gateway</h1></center>\\r\\n<hr><center>nginx</center>\\r\\n</body>\\r\\n</html>\\r\\n'")

Dumping the cURL output to a JSON file produces a very large file (~220MB). However, it is only that large because we have been unable to consume any requests for more than 7 days.
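To check the payload size without writing a ~220MB file to disk, the response can be streamed and counted chunk by chunk. The following is only a minimal sketch: the URL is the one from the error message above, while the helper names `total_size` and `iter_response`, the certificate and key paths, and the 1 MiB chunk size are assumptions made for illustration. It also assumes the CMSWEB server certificate is trusted by the default SSL context, which may not hold on every machine.

```python
import ssl
import urllib.request

# URL copied from the error message above.
URL = "https://cmsweb.cern.ch:8443/reqmgr2/data/request?status=announced&detail=True"


def total_size(chunks):
    """Sum the byte length of an iterable of chunks without storing them."""
    return sum(len(chunk) for chunk in chunks)


def iter_response(url, cert, key, chunk_size=1 << 20):
    """Yield the response body in chunks of at most chunk_size bytes.

    cert and key are paths to a client certificate and private key
    (hypothetical values, supplied by the caller).
    """
    ctx = ssl.create_default_context()
    ctx.load_cert_chain(certfile=cert, keyfile=key)
    req = urllib.request.Request(url, headers={"Accept": "application/json"})
    # urlopen raises urllib.error.HTTPError on a 502, so a failing cycle
    # surfaces as an exception rather than as a short body.
    with urllib.request.urlopen(req, context=ctx) as resp:
        while True:
            chunk = resp.read(chunk_size)
            if not chunk:
                break
            yield chunk
```

A call such as `total_size(iter_response(URL, "usercert.pem", "userkey.pem")) / 1e6` would then give the size in MB, with the two file names standing in for whatever credentials are actually used.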

There are currently no limits configured in NGINX, so it could be that the frontend pod decided to terminate this large request.

d-ylee, Jul 12 '24 19:07