WMCore
ACDC failures due to missing policy args - DQMHarvest not supported
Impact of the bug
GlobalWorkQueue
Describe the bug
While going through the Global WorkQueue logs, I stumbled on the following exception [1]. At first glance, it happens only for ACDC workflows that rely on default policy parameters:
INFO:reqmgrInteraction:Splitting /haozturk_ACDC0_Run2022D_BTagMu_10Dec2022_221221_171345_19/DataProcessingMergeDQMoutputEndOfRunDQMHarvestMerged with policy name ResubmitBlock and policy params {'name': 'ResubmitBlock', 'args': {}}
These parameters are set here: https://github.com/dmwm/WMCore/blob/d89fc9ddee0e405e09f1deaa4fdeb895bf445947/src/python/WMCore/WorkQueue/WorkQueue.py#L1063
These types of ACDCs end up referring to ResubmitBlock: https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/WorkQueue/Policy/Start/ResubmitBlock.py
NOTE: Although the traceback shows the error passing through File "/usr/local/lib/python3.8/site-packages/WMCore/WorkQueue/Policy/Start/StartPolicyInterface.py", line 160, in __call__ self.split(), the ResubmitBlock policy overrides the default self.split() method from https://github.com/dmwm/WMCore/blob/d89fc9ddee0e405e09f1deaa4fdeb895bf445947/src/python/WMCore/WorkQueue/Policy/Start/StartPolicyInterface.py#L48 with its own: https://github.com/dmwm/WMCore/blob/d89fc9ddee0e405e09f1deaa4fdeb895bf445947/src/python/WMCore/WorkQueue/Policy/Start/ResubmitBlock.py#L50, in which no key NumberOfRuns is set. How we end up referring to this key in the policy object is still a mystery to me.
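One way to read the traceback is that the lookup `block[self.args['SliceType']]` is reached with `SliceType` equal to `NumberOfRuns` while the block summary has no such key. The sketch below only illustrates that failure mode under this assumption. The dictionary contents, the `estimate_jobs` helper, and the `slice_size` parameter are hypothetical and not taken from WMCore; the divisor in the traceback line is truncated, so `slice_size` is merely a stand-in.

```python
from math import ceil

# Hypothetical ACDC block summary; the keys and values are illustrative assumptions.
block = {"NumberOfFiles": 12, "NumberOfEvents": 48000, "NumberOfLumis": 240}

# Hypothetical effective policy arguments; 'SliceType' matches the key named in the KeyError.
args = {"SliceType": "NumberOfRuns"}


def estimate_jobs(block, args, slice_size=1):
    """Mirror the lookup seen in the traceback.

    slice_size is a hypothetical stand-in for the truncated divisor.
    """
    return ceil(float(block[args["SliceType"]]) / slice_size)


try:
    estimate_jobs(block, args)
    error = None
except KeyError as exc:
    # The missing key is carried as the first argument of the exception.
    error = exc.args[0]

print(error)
```

If this reading is right, the open question is where `SliceType` picks up the value `NumberOfRuns`, given that the logged policy params show empty `args`.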
In addition, those workflows are constantly retried and are filling the GWQ logs.
How to reproduce it
I have not yet figured out the full set of ACDC parameters that trigger this behavior.
Expected behavior
All possible ACDC parameters should be properly mapped to the given policy.
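As a rough illustration of what such a mapping check could look like, the sketch below validates the slice type against the keys present in the block before it is used, and raises a descriptive error instead of a bare KeyError. This is not the WMCore implementation: `UnsupportedSliceType`, `resolve_slice_value`, and the `NumberOfFiles` default are hypothetical names and choices made only for this example.

```python
class UnsupportedSliceType(Exception):
    """Hypothetical error for a slice type the block summary cannot serve."""


def resolve_slice_value(block, args, default_slice_type="NumberOfFiles"):
    """Return the block value for the requested slice type.

    Falls back to default_slice_type (an assumption) when 'SliceType' is unset,
    and fails with an explicit message when the block lacks the requested key.
    """
    slice_type = args.get("SliceType", default_slice_type)
    if slice_type not in block:
        raise UnsupportedSliceType(
            "SliceType %r not available in block; known keys: %s"
            % (slice_type, sorted(block))
        )
    return float(block[slice_type])


# Hypothetical block summary without a 'NumberOfRuns' entry.
block = {"NumberOfFiles": 12, "NumberOfLumis": 240}

value_with_default = resolve_slice_value(block, {})

try:
    resolve_slice_value(block, {"SliceType": "NumberOfRuns"})
    message = None
except UnsupportedSliceType as exc:
    message = str(exc)
```

A check of this kind would also give the queue a clear reason to stop retrying the workflow, rather than repeating the same KeyError in the logs.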
Additional context and error message [1]
INFO:reqmgrInteraction:Splitting /haozturk_ACDC0_Run2022D_BTagMu_10Dec2022_221221_171345_19/DataProcessingMergeDQMoutputEndOfRunDQMHarvestMerged with policy name ResubmitBlock and policy params {'name': 'ResubmitBlock', 'args': {}}
ERROR:reqmgrInteraction:Exception splitting wqe haozturk_ACDC0_Run2022D_BTagMu_10Dec2022_221221_171345_19 for haozturk_ACDC0_Run2022D_BTagMu_10Dec2022_221221_171345_19: 'NumberOfRuns'
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/WMCore/WorkQueue/WorkQueue.py", line 1133, in processInboundWork
    work, rejectedWork, badWork = self._splitWork(inbound['WMSpec'], data=inbound['Inputs'],
  File "/usr/local/lib/python3.8/site-packages/WMCore/WorkQueue/WorkQueue.py", line 1073, in _splitWork
    units, rejectedWork, badWork = policy(spec, topLevelTask, data, mask, continuous=continuous)
  File "/usr/local/lib/python3.8/site-packages/WMCore/WorkQueue/Policy/Start/StartPolicyInterface.py", line 160, in __call__
    self.split()
  File "/usr/local/lib/python3.8/site-packages/WMCore/WorkQueue/Policy/Start/ResubmitBlock.py", line 70, in split
    Jobs=ceil(float(block[self.args['SliceType']]) /
KeyError: 'NumberOfRuns'
ERROR:reqmgrInteraction:Unknown error processing haozturk_ACDC0_Run2022D_BTagMu_10Dec2022_221221_171345_19
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/WMCore/WorkQueue/WorkQueueReqMgrInterface.py", line 108, in queueNewRequests
    units = queue.queueWork(workLoadUrl, request=reqName, team=team)
  File "/usr/local/lib/python3.8/site-packages/WMCore/WorkQueue/WorkQueue.py", line 646, in queueWork
    work = self.processInboundWork(inbound, throw=True)
  File "/usr/local/lib/python3.8/site-packages/WMCore/WorkQueue/WorkQueue.py", line 1133, in processInboundWork
    work, rejectedWork, badWork = self._splitWork(inbound['WMSpec'], data=inbound['Inputs'],
  File "/usr/local/lib/python3.8/site-packages/WMCore/WorkQueue/WorkQueue.py", line 1073, in _splitWork
    units, rejectedWork, badWork = policy(spec, topLevelTask, data, mask, continuous=continuous)
  File "/usr/local/lib/python3.8/site-packages/WMCore/WorkQueue/Policy/Start/StartPolicyInterface.py", line 160, in __call__
    self.split()
  File "/usr/local/lib/python3.8/site-packages/WMCore/WorkQueue/Policy/Start/ResubmitBlock.py", line 70, in split
    Jobs=ceil(float(block[self.args['SliceType']]) /
KeyError: 'NumberOfRuns'