ACDC failures due to missing policy args - DQMHarvest not supported

Open todor-ivanov opened this issue 1 year ago • 7 comments

Impact of the bug: GlobalWorkQueue

Describe the bug: While going through the Global WorkQueue logs I stumbled on the following exception [1]. At first glance, it happens only for ACDC workflows that rely on default policy parameters:

INFO:reqmgrInteraction:Splitting /haozturk_ACDC0_Run2022D_BTagMu_10Dec2022_221221_171345_19/DataProcessingMergeDQMoutputEndOfRunDQMHarvestMerged with policy name ResubmitBlock and policy params {'name': 'ResubmitBlock', 'args': {}}

These parameters are set here: https://github.com/dmwm/WMCore/blob/d89fc9ddee0e405e09f1deaa4fdeb895bf445947/src/python/WMCore/WorkQueue/WorkQueue.py#L1063

These types of ACDCs end up referring to ResubmitBlock: https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/WorkQueue/Policy/Start/ResubmitBlock.py

NOTE: Although the traceback reports the error from File "/usr/local/lib/python3.8/site-packages/WMCore/WorkQueue/Policy/Start/StartPolicyInterface.py", line 160, in __call__ self.split(), the ResubmitBlock policy overrides the default self.split() method from https://github.com/dmwm/WMCore/blob/d89fc9ddee0e405e09f1deaa4fdeb895bf445947/src/python/WMCore/WorkQueue/Policy/Start/StartPolicyInterface.py#L48

with its own:

https://github.com/dmwm/WMCore/blob/d89fc9ddee0e405e09f1deaa4fdeb895bf445947/src/python/WMCore/WorkQueue/Policy/Start/ResubmitBlock.py#L50

for which no NumberOfRuns key is set. How we end up referring to this key in the policy object is still a mystery to me.
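To make the failure mode concrete, here is a minimal sketch of what the traceback in [1] implies: the lookup `self.args['SliceType']` succeeds and yields 'NumberOfRuns', and it is the subsequent `block[...]` lookup that raises. Everything in the snippet is a hypothetical stand-in, not the actual WMCore code: the dictionary contents, the `estimate_jobs` helper, and the `SliceSize` divisor (the traceback cuts off before the divisor) are assumptions for illustration only.

```python
from math import ceil

# Hypothetical policy arguments: SliceType resolves to 'NumberOfRuns',
# which is what the KeyError in the traceback implies. SliceSize is assumed.
policy_args = {"SliceType": "NumberOfRuns", "SliceSize": 1}

# Hypothetical ACDC block summary that carries no 'NumberOfRuns' entry.
block = {"NumberOfFiles": 10, "NumberOfEvents": 1000, "NumberOfLumis": 50}


def estimate_jobs(block, args):
    """Mimic the shape of the failing expression (ResubmitBlock.py, line 70)."""
    return ceil(float(block[args["SliceType"]]) / float(args["SliceSize"]))


try:
    estimate_jobs(block, policy_args)
    missing_key = None
except KeyError as exc:
    # The exception carries the key missing from `block`, not from `args`.
    missing_key = exc.args[0]

print(missing_key)
```

If the sketch matches the real situation, the open question narrows to where SliceType gets set to 'NumberOfRuns' when the request itself passes empty args.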

In addition, those workflows are constantly retried and are filling the GWQ logs.

How to reproduce it: I have not yet figured out the full set of ACDC parameters that triggers this behavior.

Expected behavior: All possible ACDC parameters should be mapped properly to the given policy.
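One possible shape for such a mapping check, purely as a sketch and not a proposal for the actual fix: validate the slice type against the keys the block really provides before splitting, and fail with a descriptive message. The `check_slice_type` helper and the dictionary layout are hypothetical and do not exist in WMCore.

```python
def check_slice_type(block, args):
    """Return the slice type if the block can serve it, else raise a descriptive error.

    Hypothetical helper: `block` and `args` are plain dicts standing in for
    the ACDC block summary and the policy arguments.
    """
    slice_type = args.get("SliceType")
    if slice_type not in block:
        raise ValueError(
            "SliceType %r is not available in the block; known keys: %s"
            % (slice_type, sorted(block))
        )
    return slice_type
```

A check of this kind would turn the bare KeyError into a message that names both the requested slice type and the keys actually present in the block.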

Additional context and error message: [1]

INFO:reqmgrInteraction:Splitting /haozturk_ACDC0_Run2022D_BTagMu_10Dec2022_221221_171345_19/DataProcessingMergeDQMoutputEndOfRunDQMHarvestMerged with policy name ResubmitBlock and policy params {'name': 'ResubmitBlock', 'args': {}}
ERROR:reqmgrInteraction:Exception splitting wqe haozturk_ACDC0_Run2022D_BTagMu_10Dec2022_221221_171345_19 for haozturk_ACDC0_Run2022D_BTagMu_10Dec2022_221221_171345_19: 'NumberOfRuns'
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/WMCore/WorkQueue/WorkQueue.py", line 1133, in processInboundWork
    work, rejectedWork, badWork = self._splitWork(inbound['WMSpec'], data=inbound['Inputs'],
  File "/usr/local/lib/python3.8/site-packages/WMCore/WorkQueue/WorkQueue.py", line 1073, in _splitWork
    units, rejectedWork, badWork = policy(spec, topLevelTask, data, mask, continuous=continuous)
  File "/usr/local/lib/python3.8/site-packages/WMCore/WorkQueue/Policy/Start/StartPolicyInterface.py", line 160, in __call__
    self.split()
  File "/usr/local/lib/python3.8/site-packages/WMCore/WorkQueue/Policy/Start/ResubmitBlock.py", line 70, in split
    Jobs=ceil(float(block[self.args['SliceType']]) /
KeyError: 'NumberOfRuns'
ERROR:reqmgrInteraction:Unknown error processing haozturk_ACDC0_Run2022D_BTagMu_10Dec2022_221221_171345_19
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/WMCore/WorkQueue/WorkQueueReqMgrInterface.py", line 108, in queueNewRequests
    units = queue.queueWork(workLoadUrl, request=reqName, team=team)
  File "/usr/local/lib/python3.8/site-packages/WMCore/WorkQueue/WorkQueue.py", line 646, in queueWork
    work = self.processInboundWork(inbound, throw=True)
  File "/usr/local/lib/python3.8/site-packages/WMCore/WorkQueue/WorkQueue.py", line 1133, in processInboundWork
    work, rejectedWork, badWork = self._splitWork(inbound['WMSpec'], data=inbound['Inputs'],
  File "/usr/local/lib/python3.8/site-packages/WMCore/WorkQueue/WorkQueue.py", line 1073, in _splitWork
    units, rejectedWork, badWork = policy(spec, topLevelTask, data, mask, continuous=continuous)
  File "/usr/local/lib/python3.8/site-packages/WMCore/WorkQueue/Policy/Start/StartPolicyInterface.py", line 160, in __call__
    self.split()
  File "/usr/local/lib/python3.8/site-packages/WMCore/WorkQueue/Policy/Start/ResubmitBlock.py", line 70, in split
    Jobs=ceil(float(block[self.args['SliceType']]) /
KeyError: 'NumberOfRuns'

todor-ivanov, Mar 08 '23 10:03