WMCore
WMCore copied to clipboard
JobAccountant - component failure due to a thrown SQL exception from a single workflow
Impact of the bug WMagent - JobAccountant
Describe the bug
During the work of migration to storage.json
for stage in
and stage out
, we have stumbled on a failure [1] of the JobAccountant component which was caused due to a broken SQL query in one of the steps of the component's cycle.
This was firstly reported here: https://github.com/dmwm/WMCore/pull/11790#issuecomment-1855343294
Even though the origin of the error is in a change that has left a WMBS file record without location
, the effect is definitely undesired - the component completely failed. Even more, once the workflow itself has been aborted, the records in WMBS were still remaining and the component continued crashing. The place in JobAccountant, where this exception is thrown is here:
https://github.com/dmwm/WMCore/blob/87ea437f508308b5e1d544a234a80e2a9911d5c1/src/python/WMComponent/JobAccountant/AccountantWorker.py#L836-L867
How to reproduce it By breaking a file record in WMBS
Expected behavior The component should not break due to a single broken record in the database. Instead:
- The error should be properly recorded/logged
- The workflow to which this broken record relates to be blocked to progress
- A proper alarm should be fired and reach the Operations Team in order to start an investigation
Additional context and error message [1]
2023-12-13 18:46:19,912:140004852008704:WARNING:AccountantWorker:The following file does not have any location:
{'acquisitionEra': 'DMWM_Test',
'branch_hash': 'eea4c0314f37f02ed8c1189b31abed42',
'branches': [],
'catalog': '',
'checksums': {'adler32': '95f78c91', 'cksum': '519409673'},
'configURL': 'https://cmsweb.cern.ch/couchdb;;reqmgr_config_cache;;a8fc2cf8e27cd8dd50cfde908ca8115f',
'dataset': {'applicationName': 'cmsRun',
'applicationVersion': 'CMSSW_12_4_14_patch1',
'dataTier': 'MINIAOD',
'primaryDataset': 'MuonEG',
'processedDataset': 'DMWM_Test-ReReco_Run2022C_LumiMask_StageOutTest_Todor_v4-v11'},
'events': 191,
'fileRef': '<WMCore.Configuration.ConfigSection object at 0x7f556af7d610>',
'first_event': 0,
'globalTag': '124X_dataRun3_v15',
'guid': '439838d7-40d1-480e-bc5c-d2d1f404d250',
'input': ['/store/data/Run2022C/MuonEG/RAW/v1/000/355/865/00000/7c5c1c3c-178a-41c5-92ad-73e3b6dd09c8.root'],
'inputPath': '/MuonEG/Run2022C-v1/RAW',
'inputpfns': ['root://cmsxrootd-site.fnal.gov//store/data/Run2022C/MuonEG/RAW/v1/000/355/865/00000/7c5c1c3c-178a-41c5-92ad-73e3b6dd09c8.root'],
'last_event': 0,
'lfn': '/store/unmerged/DMWM_Test/MuonEG/MINIAOD/ReReco_Run2022C_LumiMask_StageOutTest_Todor_v4-v11/00000/439838d7-40d1-480e-bc5c-d2d1f404d250.root',
'locations': set(),
'merged': False,
'module_label': 'MINIAODoutput',
'outputModule': 'MINIAODoutput',
'parents': set(),
'pfn': '/srv/job/WMTaskSpace/cmsRun1/MINIAODoutput.root',
'prep_id': 'TEST-ReReco-Run2022C-MuonEG-testlumi-00001',
'processingVer': 11,
'runs': '{<WMCore.DataStructs.Run.Run object at 0x7f556a7287c0>}',
'size': 13836561,
'validStatus': 'PRODUCTION'}
2023-12-13 18:46:19,912:140004852008704:WARNING:AccountantWorker:Job 78 , bad jobReport, failing job
2023-12-13 18:46:19,958:140004852008704:ERROR:AccountantWorker:Error while adding files to WMBS!
(MySQLdb.ProgrammingError) (1064, "You have an error in your SQL syntax; check the manual that corresponds to your MariaDB server version for the right syntax to use near ':location\n AND wfd.lfn = :lfn' at line 4")
[SQL: INSERT INTO wmbs_file_location (fileid, pnn)
SELECT wfd.id, wpnn.id
FROM wmbs_pnns wpnn, wmbs_file_details wfd
WHERE wpnn.pnn = :location
AND wfd.lfn = :lfn]
(Background on this error at: https://sqlalche.me/e/14/f405)
2023-12-13 18:46:20,057:140004852008704:ERROR:BaseWorkerThread:Error in worker algorithm (1):
Backtrace:
<WMComponent.JobAccountant.JobAccountantPoller.JobAccountantPoller object at 0x7f556b7f2b80> <@========== WMException Start ==========@>
Exception Class: AccountantWorkerException
Message: Error while adding files to WMBS!
(MySQLdb.ProgrammingError) (1064, "You have an error in your SQL syntax; check the manual that corresponds to your MariaDB server version for the right syntax to use near ':location\n AND wfd.lfn = :lfn' at line 4")
[SQL: INSERT INTO wmbs_file_location (fileid, pnn)
SELECT wfd.id, wpnn.id
FROM wmbs_pnns wpnn, wmbs_file_details wfd
WHERE wpnn.pnn = :location
AND wfd.lfn = :lfn]
(Background on this error at: https://sqlalche.me/e/14/f405)
ClassName : None
ModuleName : WMComponent.JobAccountant.AccountantWorker
MethodName : handleWMBSFiles
ClassInstance : None
FileName : /data/srv/wmagent/v2.2.5/sw/slc7_amd64_gcc630/cms/wmagentpy3/2.2.5/lib/python3.8/site-packages/WMComponent/JobAccountant/AccountantWorker.py
LineNumber : 867
ErrorNr : 0
Traceback:
File "/data/srv/wmagent/v2.2.5/sw/slc7_amd64_gcc630/cms/wmagentpy3/2.2.5/lib/python3.8/site-packages/WMComponent/JobAccountant/AccountantWorker.py", line 851, in handleWMBSFiles
self.setFileLocation.execute(lfn=fileLocations,
File "/data/srv/wmagent/v2.2.5/sw/slc7_amd64_gcc630/cms/wmagentpy3/2.2.5/lib/python3.8/site-packages/WMCore/WMBS/MySQL/Files/SetLocationByLFN.py", line 35, in execute
self.dbi.processData(self.sql, binds, conn=conn, transaction=transaction)
File "/data/srv/wmagent/v2.2.5/sw/slc7_amd64_gcc630/cms/wmagentpy3/2.2.5/lib/python3.8/site-packages/WMCore/Database/DBCore.py", line 148, in processData
r = self.executebinds(i, connection=connection,
File "/data/srv/wmagent/v2.2.5/sw/slc7_amd64_gcc630/cms/wmagentpy3/2.2.5/lib/python3.8/site-packages/WMCore/Database/MySQLCore.py", line 119, in executebinds
return DBInterface.executebinds(self, s, b, connection, returnCursor)
File "/data/srv/wmagent/v2.2.5/sw/slc7_amd64_gcc630/cms/wmagentpy3/2.2.5/lib/python3.8/site-packages/WMCore/Database/DBCore.py", line 62, in executebinds
resultProxy = connection.execute(s)
File "/data/srv/wmagent/v2.2.5/sw/slc7_amd64_gcc630/external/py3-sqlalchemy/1.4.40/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1365, in execute
return self._exec_driver_sql(
File "/data/srv/wmagent/v2.2.5/sw/slc7_amd64_gcc630/external/py3-sqlalchemy/1.4.40/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1669, in _exec_driver_sql
ret = self._execute_context(
File "/data/srv/wmagent/v2.2.5/sw/slc7_amd64_gcc630/external/py3-sqlalchemy/1.4.40/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1943, in _execute_context
self._handle_dbapi_exception(
File "/data/srv/wmagent/v2.2.5/sw/slc7_amd64_gcc630/external/py3-sqlalchemy/1.4.40/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 2124, in _handle_dbapi_exception
util.raise_(
File "/data/srv/wmagent/v2.2.5/sw/slc7_amd64_gcc630/external/py3-sqlalchemy/1.4.40/lib/python3.8/site-packages/sqlalchemy/util/compat.py", line 208, in raise_
raise exception
File "/data/srv/wmagent/v2.2.5/sw/slc7_amd64_gcc630/external/py3-sqlalchemy/1.4.40/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1900, in _execute_context
self.dialect.do_execute(
File "/data/srv/wmagent/v2.2.5/sw/slc7_amd64_gcc630/external/py3-sqlalchemy/1.4.40/lib/python3.8/site-packages/sqlalchemy/engine/default.py", line 732, in do_execute
cursor.execute(statement, parameters)
File "/data/srv/wmagent/v2.2.5/sw/slc7_amd64_gcc630/external/py3-mysqlclient/2.1.1/lib/python3.8/site-packages/MySQLdb/cursors.py", line 206, in execute
res = self._query(query)
File "/data/srv/wmagent/v2.2.5/sw/slc7_amd64_gcc630/external/py3-mysqlclient/2.1.1/lib/python3.8/site-packages/MySQLdb/cursors.py", line 319, in _query
db.query(q)
File "/data/srv/wmagent/v2.2.5/sw/slc7_amd64_gcc630/external/py3-mysqlclient/2.1.1/lib/python3.8/site-packages/MySQLdb/connections.py", line 254, in query
_mysql.connection.query(self, query)
<@---------- WMException End ----------@> File "/data/srv/wmagent/v2.2.5/sw/slc7_amd64_gcc630/cms/wmagentpy3/2.2.5/lib/python3.8/site-packages/WMCore/WorkerThreads/BaseWorkerThread.py", line 183, in __call__
tSpent, results, _ = algorithmWithDBExceptionHandler(parameters)
File "/data/srv/wmagent/v2.2.5/sw/slc7_amd64_gcc630/cms/wmagentpy3/2.2.5/lib/python3.8/site-packages/WMCore/Database/DBExceptionHandler.py", line 41, in wrapper
return f(*args, **kwargs)
File "/data/srv/wmagent/v2.2.5/sw/slc7_amd64_gcc630/cms/wmagentpy3/2.2.5/lib/python3.8/site-packages/Utils/Timers.py", line 57, in wrapper
res = func(*arg, **kw)
File "/data/srv/wmagent/v2.2.5/sw/slc7_amd64_gcc630/cms/wmagentpy3/2.2.5/lib/python3.8/site-packages/WMComponent/JobAccountant/JobAccountantPoller.py", line 70, in algorithm
self.accountantWorker(jobsSlice)
File "/data/srv/wmagent/v2.2.5/sw/slc7_amd64_gcc630/cms/wmagentpy3/2.2.5/lib/python3.8/site-packages/WMComponent/JobAccountant/AccountantWorker.py", line 247, in __call__
self.handleWMBSFiles(self.wmbsFilesToBuild, self.parentageBinds)
File "/data/srv/wmagent/v2.2.5/sw/slc7_amd64_gcc630/cms/wmagentpy3/2.2.5/lib/python3.8/site-packages/WMComponent/JobAccountant/AccountantWorker.py", line 867, in handleWMBSFiles
raise AccountantWorkerException(msg)
2023-12-13 18:46:20,057:140004852008704:INFO:Harness:>>>Terminating worker threads
@todor-ivanov Todor, I suspect this is actually a buggy job report with:
'locations': set(),
instead of an incomplete SQL statement. If there is something to fix, I would say we need to ensure that job reports report a non-empty location for the output files.
hi @amaltaro
Yes, I completely agree with you - The current reason is the a missing WMBS file location, which I already mentioned. And we even know the reason why this location for the current workflow was left empty. And this reason is not related to the current bug, but was rather a consequence of a separate development and a test in the context of the migration to the storage.json
. The current issue I created, because this whole set of DAO calls cited in the GH description, would inevitably lead to a complete crash of the component triggered by a single workflow (a file set with missing location this time or any other error of the sort in the future). To me, we should protect the component from interrupting under such scenario.
Yes, the approach we have been taking with the WMAgent components is that some errors are supposed to be soft while others are hard errors and a crash of the component is necessary to catch developers attention to properly investigate the issue. I agree that not having components crashing is the ultimate goal, but I fear that we won't pay enough attention to problems if we just treat everything as a soft error and leave developers with a sole notification of the error.
For this specific "no locations" issue, we used to have the same problem over the past many years and given the failure to properly identify and fix it at the root, we made this workaround: https://github.com/dmwm/WMCore/blob/master/src/python/WMComponent/JobAccountant/AccountantWorker.py#L504-L508
which I expected to catch this case as well.
In addition to the solution you provided, it's not clear to me what you suggest to be done with the potentially ill job? Would retry it until someone take care of that?