Cannot update CS after finding dead CS worker

Open chaen opened this issue 1 year ago • 0 comments

When updating the CS from the WebApp, we occasionally get the following error:

ERROR: ERROR: AutoMerge failed: Could not AutoMerge. Could not retrieve original committer's version
  File "/opt/dirac/versions/v11.0.52-1733134524/Linux-x86_64/lib/python3.11/threading.py", line 1002, in _bootstrap
    self._bootstrap_inner()
  File "/opt/dirac/versions/v11.0.52-1733134524/Linux-x86_64/lib/python3.11/threading.py", line 1045, in _bootstrap_inner
    self.run()
  File "/opt/dirac/versions/v11.0.52-1733134524/Linux-x86_64/lib/python3.11/threading.py", line 982, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/dirac/versions/v11.0.52-1733134524/Linux-x86_64/lib/python3.11/concurrent/futures/thread.py", line 83, in _worker
    work_item.run()
  File "/opt/dirac/versions/v11.0.52-1733134524/Linux-x86_64/lib/python3.11/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/opt/dirac/versions/v11.0.52-1733134524/Linux-x86_64/lib/python3.11/site-packages/DIRAC/Core/DISET/private/Service.py", line 349, in _processInThread
    result = self._processProposal(trid, proposalTuple, handlerObj)
  File "/opt/dirac/versions/v11.0.52-1733134524/Linux-x86_64/lib/python3.11/site-packages/DIRAC/Core/DISET/private/Service.py", line 536, in _processProposal
    result = self._executeAction(trid, proposalTuple, handlerObj)
  File "/opt/dirac/versions/v11.0.52-1733134524/Linux-x86_64/lib/python3.11/site-packages/DIRAC/Core/DISET/private/Service.py", line 556, in _executeAction
    response = handlerObj._rh_executeAction(proposalTuple)
  File "/opt/dirac/versions/v11.0.52-1733134524/Linux-x86_64/lib/python3.11/site-packages/DIRAC/Core/DISET/RequestHandler.py", line 120, in _rh_executeAction
    retVal = self.__doRPC(actionTuple[1])
  File "/opt/dirac/versions/v11.0.52-1733134524/Linux-x86_64/lib/python3.11/site-packages/DIRAC/Core/DISET/RequestHandler.py", line 251, in __doRPC
    return self.__RPCCallFunction(method, args)
  File "/opt/dirac/versions/v11.0.52-1733134524/Linux-x86_64/lib/python3.11/site-packages/DIRAC/Core/DISET/RequestHandler.py", line 292, in __RPCCallFunction
    uReturnValue = oMethod(*args)
  File "/opt/dirac/versions/v11.0.52-1733134524/Linux-x86_64/lib/python3.11/site-packages/DIRAC/ConfigurationSystem/Service/ConfigurationHandler.py", line 71, in export_commitNewData
    return gServiceInterface.updateConfiguration(sData, credDict["username"])
  File "/opt/dirac/versions/v11.0.52-1733134524/Linux-x86_64/lib/python3.11/site-packages/DIRAC/ConfigurationSystem/private/ServiceInterfaceBase.py", line 219, in updateConfiguration
    return S_ERROR(f"AutoMerge failed: {result['Message']}")

This is due to the CS not finding a correct backup in https://github.com/DIRACGrid/DIRAC/blob/c4b7a6e009e03570cecfff2b8499356d6e9551e7/src/DIRAC/ConfigurationSystem/private/ServiceInterfaceBase.py#L267

This method finds the latest backup by looking for the zip files whose names contain the date found in the client's DIRAC/Configuration/Version. This version is distributed to the client by the CS, so there's no real reason it would be wrong, except when a slave is found dead. In that case, a new version is generated:

@400000006756768d1d106ff4.s-57426-2024-12-09 04:46:50 UTC Configuration/Server [140072925378112] WARN: Found dead slave dips://speen.nikhef.nl:9135/Configuration/Server
@400000006756768d1d106ff4.s:57428:2024-12-09 04:46:51 UTC Configuration/Server [140072925378112] INFO: Generated new version 2024-12-09 04:46:51.020183

But this version is never actually committed (and we do not want it to be), so there is no backup file corresponding to that date.
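
To make the failure mode concrete, here is a minimal standalone sketch of a date-based backup lookup. It is not the DIRAC implementation: the function names, the `cs.<stamp>.zip` file naming and the way the version string is turned into a file-name stamp are all assumptions made for illustration. It only shows why a version that was generated but never committed cannot be matched to any backup.

```python
import tempfile
from pathlib import Path
from typing import Optional


def version_to_stamp(version: str) -> str:
    """Turn a version string into a file-name-safe stamp (hypothetical convention)."""
    # Drop the microseconds, then replace characters that are awkward in file names.
    return version.split(".")[0].replace(" ", "_").replace(":", "-")


def find_backup(backup_dir: Path, version: str) -> Optional[Path]:
    """Return the latest zip whose name contains the version's stamp, or None."""
    stamp = version_to_stamp(version)
    matches = sorted(p for p in backup_dir.glob("*.zip") if stamp in p.name)
    return matches[-1] if matches else None


def demo() -> tuple:
    with tempfile.TemporaryDirectory() as tmp:
        backup_dir = Path(tmp)

        # A version that was really committed, so a backup archive exists for it.
        committed = "2024-12-08 10:15:00.000000"
        (backup_dir / f"cs.{version_to_stamp(committed)}.zip").touch()

        # The version generated after the dead slave was found: no commit, no archive.
        regenerated = "2024-12-09 04:46:51.020183"

        found = find_backup(backup_dir, committed)
        missing = find_backup(backup_dir, regenerated)
        return (found.name if found else None, missing)


if __name__ == "__main__":
    print(demo())
```

In this sketch the lookup for the committed version succeeds, while the lookup for the regenerated version returns `None`, which corresponds to the "Could not retrieve original committer's version" situation described above.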

chaen avatar Dec 09 '24 09:12 chaen