MJF in TimeLeft

Open iueda opened this issue 2 years ago • 5 comments

See https://ggus.eu/index.php?mode=ticket_info&ticket_id=162431. DESY claims that our DIRAC pilots do not respect the MACHINEFEATURES/shutdowntime they set, referring to documents dating from early 2016:

  • https://hepsoftwarefoundation.org/notes/HSF-TN-2016-02.pdf
  • https://twiki.cern.ch/twiki/bin/view/LCG/WMTEGEnvironmentVariables

Looking into the code: https://github.com/DIRACGrid/DIRAC/blob/1b36402acabde8d2295dfe66d9a75f8cdbfd34d7/src/DIRAC/Resources/Computing/BatchSystems/TimeLeft/TimeLeft.py#L146

MJFResourceUsage seems to be used only when the batch system is unknown. Is that correct? The pilots running at DESY find that the batch system is HTCondor, and the log reads:

MaxRuntime attribute is not supported
Could not determine timeleft for batch system at site LCG.DESY.de
CPUTime for /Resources/Sites/LCG/LCG.DESY.de/CEs/grid-htcondorce0.desy.de/Queues/htcondorce-condor: 216000.000000

There have been some discussions in the past:

  • https://github.com/DIRACGrid/DIRAC/issues/4544 JobAgent TimeLeft computation: definitions, multi-core environments, batch system based on wallclock time
  • https://github.com/DIRACGrid/DIRAC/issues/4788 HTCondor TimeLeft module

If MaxRuntime is not available (most HTCondor queues are concerned), setting MaxCPUTime (not too high) should be sufficient.

Is MJF intentionally not used by the pilot jobs on HTCondor? Not only for getting the wallclock time limit, but even for downtime?

iueda avatar Jun 22 '23 05:06 iueda

Before investigating the code, I am surprised that:

  • MJF is still deployed somewhere, although it is completely unsupported, and the last time I checked it didn't even have a working Python 3 version;
  • they ask you to comply with that!
  • LHCb (and other DIRAC users) also run at DESY and we didn't get such complaints.

fstagni avatar Jun 22 '23 07:06 fstagni

LHCb was also ticketed: https://ggus.eu/index.php?mode=ticket_info&ticket_id=162429

chrisburr avatar Jun 22 '23 08:06 chrisburr

MJF is used when nothing else is found (as far as TimeLeft is concerned). So it basically won't be used when there is a known batch system. When this was initially coded we thought of switching the priority once MJF had been deployed ~everywhere, but this never happened and the MJF project died a slow death. I will reply in LHCb's ticket.
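
A minimal sketch of the fallback order described above, for illustration only. The marker table and the function name `choose_time_left_source` are hypothetical and are not taken from DIRAC's TimeLeft.py; the only point is that MJF is consulted last, after every known batch system has been ruled out.

```python
import os

# Hypothetical table: environment variable -> batch system name.
# Illustrative only; this is NOT the table used by DIRAC.
BATCH_MARKERS = {
    "_CONDOR_JOB_AD": "HTCondor",
    "SLURM_JOB_ID": "SLURM",
    "PBS_JOBID": "PBS",
}


def choose_time_left_source(env=None):
    """Return the name of the source to use for the time-left estimate."""
    env = os.environ if env is None else env

    # A known batch system always wins.
    for marker, name in BATCH_MARKERS.items():
        if marker in env:
            return name

    # MJF is only a fallback, used when no batch system was recognised.
    if "MACHINEFEATURES" in env or "JOBFEATURES" in env:
        return "MJF"

    return None
```

With this ordering, a pilot on an HTCondor worker node that also publishes MACHINEFEATURES never reads the MJF values, which matches what is seen at DESY.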

fstagni avatar Jun 22 '23 12:06 fstagni

Just to say that in the UK it's not used, apparently not even by Manchester, who invented it.

marianne013 avatar Jun 22 '23 13:06 marianne013

I have read https://ggus.eu/index.php?mode=ticket_info&ticket_id=162429 and https://twiki.cern.ch/twiki/bin/view/LCG/WLCGOpsMinutes230608#HEPScore_status_update

I understand MJF was abandoned because the "numbers being published on the WNs are too unreliable in practice", but I suppose those numbers were the "benchmarking and the CPU and WallClock time available to the job" (#4544).

Maybe it would still be worthwhile to respect the "downtime" value, since it is not usually filled in?
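
To make the suggestion concrete, here is a rough sketch of reading only the shutdown time, independently of the batch system. It assumes, following HSF-TN-2016-02, that `$MACHINEFEATURES` points to a local directory containing a `shutdowntime` file holding a UNIX timestamp in seconds. The function name `seconds_until_shutdown` is hypothetical and this is not existing DIRAC code; the case where `$MACHINEFEATURES` is a URL is not handled.

```python
import os
import time


def seconds_until_shutdown(env=None, now=None):
    """Return the seconds left before the published shutdown time, or None."""
    env = os.environ if env is None else env
    now = time.time() if now is None else now

    base = env.get("MACHINEFEATURES")
    if not base:
        return None  # MJF is not published on this node

    # Assumption: a local directory; URL-based MJF is not handled here.
    path = os.path.join(base, "shutdowntime")
    try:
        with open(path) as handle:
            shutdown = int(handle.read().strip())
    except (OSError, ValueError):
        return None  # file missing or unparsable: no shutdown announced

    return max(0, shutdown - now)
```

A pilot could take the minimum of this value (when it is not None) and whatever the batch system reports, so that an announced shutdown shortens the time left but never extends it.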

iueda avatar Jun 26 '23 12:06 iueda