DIRAC
MJF in TimeLeft
See https://ggus.eu/index.php?mode=ticket_info&ticket_id=162431. DESY claims that our DIRAC pilots do not respect the MACHINEFEATURES/shutdowntime value they set, referring to documents dated early 2016: https://hepsoftwarefoundation.org/notes/HSF-TN-2016-02.pdf and https://twiki.cern.ch/twiki/bin/view/LCG/WMTEGEnvironmentVariables
Looking into the code at https://github.com/DIRACGrid/DIRAC/blob/1b36402acabde8d2295dfe66d9a75f8cdbfd34d7/src/DIRAC/Resources/Computing/BatchSystems/TimeLeft/TimeLeft.py#L146, MJFResourceUsage seems to be used only when the batch system is unknown. Is that correct? The pilots running at DESY detect HTCondor as the batch system, and the log reads:
MaxRuntime attribute is not supported
Could not determine timeleft for batch system at site LCG.DESY.de
CPUTime for /Resources/Sites/LCG/LCG.DESY.de/CEs/grid-htcondorce0.desy.de/Queues/htcondorce-condor: 216000.000000
There have been some discussions in the past:
- https://github.com/DIRACGrid/DIRAC/issues/4544 (JobAgent TimeLeft computation: definitions, multi-core environments, batch system based on wallclock time)
- https://github.com/DIRACGrid/DIRAC/issues/4788 (HTCondor TimeLeft module)
If MaxRuntime is not available (which is the case for most HTCondor queues), setting MaxCPUTime (not too high) should be sufficient.
Is it intentional that pilot jobs on HTCondor do not use MJF, not only for the wallclock time limit but even for the downtime?
Before investigating the code, I am surprised that:
- MJF is still deployed somewhere, even though it is completely unsupported; the last time I checked, it didn't even have a working Python 3 version;
- they ask you to comply with that!
- LHCb (and other DIRAC users) also run at DESY and we didn't get such complaints.
LHCb was also ticketed: https://ggus.eu/index.php?mode=ticket_info&ticket_id=162429
MJF is used when nothing else is found (as far as TimeLeft is concerned). So it basically won't be used when there's a known batch system. When this was initially coded we thought of switching the priority once MJF had been deployed ~everywhere, but this never happened and the MJF project died a slow death. I will reply in LHCb's ticket.
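To make the priority described above concrete, here is a minimal sketch of that selection order. It is not the actual TimeLeft code: the function name `pick_time_source` and the `BATCH_MARKERS` table are hypothetical, and the environment variables listed are only illustrative markers for each batch system.

```python
import os

# Hypothetical marker table for illustration only; the real TimeLeft module
# has its own detection logic and its own list of batch systems.
BATCH_MARKERS = {
    "_CONDOR_JOB_AD": "HTCondor",
    "SLURM_JOB_ID": "SLURM",
    "PBS_JOBID": "PBS",
}


def pick_time_source(environ=None):
    """Return the name of the source that would be asked for the time left.

    A known batch system always wins; MJF is only a fallback.
    """
    env = os.environ if environ is None else environ

    # 1. Any recognised batch system takes priority.
    for marker, name in BATCH_MARKERS.items():
        if env.get(marker):
            return name

    # 2. Only when no batch system is recognised is MJF considered.
    if env.get("MACHINEFEATURES") or env.get("JOBFEATURES"):
        return "MJF"

    # 3. Nothing usable was found.
    return None
```

With this ordering, a worker node that exposes both an HTCondor job ad and MACHINEFEATURES resolves to HTCondor, which matches what the DESY pilots report.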
Just to say that in the UK it's not used, apparently not even by Manchester who invented it.
I have read https://ggus.eu/index.php?mode=ticket_info&ticket_id=162429, and https://twiki.cern.ch/twiki/bin/view/LCG/WLCGOpsMinutes230608#HEPScore_status_update
I understand MJF was abandoned because the "numbers being published on the WNs are too unreliable in practice", but I suppose those numbers were the "benchmarking and the CPU and WallClock time available to the job" (#4544).
Maybe it is still worthwhile to respect the "downtime", since it would usually not be filled in?
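For reference, a check of the shutdown time alone could be quite small. The sketch below assumes that $MACHINEFEATURES points to a local directory and that the `shutdowntime` file holds a UNIX timestamp in seconds, as described in HSF-TN-2016-02; the function name `seconds_until_shutdown` is hypothetical, and the case where $MACHINEFEATURES is a URL is not handled.

```python
import os
import time


def seconds_until_shutdown(environ=None, now=None):
    """Return the seconds left before the announced shutdown, or None.

    None means no shutdown time is published (the usual case) or the
    value could not be read.
    """
    env = os.environ if environ is None else environ
    mf_dir = env.get("MACHINEFEATURES")
    if not mf_dir:
        return None

    # Assumption: local directory layout, one file per key.
    path = os.path.join(mf_dir, "shutdowntime")
    try:
        with open(path) as handle:
            shutdown = int(handle.read().strip())
    except (OSError, ValueError):
        # Missing, unreadable, or non-numeric file: treat as "not set".
        return None

    current = time.time() if now is None else now
    return max(0, shutdown - int(current))
```

A caller could take the minimum of this value and whatever the batch system or the queue CPUTime gives, and ignore it whenever it is None.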