DIRAC
Jobs cannot be removed by JobCleaningAgent if user DN changes
Since v7r2p5 (see this commit) jobs are removed using the DN and group of the submitting user.
This causes jobs to get stuck in the Deleted state if a user changes their DN. Presumably similar issues can occur if they change group.
I don't see an obvious way out here. The problem is that the mentioned commit was necessary, IIRC, to ensure removal through RMS requests (when necessary). The question is whether this is an issue that we want to solve via code.
Doesn't this need to be changed to do things by nickname instead of DN given that DNs will cease to exist at some point?
Yes, this is something that we are envisaging to do from the 8.0 release (https://github.com/DIRACGrid/DIRAC/issues/4486), but there's quite some work to be done in that direction anyway, as DNs are quite widespread, not only in WMS. This is not work for a v7r2 patch.
It's actually much worse than this! The agent will create a (buggy) removal request at every loop, because it creates the removal requests before attempting to delete the job. https://github.com/DIRACGrid/DIRAC/pull/5414
After discussion in the LHCb Ops meeting, one more thing to be done: the agent should try to get the user proxy. If it fails for a temporary reason, we wait. If it keeps failing for longer, do the actions with the server certificate.
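To illustrate the repeated-request problem described above, here is a minimal sketch of one way to avoid it: remember which jobs already have a removal request and skip creating another one on later cycles. This is not the JobCleaningAgent code and not necessarily what the linked PR does. `clean_cycle`, `pending_requests`, `create_request` and `delete_job` are all hypothetical names invented for the illustration.

```python
def clean_cycle(job_ids, pending_requests, create_request, delete_job):
    """Run one cleaning cycle over job_ids.

    pending_requests: dict mapping job ID -> request ID, kept between cycles
    create_request:   hypothetical callable(job_id) -> request ID
    delete_job:       hypothetical callable(job_id) -> True if the job was removed
    """
    for job_id in job_ids:
        # Only create a removal request if none is recorded for this job yet,
        # so a job that cannot be deleted does not get a new request each loop.
        if job_id not in pending_requests:
            pending_requests[job_id] = create_request(job_id)
        if delete_job(job_id):
            # The job is gone, so stop tracking its request.
            pending_requests.pop(job_id, None)


# Usage: a job whose deletion always fails still gets only one request.
created = []


def fake_create_request(job_id):
    created.append(job_id)
    return len(created)


pending = {}
for _ in range(3):
    clean_cycle([42], pending, fake_create_request, lambda job_id: False)
```

In a real agent the record of pending requests would have to be persistent rather than an in-memory dict.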
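The wait-then-fall-back policy proposed above could be sketched roughly as follows. This only shows the decision logic and is not DIRAC code: `choose_action`, `Action` and the 24-hour `GRACE_PERIOD` are assumptions made up for the example, and how the agent would actually fetch the proxy or record the first failure time is left open.

```python
from enum import Enum


class Action(Enum):
    USE_USER_PROXY = "use_user_proxy"
    WAIT = "wait"
    USE_SERVER_CERT = "use_server_cert"


# Hypothetical value: how long proxy retrieval may keep failing before
# the failure stops being treated as temporary.
GRACE_PERIOD = 24 * 3600  # seconds


def choose_action(proxy_ok, first_failure, now, grace_period=GRACE_PERIOD):
    """Decide how to handle a job's removal in this cycle.

    proxy_ok:      whether the user proxy could be obtained this cycle
    first_failure: timestamp of the first failed attempt, or None
    now:           current timestamp
    Returns (action, first_failure) so the caller can store the timestamp.
    """
    if proxy_ok:
        # Proxy available: act as the user and reset the failure record.
        return Action.USE_USER_PROXY, None
    if first_failure is None:
        # First failure seen for this job: start the clock.
        first_failure = now
    if now - first_failure < grace_period:
        # Possibly temporary: leave the job for a later cycle.
        return Action.WAIT, first_failure
    # Failing for too long: fall back to the server certificate.
    return Action.USE_SERVER_CERT, first_failure
```

Keeping the first-failure timestamp per job (or per user) would also let the agent skip the proxy lookup entirely once the grace period has passed, instead of retrying it on every cycle.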
After some investigation, I decided to move this issue to 8.0. There's no easy and clean way to solve this right now, as it needs #4486
This has come up again due to the summer students leaving and being suspended in VOMS very quickly after they finish working. It also causes the JobCleaningAgent to become very slow due to the many failover attempts.