
Jobs cannot be removed by JobCleaningAgent if user DN changes

Open chrisburr opened this issue 4 years ago • 7 comments

Since v7r2p5 (see this commit) jobs are removed using the DN and group of the submitting user.

This causes jobs to get stuck in the Deleted state if a user changes their DN. Presumably similar issues can occur if they change group.

chrisburr avatar Aug 24 '21 07:08 chrisburr

I don't see an obvious way out here. The problem is that the mentioned commit was necessary, IIRC, to ensure removal through RMS requests (when necessary). The question is whether this is an issue that we want to solve via code.

fstagni avatar Sep 13 '21 10:09 fstagni

Doesn't this need to be changed to do things by nickname instead of DN given that DNs will cease to exist at some point?

chrisburr avatar Sep 13 '21 11:09 chrisburr

Yes, this is something that we are envisaging doing from the 8.0 release (https://github.com/DIRACGrid/DIRAC/issues/4486), but there is quite some work to be done in that direction anyway, as DNs are quite widespread, not only in WMS. This is not work for a v7r2 patch.

fstagni avatar Sep 17 '21 12:09 fstagni

It's actually much worse than this! The agent will create a (buggy) removal request at every loop, because it creates the removal requests before attempting to delete the job. https://github.com/DIRACGrid/DIRAC/pull/5414

chaen avatar Sep 17 '21 12:09 chaen

After discussion in the LHCb Ops meeting, one more thing to be done: the agent should try to get the user proxy. If it fails for a temporary reason, we wait. If it keeps failing for longer, we do the actions with the server certificate.

chaen avatar Oct 04 '21 10:10 chaen
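
The fallback policy described in the comment above could be sketched roughly as follows. This is only an illustration in plain Python, not DIRAC code: `ProxyFailureTracker`, `Credential`, `decide`, and the `grace_period` parameter are all hypothetical names, and how long "longer" should be is an assumption left to the caller.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict


class Credential(Enum):
    """What the agent should use for a given job owner (hypothetical)."""
    USER_PROXY = "user_proxy"    # proxy retrieved, act as the user
    WAIT = "wait"                # failure may be temporary, retry next cycle
    SERVER_CERT = "server_cert"  # failure persisted, fall back to the server


@dataclass
class ProxyFailureTracker:
    """Remembers when proxy retrieval first failed for each job owner."""
    grace_period: float  # seconds to keep waiting before falling back (assumed)
    first_failure: Dict[str, float] = field(default_factory=dict)

    def decide(self, owner: str, proxy_ok: bool, now: float) -> Credential:
        if proxy_ok:
            # Success clears any earlier failure record for this owner.
            self.first_failure.pop(owner, None)
            return Credential.USER_PROXY
        # Record the time of the first failure, keep it on later failures.
        started = self.first_failure.setdefault(owner, now)
        if now - started < self.grace_period:
            return Credential.WAIT
        return Credential.SERVER_CERT
```

An agent cycle would call `decide(owner, proxy_ok, time.time())` once per job owner and skip that owner's jobs while the answer is `Credential.WAIT`. Tracking the first failure per owner, rather than counting attempts, keeps the decision independent of how often the agent loops.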

After some investigation, I decided to move this issue to 8.0. There's no easy and clean way to solve this right now, as it needs #4486

fstagni avatar Nov 17 '21 09:11 fstagni

This has come up again due to the summer students leaving and being suspended in VOMS very quickly after they finish working. It also causes the JobCleaningAgent to become very slow due to the many failover attempts.

chrisburr avatar Sep 19 '22 10:09 chrisburr