
Files in transformations will stay assigned for ever

Open chaen opened this issue 2 years ago • 8 comments

This is quite an edge case. It can happen whenever you have a transformation that creates multiple operations on files.

Suppose the following request is created:

- req1: Waiting
  - op1: Waiting
    - LFN1: Waiting
    - LFN2: Waiting
  - op2: Waiting
    - LFN1: Waiting
    - LFN2: Waiting

If op1 fails for LFN1 but succeeds for LFN2, we will have the following:

- req1: Failed
  - op1: Failed
    - LFN1: Failed
    - LFN2: Done
  - op2: Waiting
    - LFN1: Waiting
    - LFN2: Waiting

If you then call getRequestFileStatus, you will get:

LFN1: Failed
LFN2: Waiting

And that's the best it can do really.
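The result above is consistent with an aggregation rule in which a Failed status is sticky and, otherwise, the status from the last operation seen wins. Below is a minimal sketch of that rule. It is an assumption made for illustration, not the actual getRequestFileStatus code: `aggregate_file_status` is a hypothetical helper, and plain dictionaries stand in for the operations of the request.

```python
def aggregate_file_status(operations):
    """Collapse per-operation file statuses into one status per LFN.

    Assumed rule (illustrative only): "Failed" is sticky; otherwise the
    status reported by the latest operation overwrites the previous one.
    """
    result = {}
    for op in operations:  # operations are given in execution order
        for lfn, status in op.items():
            if result.get(lfn) != "Failed":
                result[lfn] = status
    return result


# The request from the example above, one dict per operation.
op1 = {"LFN1": "Failed", "LFN2": "Done"}
op2 = {"LFN1": "Waiting", "LFN2": "Waiting"}

print(aggregate_file_status([op1, op2]))
# {'LFN1': 'Failed', 'LFN2': 'Waiting'}
```

With a single status per file, there is no way to tell that LFN2 is Waiting inside a request that will never run again.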

The problem then is when updating the FileTask status:

https://github.com/DIRACGrid/DIRAC/blob/1cd1173303b177dbfe44ede43d18d25514713301/src/DIRAC/TransformationSystem/Client/RequestTasks.py#L374-L378

We do not take into account the case of LFN2. It is very hard to know what to do. The Request is in a final state (it will never change again), but the file is not (it only went halfway through the process it has to follow). Setting it to Problematic is maybe an option... I think more and more that whenever we have a Transformation with multiple operations per request, we should have a flag saying whether the requests are reentrant or not (i.e. whether they can be re-run from the beginning at any point in time). If they are not, even resetting the task should be forbidden.

Any opinion, @andresailer?

chaen avatar Jul 24 '23 11:07 chaen

@sfayer @marianne013 as you are starting to use the TS, you may have an opinion too?

chaen avatar Jul 24 '23 11:07 chaen

I don't think that a file in (RMS) status Waiting should be set as TransformationFilesStatus.PROBLEMATIC: it just feels wrong.

I think we should instead check the status of the request first and, if it is Failed, set all files to TransformationFilesStatus.PROBLEMATIC.

fstagni avatar Jul 24 '23 13:07 fstagni

That's not correct either. Of course, we should take the request status into consideration. But some files can be properly finished, and some not. So setting everything to Problematic is not good. The real question is what to do with the files that are only halfway through.

chaen avatar Jul 24 '23 13:07 chaen

I don't see any other option but to set to TransformationFilesStatus.PROBLEMATIC those files that are in RMS status Waiting and whose request is in status Failed.

fstagni avatar Jul 24 '23 13:07 fstagni

I think this is the correct thing to do in general. The problem is that the requests might not be reentrant. In that case, we need to fix the request itself, not just reset the task. That's why I am thinking of adding this safety flag.
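To make the proposal concrete, here is a rough sketch of the mapping being discussed: a file still Waiting inside a Failed request becomes Problematic, and a task may only be reset when its request is flagged as reentrant. Everything here is hypothetical: the function names, the `reentrant` flag, and the use of plain strings in place of the TransformationFilesStatus constants are assumptions for illustration, not existing DIRAC code.

```python
def new_transformation_file_status(request_status, file_status):
    """Return the proposed transformation file status, or None for no change.

    Plain strings stand in for the RMS and TransformationFilesStatus values.
    """
    if file_status == "Done":
        return "Processed"
    if file_status == "Failed":
        return "Problematic"
    if request_status == "Failed":
        # The request is final, so a Waiting file will never progress.
        return "Problematic"
    return None  # request still running: leave the file as it is


def may_reset_task(reentrant):
    """Hypothetical safety flag: only reentrant requests may be re-run."""
    return bool(reentrant)


# LFN2 from the first example: Waiting inside a Failed request.
print(new_transformation_file_status("Failed", "Waiting"))   # Problematic
print(new_transformation_file_status("Waiting", "Waiting"))  # None
```

For a non-reentrant request, `may_reset_task(False)` returns False, which reflects the idea that the request itself has to be fixed rather than the task reset.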

chaen avatar Jul 25 '23 09:07 chaen

@andresailer another ping

chaen avatar Nov 02 '23 15:11 chaen

Sorry, I don't really have an opinion here. Anything that makes things more fault tolerant sounds good!

andresailer avatar Nov 03 '23 08:11 andresailer

I am adding an extra case here. Hopefully this issue will (soon) be consolidated into unit tests, and possibly a fix.

- req1: Waiting
  - op1: Waiting
    - LFN1: Done
    - LFN2: Waiting
  - op2: Done
    - LFN1: Done
    - LFN2: Done

Both LFN1 and LFN2 will be considered Done, while in fact only LFN1 is. This is the sort of thing that happens constantly with replicateAndRegister. It went unnoticed until now because the registration always worked, but it is absolutely possible that a file was marked as processed while it wasn't yet registered.
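As a starting point for such a unit test, here is a sketch of one possible stricter rule: a file counts as Done only if it is Done in every operation. This is an illustration under stated assumptions, not the fix adopted in DIRAC: `strict_file_status` is a hypothetical helper, dictionaries stand in for operations, and any status other than Failed or Done is simply reported as Waiting.

```python
def strict_file_status(operations):
    """Report a file as Done only when every operation has it Done."""
    per_lfn = {}
    for op in operations:
        for lfn, status in op.items():
            per_lfn.setdefault(lfn, []).append(status)

    result = {}
    for lfn, statuses in per_lfn.items():
        if "Failed" in statuses:
            result[lfn] = "Failed"
        elif all(status == "Done" for status in statuses):
            result[lfn] = "Done"
        else:
            result[lfn] = "Waiting"  # simplification: not finished everywhere
    return result


# The extra case above: op1 still has LFN2 Waiting, op2 is fully Done.
op1 = {"LFN1": "Done", "LFN2": "Waiting"}
op2 = {"LFN1": "Done", "LFN2": "Done"}

assert strict_file_status([op1, op2]) == {"LFN1": "Done", "LFN2": "Waiting"}
```

Under this rule LFN2 is no longer reported as Done before op1 has finished with it.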

chaen avatar Jun 25 '24 11:06 chaen