Files in transformations will stay assigned for ever
This is quite an edge case. It can happen whenever you have a transformation that creates multiple operations on files.
Suppose the following request is created:
- req1: Waiting
  - op1: Waiting
    - LFN1: Waiting
    - LFN2: Waiting
  - op2: Waiting
    - LFN1: Waiting
    - LFN2: Waiting
If op1 fails for LFN1 but succeeds for LFN2, we will have the following:
- req1: Failed
  - op1: Failed
    - LFN1: Failed
    - LFN2: Done
  - op2: Waiting
    - LFN1: Waiting
    - LFN2: Waiting
If you then call getRequestFileStatus, you will get:
LFN1: Failed
LFN2: Waiting
And that's the best it can do really.
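The collapse to a single status per file can be reproduced with a small stand-alone sketch. It assumes that a Failed status is kept once seen and that otherwise the status from the latest operation wins; this rule is an assumption chosen because it reproduces the output above, and `collapse_file_status` and the list-of-dicts layout are hypothetical stand-ins, not DIRAC APIs.

```python
def collapse_file_status(operations):
    """Collapse per-operation file statuses into one status per LFN.

    Assumed rule (not taken from the RMS code): once an LFN is seen as
    Failed it stays Failed; otherwise the latest operation's status wins.
    """
    collapsed = {}
    for files in operations:  # operations in execution order
        for lfn, status in files.items():
            if collapsed.get(lfn) != "Failed":
                collapsed[lfn] = status
    return collapsed


# req1 after op1 failed for LFN1 and succeeded for LFN2
req1 = [
    {"LFN1": "Failed", "LFN2": "Done"},      # op1
    {"LFN1": "Waiting", "LFN2": "Waiting"},  # op2
]
print(collapse_file_status(req1))
```

Under that assumption, the fact that LFN2 was Done in op1 is lost as soon as op2 is taken into account.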
The problem then is when updating the FileTask status:
https://github.com/DIRACGrid/DIRAC/blob/1cd1173303b177dbfe44ede43d18d25514713301/src/DIRAC/TransformationSystem/Client/RequestTasks.py#L374-L378
This code does not take into account the case of LFN2: a file that is still Waiting while its request is already Failed.
It is very hard to know what to do. The Request is in a final state (it will never change again), but the file is not (it only went halfway through the process it has to follow).
Setting it to Problematic may be an option...
I think more and more that whenever we have a Transformation with multiple operations per request, we should have a flag saying whether the requests are reentrant or not (i.e. whether you can re-run them from the beginning at any point in time). If they are not, even resetting the task should be forbidden.
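A rough sketch of what such a guard could look like. Everything here is hypothetical: the `TransformationInfo` class, the `reentrant` attribute and `can_reset_task` do not exist in DIRAC and only illustrate the idea.

```python
from dataclasses import dataclass


@dataclass
class TransformationInfo:
    """Hypothetical stand-in for a transformation definition."""

    name: str
    operations_per_request: int
    # Hypothetical flag: can the requests be re-run from the beginning?
    reentrant: bool = False


def can_reset_task(transformation):
    """Allow a task reset only when re-running the request is declared safe."""
    if transformation.operations_per_request <= 1:
        # The flag only matters for requests with multiple operations.
        return True
    return transformation.reentrant


print(can_reset_task(TransformationInfo("replication", operations_per_request=2)))
```

Defaulting the flag to `False` means a multi-operation transformation has to opt in explicitly before its tasks can be reset.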
Opinion @andresailer ?
@sfayer @marianne013 as you are starting to use the TS, you may have an opinion too?
I don't think that a file in (RMS) status Waiting should be set as TransformationFilesStatus.PROBLEMATIC: it just feels wrong.
I think we should instead first check the status of the request and, if it is Failed, set all files to TransformationFilesStatus.PROBLEMATIC.
That's not correct either. Of course, we should take the request status into consideration. But some files can be properly finished, and some not. So setting everything to problematic is not good. The real question is what to do with the files that are only half way through.
I don't see any other option but to set to TransformationFilesStatus.PROBLEMATIC those files that are in RMS status Waiting and whose request is in status Failed.
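As a sketch of that rule, assuming Done already maps to Processed and Failed to Problematic in the current update logic. The function name is hypothetical and the two constants are plain-string stand-ins for the TransformationFilesStatus values, so this is an illustration rather than a patch.

```python
PROCESSED = "Processed"      # stand-in for TransformationFilesStatus.PROCESSED
PROBLEMATIC = "Problematic"  # stand-in for TransformationFilesStatus.PROBLEMATIC


def transformation_file_updates(request_status, rms_file_status):
    """Map RMS file statuses to transformation file statuses.

    request_status: status of the whole request ("Waiting", "Failed", ...)
    rms_file_status: dict of LFN -> RMS file status
    """
    updates = {}
    for lfn, status in rms_file_status.items():
        if status == "Done":
            updates[lfn] = PROCESSED
        elif status == "Failed":
            updates[lfn] = PROBLEMATIC
        elif status == "Waiting" and request_status == "Failed":
            # The request is final, so this file will never progress.
            updates[lfn] = PROBLEMATIC
        # Any other combination is left untouched.
    return updates


print(transformation_file_updates("Failed", {"LFN1": "Failed", "LFN2": "Waiting"}))
```

With the first example of this issue, both LFN1 and LFN2 end up Problematic, while a Waiting file in a request that is still Waiting is not touched.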
I think this is the correct thing to do in general. The problem is that the requests might not be re-entrant. In that case, we need to fix the request itself, not just reset the task. That's why I am thinking of adding this safety flag.
@andresailer another ping
Sorry, I don't really have an opinion here. Anything that makes things more fault tolerant sounds good!
I am adding an extra case here. Hopefully this issue will (soon) consolidate into unit tests, and possibly a fix.
- req1: Waiting
  - op1: Waiting
    - LFN1: Done
    - LFN2: Waiting
  - op2: Done
    - LFN1: Done
    - LFN2: Done
Both LFN1 and LFN2 will be considered Done, while in fact only LFN1 is.
This is the sort of thing that happens constantly with replicateAndRegister. It went unnoticed until now because the registration always worked, but it is absolutely possible that a file was marked as processed while it was not yet registered.
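For the unit tests, one way to express the expected outcome of this extra case is to count a file as finished only when it is Done in every operation that lists it. A minimal sketch, with a list of dicts standing in for the request; the helper name and the layout are hypothetical, not DIRAC code.

```python
def fully_done(operations):
    """Return the LFNs that are Done in every operation that lists them."""
    seen = set()
    not_done = set()
    for files in operations:
        for lfn, status in files.items():
            seen.add(lfn)
            if status != "Done":
                not_done.add(lfn)
    return seen - not_done


# The extra case: LFN2 is still Waiting in op1 although op2 is Done.
req1 = [
    {"LFN1": "Done", "LFN2": "Waiting"},  # op1
    {"LFN1": "Done", "LFN2": "Done"},     # op2
]
print(sorted(fully_done(req1)))  # ['LFN1']
```

A test built on this would expect only LFN1 to be reported as Done for the request above.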