ganga
workerDir feature for IGangaFile
I would like to see a remoteSubDir attribute (or alternative) added to IGangaFile, for LocalFile, MassStorageFile and DiracFile.
If defined this should be used to move files into a relative path on the worker node at the start of jobs. In addition it should be used as part of the pattern matching for files when the job has finished.
This has been requested a few times (and may even be an open issue on github that I can't find) and should be fewer than 50 lines of code in the correct few places.
Adding this feature removes one major sticking point that prevents some power-users from migrating from 6.0.44, hence why I would very much like to see this in 6.1.19.
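The two behaviours requested above (moving an input file into a relative path on the worker node, and matching output files inside that path) could look roughly like the following. This is only a sketch using the standard library: the function names stage_input and collect_outputs and the worker_sub_dir argument are hypothetical and are not part of Ganga.

```python
import shutil
from pathlib import Path


def stage_input(work_dir, file_name, worker_sub_dir=""):
    """Move a delivered input file from work_dir into work_dir/worker_sub_dir."""
    source = Path(work_dir) / file_name
    target_dir = Path(work_dir) / worker_sub_dir
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / file_name
    if target != source:  # an empty worker_sub_dir means "leave it where it is"
        shutil.move(str(source), str(target))
    return target


def collect_outputs(work_dir, name_pattern, worker_sub_dir=""):
    """Match name_pattern inside work_dir/worker_sub_dir once the job is done."""
    return sorted((Path(work_dir) / worker_sub_dir).glob(name_pattern))
```

With an empty worker_sub_dir both functions fall back to the top of the working directory, so files that do not set the attribute would behave as they do today.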
I'd like to do this for 6.1.20 I think
The problem I came across when looking into this was differences between, e.g. LocalFile, DiracFile and MassStorageFile as well as differences between when these objects are used as input files or output files.
A LocalFile's localDir is where the file is copied from when it is an input file and copied to when it is an output file. remoteDir is always the subdir on the WN. That's simple as a LocalFile has two logical locations (user's computer and WN).
A DiracFile's localDir should be the user's computer if it's an inputfile but what would the remoteDir mean? Would it be the subdir on the WN? When it's an outputfile, where should it be put in the DFC? Inside remoteDir or inside localDir? When outputting, does it expect to find it written to remoteDir on the WN or localDir (it's doing a put from local to remote at that point)? This is complicated because a DiracFile has three logical locations (user's computer, WN and DFC).
There is a similar problem with MassStorageFile. The upcoming SharedFile is further complicated by potentially only having one logical location (the shared file system).
I think that either we need three schema attributes (localDir, WNDir, storageDir) or we need to have some way of distinguishing between input and output files in a clean way.
remoteDir in the case of a DiracFile means a script needs to have a line injected to move the file after it's been made available to the worker node by the DIRAC system. An LFNPath of some sort is needed to sort out the LFN naming strategy of DiracFiles; I would argue this is yet another PR where we need to work to reach feature parity.
The location of a file on a WN is common to all file types, hence my wanting to introduce a common name for this. The localDir is unique to files which are locally accessible before an upload mechanism of some sort moves them to another storage medium or a download mechanism gets them. (I can imagine a DSTFile, for instance, which would keep track of a file but never allow you to download it as it's too big.) Things like the DFC are unique to the file types which handle them.
This has been mentioned in several different topics and PRs. Is it worth having a large to-do list for files, which would make it easier to know what functionality is needed or missing?
Testing of these becomes tricky too, but I have a small script which does things like create new Local/Mass/DiracFiles and then move them to another medium after the job has finished, and this highlights most problems.
Coming late to this party and not knowing the full story, can I make a possibly naive suggestion to clear up the naming and have it called workerDir rather than remoteDir? That at least makes it completely clear which machine is being referred to :smile:
[EDIT: Just realised this is what @milliams suggested above. Ignore me!]
I think that either we need three schema attributes (localDir, WNDir, storageDir) or we need to have some way of distinguishing between input and output files in a clean way.
@milliams Distinguishing between input and output files is hard as often the output of one job is used as the input for another.
Ok, so the three things we need are:
- localDir: the full(?) path to the directory holding the local file (source for a put and target for a get)
- workerSubDir: the subdirectory under the working dir on the WN that the file is copied to when it's an input file and copied from when it's an output file
- remoteDir (or storageDir): the full(?) path to the directory (e.g. in the DFC or on EOS) where the file is copied from if it's an input file (i.e. DiracFile data) and copied to if it's an output file w.r.t. the WN. For local access to the file, it's used as the target for a put and the source for a get.
If someone wants to chain up an output file as an input file then this will work fine as remoteDir will be consistent in the two use-cases and is independent of where the file gets copied to on the WN (workerSubDir).
These three attributes work as explained above for a DiracFile. For a LocalFile, remoteDir would not make sense (or at least is logically the same as the localDir). LocalFiles are different to the other file types anyway as the WN code is less responsible for dealing with them.
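The three attributes described above, and the chaining case, can be sketched with a small value object. Everything here is illustrative: the FileLocations class, its helper methods and the example paths are made up for this sketch and do not reflect the actual Ganga schema.

```python
import posixpath
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class FileLocations:
    """Hypothetical holder for the three logical locations of one file."""
    namePattern: str
    localDir: str = ""      # directory on the user's machine (put source, get target)
    workerSubDir: str = ""  # subdirectory under the working dir on the WN
    remoteDir: str = ""     # directory on storage, e.g. in the DFC or on EOS

    def local_path(self):
        return posixpath.join(self.localDir, self.namePattern)

    def worker_path(self, work_dir):
        return posixpath.join(work_dir, self.workerSubDir, self.namePattern)

    def remote_path(self):
        return posixpath.join(self.remoteDir, self.namePattern)


# Output of a first job, written under results/ on its WN (paths are made up).
produced = FileLocations("hist.root", workerSubDir="results", remoteDir="/storage/run1")

# The same file chained as input to a second job: only the WN location changes.
consumed = replace(produced, workerSubDir="inputs/run1")
```

Because remote_path() depends only on remoteDir and namePattern, produced and consumed point at the same stored file even though they sit in different places on their respective worker nodes.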
I would promote dropping remoteDir completely in favour of workerDir in this case. Nothing has been coded up but we've needed this for a long time now.
localDir is what we expect it to be: the directory holding the namePattern locally. This should be used by all get/put methods.
DiracFile needs some relative path to name the LFN, some sort of DFCPath? This is only used by jobs when uploading a file.
MassStorageFile (and DiracFile, I think) make use of locations to upload a remote file to a location which is on remote storage. These may be real directories on disk or in the DFC. This is for the file to know and care about.
Can we drop remoteDir in favour of workerDir now to avoid future confusion?
Is there progress here? Is there a workaround I can use to preserve directory structure on the WN?
@rmatev The best work-around I know is to submit a bash job which before the user-code runs extracts a tarball which has been in the input-sandbox or an input-LFN. It's a bit of an ugly hack but it works.
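The tarball workaround described above can also be sketched in Python rather than bash, using only the standard tarfile module. The helper names pack_tree and unpack_tree are hypothetical, and the sketch assumes the tarball is one you created yourself and therefore trust.

```python
import tarfile


def pack_tree(src_dir, tarball_path):
    """Run locally: archive src_dir with its directory structure intact.

    tarball_path should live outside src_dir so the archive does not
    try to include itself.
    """
    with tarfile.open(tarball_path, "w:gz") as tar:
        tar.add(src_dir, arcname=".")  # store paths relative to src_dir


def unpack_tree(tarball_path, work_dir):
    """Run on the WN before the user code: recreate the tree under work_dir."""
    with tarfile.open(tarball_path, "r:gz") as tar:
        tar.extractall(work_dir)
```

The tarball itself would travel as an ordinary input file (input sandbox or input LFN), so the only extra step on the worker node is the call to unpack_tree before the user code starts.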
(aside) I had planned to implement this after adding proper support for wildcards and automatic DiracFile name support (which is now in). Most of the difficult legwork has been done for this: the interfaces should pass an IGangaFile to the backend which generates the JDL. This means the JDL can be expanded based upon the IGangaFile, but there needs to be a common way to manage moving files on the WN from the IGangaFile interface. There are almost enough pieces in place to implement this, but the various classes which inherit from it all need to be expanded to support this.
If someone has a free week or a PhD/summer student to throw at this it's quite a nice little feature to implement and straight-forward as part of a wider project.
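One possible shape for the common worker-node handling discussed in this thread is a single method on the base class that emits the shell lines to inject into the job wrapper, with each file type inheriting it. This is purely a sketch: SketchFile, SketchDiracFile, worker_setup_lines and build_wrapper are hypothetical names, not the real IGangaFile interface, and the code assumes namePattern is a single concrete file name rather than a wildcard.

```python
import shlex


class SketchFile:
    """Hypothetical stand-in for a file base class; not the real IGangaFile."""

    def __init__(self, namePattern, workerDir=""):
        self.namePattern = namePattern
        self.workerDir = workerDir

    def worker_setup_lines(self):
        """Shell lines to run on the WN before the user code starts."""
        if not self.workerDir:
            return []
        sub_dir = shlex.quote(self.workerDir)
        name = shlex.quote(self.namePattern)  # quoting means wildcards are not expanded
        return ["mkdir -p %s" % sub_dir, "mv %s %s/" % (name, sub_dir)]


class SketchDiracFile(SketchFile):
    """Inherits the WN handling and only adds storage-specific state."""

    def __init__(self, namePattern, workerDir="", lfn=""):
        super().__init__(namePattern, workerDir)
        self.lfn = lfn


def build_wrapper(files, user_command):
    """Assemble a bash wrapper: per-file setup lines first, then the user command."""
    lines = ["#!/bin/bash", "set -e"]
    for f in files:
        lines.extend(f.worker_setup_lines())
    lines.append(user_command)
    return "\n".join(lines)
```

Keeping the move logic in one inherited method means a backend only has to loop over its input files when generating the wrapper, while storage-specific details such as the LFN stay in the subclass.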