workerDir feature for IGangaFile

Open rob-c opened this issue 8 years ago • 9 comments

I would like to see a remoteSubDir attribute (or alternative) added to IGangaFile, for LocalFile, MassStorageFile and DiracFile.

If defined this should be used to move files into a relative path on the worker node at the start of jobs. In addition it should be used as part of the pattern matching for files when the job has finished.
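A minimal sketch of the two behaviours requested above, assuming the job's working directory on the worker node is a plain directory. The helper names `stage_input` and `match_outputs` and the `worker_sub_dir` parameter are hypothetical and not part of Ganga's API.

```python
import glob
import os
import shutil


def stage_input(work_dir, file_name, worker_sub_dir=""):
    """Move file_name from the top of work_dir into work_dir/worker_sub_dir."""
    target_dir = os.path.join(work_dir, worker_sub_dir)
    os.makedirs(target_dir, exist_ok=True)
    source = os.path.join(work_dir, file_name)
    target = os.path.join(target_dir, file_name)
    # With an empty subdirectory the file is already in place.
    if os.path.abspath(source) != os.path.abspath(target):
        shutil.move(source, target)
    return target


def match_outputs(work_dir, name_pattern, worker_sub_dir=""):
    """Return paths (relative to work_dir) matching name_pattern inside the subdir."""
    pattern = os.path.join(work_dir, worker_sub_dir, name_pattern)
    return sorted(os.path.relpath(p, work_dir) for p in glob.glob(pattern))
```

Here the same relative path is used both when placing inputs at the start of the job and when collecting outputs at the end, which is the symmetry the request describes.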

This has been requested a few times (and may even be an open issue on GitHub that I can't find) and should take fewer than 50 lines of code in the correct few places.

Adding this feature removes one major sticking point that prevents some power-users from migrating from 6.0.44, hence why I would very much like to see this in 6.1.19.

rob-c avatar Mar 30 '16 13:03 rob-c

I'd like to do this for 6.1.20 I think

rob-c avatar Apr 18 '16 10:04 rob-c

The problem I came across when looking into this was the differences between, e.g., LocalFile, DiracFile and MassStorageFile, as well as the differences between when these objects are used as input files or output files.

A LocalFile's localDir is where the file is copied from when it is an input file and copied to when it is an output file. remoteDir is always the subdir on the WN. That's simple as a LocalFile has two logical locations (user's computer and WN).

A DiracFile's localDir should be the user's computer if it's an inputfile but what would the remoteDir mean? Would it be the subdir on the WN? When it's an outputfile, where should it be put in the DFC? Inside remoteDir or inside localDir? When outputting, does it expect to find it written to remoteDir on the WN or localDir (it's doing a put from local to remote at that point)? This is complicated because a DiracFile has three logical locations (user's computer, WN and DFC).

There is a similar problem with MassStorageFile. The upcoming SharedFile is further complicated by potentially only having one logical location (the shared file system).

I think that either we need three schema attributes (localDir, WNDir, storageDir) or we need to have some way of distinguishing between input and output files in a clean way.

milliams avatar Apr 26 '16 15:04 milliams

remoteDir in the case of a DiracFile means that a script needs to have a line injected to move the file after it has been made available to the worker node by the DIRAC system. An LFNPath of some sort is needed to sort out the LFN naming strategy of DiracFiles; I would argue this is yet another PR where we need to work to get feature parity.
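As a rough illustration of the injected line, the sketch below builds a shell command that a job wrapper script could run once the file is present on the worker node. The function `move_line` is hypothetical; how Ganga or DIRAC would actually inject it into the script is not shown and is an assumption.

```python
import shlex


def move_line(file_name, worker_dir):
    """Build a shell line that creates worker_dir and moves file_name into it."""
    directory = shlex.quote(worker_dir)
    source = shlex.quote(file_name)
    # mkdir -p makes the line safe to run when the directory already exists.
    return "mkdir -p {d} && mv {f} {d}/".format(d=directory, f=source)


print(move_line("hist.root", "results/run1"))
```

Quoting with `shlex.quote` keeps the generated line valid when a file or directory name contains spaces.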

The location of a file on a WN is common to all file types, hence my wanting to introduce a common name for this. The localDir is unique to files which are locally accessible before an upload mechanism of some sort moves them to another storage medium or a download mechanism gets them. (I can imagine a DSTFile, for instance, which would keep track of a file but never allow you to download it as it's too big.) Stuff like the DFC is unique to the files which handle it.

This has been mentioned in several different topics and PRs. Is it worth having a large todo list for files, which would make it easier to know what functionality is needed or missing?

Testing of these becomes tricky too but I've a small script which does things like create new Local/Mass/DiracFiles and then moves them to another medium after the job has finished and this highlights most problems.

rob-c avatar Apr 26 '16 15:04 rob-c

Coming late to this party and not knowing the full story, can I make a possibly naive suggestion to clear up the naming and have it called workerDir rather than remoteDir? That at least makes it completely clear which machine is being referred to :smile:

[EDIT: Just realised this is what @milliams suggested above. Ignore me!]

drmarkwslater avatar Apr 27 '16 08:04 drmarkwslater

> I think that either we need three schema attributes (localDir, WNDir, storageDir) or we need to have some way of distinguishing between input and output files in a clean way.

@milliams Distinguishing between input and output files is hard as often the output of one job is used as the input for another.

egede avatar Apr 27 '16 10:04 egede

Ok, so the three things we need are:

  • localDir - the full(?) path to the directory holding the local file (source for a put and target for a get)
  • workerSubDir - the subdirectory under the working dir on the WN that the file is copied to when it's an input file and copied from when it's an output file
  • remoteDir (or storageDir) - the full(?) path to the directory (e.g. in the DFC or on EOS) where the file is copied from if it's an input file (i.e. DiracFile data) and copied to if it's an output file w.r.t. the WN. For local access to the file, it's used as the target for a put and the source for a get.

If someone wants to chain up an output file as an input file then this will work fine as remoteDir will be consistent in the two use-cases and is independent of where the file gets copied to on the WN (workerSubDir).

These three attributes work as explained above for a DiracFile. For a LocalFile, remoteDir would not make sense (or at least is logically the same as the localDir). LocalFiles are different to the other file types anyway as the WN code is less responsible for dealing with them.
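To make the proposal concrete, here is a small sketch of the three attributes and of chaining an output file into a later job as input. `FileLocations` is a hypothetical stand-in and not a Ganga class, and the storage path is a made-up example.

```python
from dataclasses import dataclass
import posixpath


@dataclass
class FileLocations:
    """Hypothetical holder for the three locations discussed above."""
    namePattern: str
    localDir: str = ""      # directory on the user's machine
    workerSubDir: str = ""  # subdirectory under the working dir on the WN
    remoteDir: str = ""     # directory on storage (e.g. DFC or EOS)

    def worker_path(self):
        """Path of the file relative to the job's working dir on the WN."""
        return posixpath.join(self.workerSubDir, self.namePattern)

    def remote_path(self):
        """Path of the file on storage."""
        return posixpath.join(self.remoteDir, self.namePattern)


# Output of a first job (the storage path is invented for illustration) ...
out = FileLocations("hist.root", workerSubDir="results",
                    remoteDir="/example/storage/dir")
# ... reused as input of a second job, placed elsewhere on the WN.
inp = FileLocations(out.namePattern, workerSubDir="inputs",
                    remoteDir=out.remoteDir)
```

Because the storage location is held separately from the worker subdirectory, the chained input resolves to the same remote path while landing in a different place on the WN.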

milliams avatar Apr 27 '16 11:04 milliams

I would promote dropping remoteDir completely in favour of workerDir in this case. Nothing has been coded up but we've needed this for a long time now.

localDir is what we expect it to be: the directory holding the namePattern locally. This should be used by all get/put methods.

DiracFile needs some relative path to name the LFN, some sort of DFCPath? This is only used by jobs when uploading a file.

MassStorageFile (and DiracFile I think) make use of locations to upload a remote file to a location which is on remote storage. These may be real directories on disk or DFC. This is for the file to know and care about.

Can we drop remoteDir in favour of workerDir now to avoid future confusion?

rob-c avatar Apr 27 '16 13:04 rob-c

Is there progress here? Is there a workaround I can use to preserve directory structure on the WN?

rmatev avatar Oct 05 '17 08:10 rmatev

@rmatev The best workaround I know is to submit a bash job which, before the user code runs, extracts a tarball that was shipped in the input sandbox or as an input LFN. It's a bit of an ugly hack but it works.
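The workaround can be sketched with Python's standard `tarfile` module standing in for the `tar` commands a bash job would run. The helper names `pack_tree` and `unpack_tree` are hypothetical, and shipping the tarball through the input sandbox or as an LFN is left out.

```python
import os
import tarfile


def pack_tree(source_dir, tarball_path):
    """Pack source_dir into a gzipped tarball with paths relative to source_dir.

    tarball_path should live outside source_dir so the archive does not
    include itself.
    """
    with tarfile.open(tarball_path, "w:gz") as tar:
        tar.add(source_dir, arcname=".")


def unpack_tree(tarball_path, work_dir):
    """Extract the tarball into work_dir, recreating the directory structure."""
    with tarfile.open(tarball_path, "r:gz") as tar:
        tar.extractall(work_dir)
```

`pack_tree` would run on the user's machine before submission and `unpack_tree` at the start of the job on the WN, before the user code.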

(aside) I had planned to implement this after adding proper support for wildcards and automatic DiracFile name support (which is now in). Most of the difficult legwork has been done for this: the interfaces should pass an IGangaFile to the backend which generates the JDL. This means the JDL can be expanded based upon the IGangaFile, but there needs to be a common way to manage moving files on the WN from the IGangaFile interface. There are almost enough pieces in place to implement this, but the various classes which inherit from it all need to be expanded to support this.

If someone has a free week or a PhD/summer student to throw at this it's quite a nice little feature to implement and straight-forward as part of a wider project.

rob-c avatar Oct 11 '17 10:10 rob-c