
fileJobStore needs to handle arbitrary (or permission denied) os.link errors to work on some filesystems

Open SpheMakh opened this issue 6 years ago • 5 comments

I'm trying to run this simple workflow.

cwltoil version: 3.15.0
cwltool version: 1.0.20180525185854

cwlfile [touchfile.cwl]

cwlVersion: v1.0
class: CommandLineTool

baseCommand: touch

hints:
  DockerRequirement:
      dockerPull: ubuntu:18.04

arguments: 
  - output-$(inputs.num).txt


inputs:
  infile:
    type: File
  num: 
    type: int 

outputs:
  outfile:
    type: File
    outputBinding:
      glob: output-$(inputs.num).txt

workflow file [multi.cwl]

cwlVersion: v1.0
class: Workflow

requirements:
  - class: SubworkflowFeatureRequirement
  - class: ScatterFeatureRequirement

inputs:
  nums: int[]
  infile: File

outputs:
  output:
    type: File[]
    outputSource: touchfile/outfile

steps:
  touchfile:
    run: touchfile.cwl
    in:
      infile: infile
      num: nums

    scatter: num

    out:
      [outfile]

job file [job.yml]

nums: [1,2,3,4,5]
infile:
  class: File
  path: message.txt

I then run this as

$ cwltoil --singularity --outdir `pwd`/outdir --jobStore `pwd`/JobStore  --workDir work multi.cwl job.yml

and get the following error:

=========> Failed job 'file:///users/sphe/cwl-crash/touchfile.cwl' touch T/9/jobyXgJmQ 
INFO:toil.worker:---TOIL WORKER OUTPUT LOG---
INFO:toil:Running Toil version 3.15.0-0e3a87e738f5e0e7cff64bfdad337d592bd92704.
Got workflow error
Traceback (most recent call last):
  File "/users/sphe/spiel/.virtualenv/local/lib/python2.7/site-packages/cwltool/executors.py", line 100, in run_jobs
    for r in jobiter:
  File "/users/sphe/spiel/.virtualenv/local/lib/python2.7/site-packages/cwltool/command_line_tool.py", line 382, in job
    builder.pathmapper = self.makePathMapper(reffiles, builder.stagedir, **make_path_mapper_kwargs)
  File "/users/sphe/spiel/.virtualenv/local/lib/python2.7/site-packages/toil/cwl/cwltoil.py", line 233, in makePathMapper
    get_file=kwargs["toil_get_file"])
  File "/users/sphe/spiel/.virtualenv/local/lib/python2.7/site-packages/toil/cwl/cwltoil.py", line 196, in __init__
    stagedir, separateDirs=separateDirs)
  File "/users/sphe/spiel/.virtualenv/local/lib/python2.7/site-packages/cwltool/pathmapper.py", line 218, in __init__
    self.setup(dedup(referenced_files), basedir)
  File "/users/sphe/spiel/.virtualenv/local/lib/python2.7/site-packages/cwltool/pathmapper.py", line 271, in setup
    self.visit(fob, stagedir, basedir, copy=fob.get("writable"), staged=True)
  File "/users/sphe/spiel/.virtualenv/local/lib/python2.7/site-packages/toil/cwl/cwltoil.py", line 217, in visit
    resolved = self.get_file(loc) if self.get_file else loc
  File "/users/sphe/spiel/.virtualenv/local/lib/python2.7/site-packages/toil/cwl/cwltoil.py", line 268, in toilGetFile
    srcPath = fileStore.readGlobalFile(fileStoreID[7:])
  File "/users/sphe/spiel/.virtualenv/local/lib/python2.7/site-packages/toil/fileStore.py", line 1659, in readGlobalFile
    self.jobStore.readFile(fileStoreID, localFilePath, symlink=symlink)
  File "/users/sphe/spiel/.virtualenv/local/lib/python2.7/site-packages/toil/jobStores/fileJobStore.py", line 309, in readFile
    os.link(jobStoreFilePath, localFilePath)
OSError: [Errno 1] Operation not permitted

The error does not occur if I don't specify --workDir.

┆Issue is synchronized with this Jira Story ┆friendlyId: TOIL-262

SpheMakh, Jun 05 '18 11:06

I get the same issue. Has this been addressed? It seems I have toil version 3.14.0.

olekto, Feb 21 '20 10:02

We substantially revised how we attempt to make hard links from the fileJobStore in Toil 3.20.0. It looks like we don't handle [Errno 1] Operation not permitted specifically, but we use a more robust approach and we print some important debugging info (i.e. what it is trying to link to where) that could be used to solve the problem.

@Tierhon Could you retry with Toil 3.20.0 or later?

If that doesn't help, we could add this error to the ones that mean a hard link just isn't possible, rather than that something has gone wrong in Toil's internals. It might be that you happen to have a file system where for whatever reason hardlinking can be disallowed by permissions. Do you happen to be working with input files that aren't owned by the user running Toil?

adamnovak, Feb 21 '20 19:02

@adamnovak I'm updating to 3.24.0 now, and restarting cactus, so hopefully that will work.

The filesystem is BeeGFS. It seems that versions previous to 7 could only create hardlinks in the same folder, and not across folders. We have 7.1.3 which should not have that limitation. Fingers crossed.
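
Whether a given mount allows hard links across directories can be checked with a small probe. This is a minimal sketch and not part of Toil; the function name `cross_dir_hardlink_works` and the probe layout are made up for illustration.

```python
import os
import tempfile


def cross_dir_hardlink_works(base_dir):
    """Return True if a hard link can be created between two sibling
    directories under base_dir, False if os.link raises OSError."""
    with tempfile.TemporaryDirectory(dir=base_dir) as root:
        src_dir = os.path.join(root, "a")
        dst_dir = os.path.join(root, "b")
        os.mkdir(src_dir)
        os.mkdir(dst_dir)
        src = os.path.join(src_dir, "probe")
        with open(src, "w") as handle:
            handle.write("probe")
        try:
            # Same filesystem, different directories.
            os.link(src, os.path.join(dst_dir, "probe"))
        except OSError:
            return False
        return True


if __name__ == "__main__":
    # The system temp dir is only a placeholder; pass a directory on the
    # filesystem that holds the job store instead.
    print(cross_dir_hardlink_works(tempfile.gettempdir()))
```

The probe only covers links within one filesystem. If the job store and the work directory sit on different filesystems, os.link fails with EXDEV regardless, and the probe does not test that case.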

Ole

olekto, Feb 21 '20 21:02

@adamnovak At least I do not get this specific error anymore, but another: "TypeError: _runner() got an unexpected keyword argument 'defer'" This is the same as #2854 I guess.

Our computing cluster does not have Docker or Singularity set up, but I was glad when I saw cactus in conda. However, that version seems a bit outdated, and I am not really sure I would install cactus all on my own. The cluster should implement Docker/Singularity soon, and I guess I can revisit cactus then.

Ole

olekto, Feb 23 '20 07:02

OK. Sorry Cactus is giving you trouble now.

I'll change this issue to track that some filesystems exist with this weird hard link limitation; I wasn't aware of that. We probably need to change the code here to account for permission or whatever other weird errors they raise to enforce that and fall back on copying in those cases instead of complaining that something is wrong:

https://github.com/DataBiosphere/toil/blob/993be0c3d95c83ca969d783e708c48d918407b16/src/toil/jobStores/fileJobStore.py#L442-L460
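
A fallback of the kind described here could look roughly like the following. It is a sketch under assumptions, not Toil's actual code: `link_or_copy` and `LINK_UNSUPPORTED` are hypothetical names, and the set of errno values treated as "hard links are not possible here" is a guess at what such filesystems raise.

```python
import errno
import os
import shutil

# Assumption: these errno values mean "this filesystem will not hard-link
# here", as opposed to a bug that should be reported to the caller.
LINK_UNSUPPORTED = {errno.EPERM, errno.EACCES, errno.EXDEV, errno.EMLINK}


def link_or_copy(src, dst):
    """Hard-link src to dst, copying instead when linking is refused.

    Returns "link" or "copy" so the caller can log which path was taken.
    """
    # Remove any existing destination first, emulating shutil.copyfile(),
    # which overwrites.
    if os.path.lexists(dst):
        os.unlink(dst)
    try:
        os.link(src, dst)
        return "link"
    except OSError as e:
        if e.errno not in LINK_UNSUPPORTED:
            # Anything else (missing source, for example) is a real error.
            raise
    shutil.copyfile(src, dst)
    return "copy"
```

Removing the destination up front means the link is attempted exactly once, so there is no second, unguarded os.link call that could raise a different error.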

adamnovak, Feb 24 '20 19:02

I've been running the precompiled Cactus binary on an HPC and running into the same os.link error described above.

Batch system: slurm 22.05.8
Filesystem: BeeGFS 7.2.9
Python 3.9.12
Toil 5.9.2
Cactus 2.5.0

BeeGFS still doesn't support hard links across different directories, so os.link raises PermissionError (an OSError with errno EPERM). The snippet from read_file in toil.jobStores.fileJobStore linked above should catch this error, but it does not handle it properly when the temp file at the destination already exists. In that case the first os.link raises EEXIST, which is caught, but the retried os.link(jobStoreFilePath, local_path) can still raise a fatal EPERM because hardlinking across directories is not allowed.

Example error log:

Traceback (most recent call last):
	  File "~/cactus_env/lib/python3.9/site-packages/toil/jobStores/fileJobStore.py", line 533, in read_file
	    os.link(jobStoreFilePath, local_path)
	FileExistsError: [Errno 17] File exists: '/scratch2/cactus_jobstore/files/for-job/kind-LastzRepeatMaskJob/instance-4r7rrzq9/file-661334c1003a400d88b74450f4dfd269/Saccharina_latissima_0.maskedQeury' -> '/scratch2/cactus_tmp/76f125d1e7f95a3497d84aebb1968795/7e01/f1ea/tmp958sbruu.tmp'
	
	During handling of the above exception, another exception occurred:
	
	Traceback (most recent call last):
	  File "~/cactus_env/lib/python3.9/site-packages/toil/worker.py", line 403, in workerScript
	    job._runner(jobGraph=None, jobStore=jobStore, fileStore=fileStore, defer=defer)
	  File "~/cactus_env/lib/python3.9/site-packages/cactus/shared/common.py", line 908, in _runner
	    super(RoundedJob, self)._runner(*args, jobStore=jobStore,
	  File "~/cactus_env/lib/python3.9/site-packages/toil/job.py", line 2743, in _runner
	    returnValues = self._run(jobGraph=None, fileStore=fileStore)
	  File "~/cactus_env/lib/python3.9/site-packages/toil/job.py", line 2660, in _run
	    return self.run(fileStore)
	  File "~/cactus_env/lib/python3.9/site-packages/cactus/preprocessor/cactus_preprocessor.py", line 117, in run
	    chunkList = [readGlobalFileWithoutCache(fileStore, fileID) for fileID in self.chunkIDList]
	  File "~/cactus_env/lib/python3.9/site-packages/cactus/preprocessor/cactus_preprocessor.py", line 117, in <listcomp>
	    chunkList = [readGlobalFileWithoutCache(fileStore, fileID) for fileID in self.chunkIDList]
	  File "~/cactus_env/lib/python3.9/site-packages/cactus/shared/common.py", line 918, in readGlobalFileWithoutCache
	    fileStore.jobStore.readFile(jobStoreID, f)
	  File "~/cactus_env/lib/python3.9/site-packages/toil/lib/compatibility.py", line 12, in call
	    return func(*args, **kwargs)
	  File "~/cactus_env/lib/python3.9/site-packages/toil/jobStores/abstractJobStore.py", line 1273, in readFile
	    return self.read_file(jobStoreFileID, localFilePath, symlink)
	  File "~/cactus_env/lib/python3.9/site-packages/toil/jobStores/fileJobStore.py", line 543, in read_file
	    os.link(jobStoreFilePath, local_path)
	PermissionError: [Errno 1] Operation not permitted: '/scratch2/cactus_jobstore/files/for-job/kind-LastzRepeatMaskJob/instance-4r7rrzq9/file-661334c1003a400d88b74450f4dfd269/Saccharina_latissima_0.maskedQeury' -> '/scratch2/cactus_tmp/76f125d1e7f95a3497d84aebb1968795/7e01/f1ea/tmp958sbruu.tmp'
	[2023-04-04T20:15:46-0700] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host ...
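
The failure path in this log can be reproduced without BeeGFS. The sketch below is a simulation, not Toil code: `fake_link` is a hypothetical stand-in for os.link that reports EEXIST when the destination exists and EPERM otherwise, and `read_file_retry` only mirrors the retry pattern described above.

```python
import errno
import os
import tempfile


def fake_link(src, dst):
    """Stand-in for os.link on a filesystem that refuses the link.

    Assumption: an existing destination is reported first (EEXIST), and
    only after it is removed does the filesystem's refusal (EPERM) show.
    """
    if os.path.lexists(dst):
        raise FileExistsError(errno.EEXIST, "File exists", dst)
    raise PermissionError(errno.EPERM, "Operation not permitted", dst)


def read_file_retry(src, dst, link=fake_link):
    """On EEXIST, unlink the destination and link again with no guard."""
    try:
        link(src, dst)
        return
    except OSError as e:
        if e.errno == errno.EEXIST:
            os.unlink(dst)
            link(src, dst)  # unguarded retry: EPERM escapes from here
            return
        raise


with tempfile.TemporaryDirectory() as root:
    src = os.path.join(root, "src")
    dst = os.path.join(root, "dst")
    for path in (src, dst):
        with open(path, "w") as handle:
            handle.write("x")
    try:
        read_file_retry(src, dst)
    except PermissionError as e:
        # __context__ holds the exception that was being handled.
        chained = type(e.__context__).__name__
        print("escaped:", type(e).__name__, "while handling", chained)
```

Because the second error is raised inside the except block, Python chains it to the first, which is why the log shows both tracebacks joined by "During handling of the above exception, another exception occurred".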

In case anyone else is still dealing with this issue in 2023, this is the edit to fileJobStore.py (starting at line 532 in the read_file definition) that worked for me:

try:
    os.link(jobStoreFilePath, local_path)
    # It worked!
    return
except OSError as e:
    # For the list of the possible errno codes, see: https://linux.die.net/man/2/link
    if e.errno == errno.EEXIST:
        # Overwrite existing file, emulating shutil.copyfile().
        os.unlink(local_path)
        # It would be very unlikely to fail again for same reason but possible
        # nonetheless in which case we should just give up.
        try:
            os.link(jobStoreFilePath, local_path)
            # Now we succeeded and don't need to copy
            return
        except OSError:
            # Handles BeeGFS error where hardlinking between directories is not permitted.
            # Swallow it so execution continues past this try/except.
            pass
    elif e.errno == errno.EXDEV:
...

kdews, Apr 13 '23 18:04