toil
fileJobStore needs to handle arbitrary (or permission denied) os.link errors to work on some filesystems
I'm trying to run this simple workflow.
cwltoil version: 3.15.0
cwltool version: 1.0.20180525185854
cwlfile [touchfile.cwl]
cwlVersion: v1.0
class: CommandLineTool
baseCommand: touch
hints:
  DockerRequirement:
    dockerPull: ubuntu:18.04
arguments:
  - output-$(inputs.num).txt
inputs:
  infile:
    type: File
  num:
    type: int
outputs:
  outfile:
    type: File
    outputBinding:
      glob: output-$(inputs.num).txt
workflow file [multi.cwl]
cwlVersion: v1.0
class: Workflow
requirements:
  - class: SubworkflowFeatureRequirement
  - class: ScatterFeatureRequirement
inputs:
  nums: int[]
  infile: File
outputs:
  output:
    type: File[]
    outputSource: touchfile/outfile
steps:
  touchfile:
    run: touchfile.cwl
    in:
      infile: infile
      num: nums
    scatter: num
    out:
      [outfile]
job file [job.yml]
nums: [1,2,3,4,5]
infile:
  class: File
  path: message.txt
I then run this as
$ cwltoil --singularity --outdir `pwd`/outdir --jobStore `pwd`/JobStore --workDir work multi.cwl job.yml
and get the following error:
=========> Failed job 'file:///users/sphe/cwl-crash/touchfile.cwl' touch T/9/jobyXgJmQ
INFO:toil.worker:---TOIL WORKER OUTPUT LOG---
INFO:toil:Running Toil version 3.15.0-0e3a87e738f5e0e7cff64bfdad337d592bd92704.
Got workflow error
Traceback (most recent call last):
  File "/users/sphe/spiel/.virtualenv/local/lib/python2.7/site-packages/cwltool/executors.py", line 100, in run_jobs
    for r in jobiter:
  File "/users/sphe/spiel/.virtualenv/local/lib/python2.7/site-packages/cwltool/command_line_tool.py", line 382, in job
    builder.pathmapper = self.makePathMapper(reffiles, builder.stagedir, **make_path_mapper_kwargs)
  File "/users/sphe/spiel/.virtualenv/local/lib/python2.7/site-packages/toil/cwl/cwltoil.py", line 233, in makePathMapper
    get_file=kwargs["toil_get_file"])
  File "/users/sphe/spiel/.virtualenv/local/lib/python2.7/site-packages/toil/cwl/cwltoil.py", line 196, in __init__
    stagedir, separateDirs=separateDirs)
  File "/users/sphe/spiel/.virtualenv/local/lib/python2.7/site-packages/cwltool/pathmapper.py", line 218, in __init__
    self.setup(dedup(referenced_files), basedir)
  File "/users/sphe/spiel/.virtualenv/local/lib/python2.7/site-packages/cwltool/pathmapper.py", line 271, in setup
    self.visit(fob, stagedir, basedir, copy=fob.get("writable"), staged=True)
  File "/users/sphe/spiel/.virtualenv/local/lib/python2.7/site-packages/toil/cwl/cwltoil.py", line 217, in visit
    resolved = self.get_file(loc) if self.get_file else loc
  File "/users/sphe/spiel/.virtualenv/local/lib/python2.7/site-packages/toil/cwl/cwltoil.py", line 268, in toilGetFile
    srcPath = fileStore.readGlobalFile(fileStoreID[7:])
  File "/users/sphe/spiel/.virtualenv/local/lib/python2.7/site-packages/toil/fileStore.py", line 1659, in readGlobalFile
    self.jobStore.readFile(fileStoreID, localFilePath, symlink=symlink)
  File "/users/sphe/spiel/.virtualenv/local/lib/python2.7/site-packages/toil/jobStores/fileJobStore.py", line 309, in readFile
    os.link(jobStoreFilePath, localFilePath)
OSError: [Errno 1] Operation not permitted
The error does not occur if I don't specify --workDir.
┆Issue is synchronized with this Jira Story ┆friendlyId: TOIL-262
I get the same issue. Has this been addressed? It seems I have toil version 3.14.0.
We substantially revised how we attempt to make hard links from the fileJobStore in Toil 3.20.0. It looks like we don't handle [Errno 1] Operation not permitted specifically, but we use a more robust approach and print some important debugging info (i.e. what it is trying to link to where) that could be used to solve the problem.
@Tierhon Could you retry with Toil 3.20.0 or later?
If that doesn't help, we could add this error to the ones that mean a hard link just isn't possible, rather than that something has gone wrong in Toil's internals. It might be that you happen to have a file system where for whatever reason hardlinking can be disallowed by permissions. Do you happen to be working with input files that aren't owned by the user running Toil?
@adamnovak I'm updating to 3.24.0 now, and restarting cactus, so hopefully that will work.
The filesystem is BeeGFS. It seems that versions previous to 7 could only create hardlinks in the same folder, and not across folders. We have 7.1.3 which should not have that limitation. Fingers crossed.
Ole
@adamnovak At least I do not get this specific error anymore, but another: "TypeError: _runner() got an unexpected keyword argument 'defer'" This is the same as #2854 I guess.
Our computing cluster does not have Docker or Singularity set up, so I was glad when I saw cactus in conda. However, that version seems a bit outdated, and I am not really sure I want to install cactus all on my own. The cluster should get Docker/Singularity soon, and I guess I can revisit cactus then.
Ole
OK. Sorry Cactus is giving you trouble now.
I'll change this issue to track the fact that some filesystems have this weird hard link limitation; I wasn't aware of that. We probably need to change the code here so that permission errors, or whatever other errors such filesystems raise to enforce the limitation, make us fall back on copying instead of complaining that something is wrong:
https://github.com/DataBiosphere/toil/blob/993be0c3d95c83ca969d783e708c48d918407b16/src/toil/jobStores/fileJobStore.py#L442-L460
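To make the proposed behaviour concrete, here is a minimal standalone sketch of "try to hard link, fall back on copying". It is not Toil's actual code: `link_or_copy` and `LINK_UNSUPPORTED` are hypothetical names, and the set of errno values treated as "hard links are not possible here" is an assumption that would need checking against the filesystems people actually hit.

```python
import errno
import os
import shutil

# Assumption: errno values that mean "this filesystem will not hard-link here"
# rather than "something is broken". The exact set is a judgment call.
LINK_UNSUPPORTED = {errno.EPERM, errno.EACCES, errno.EXDEV}


def link_or_copy(src: str, dst: str) -> str:
    """Hard-link src to dst, copying instead if linking is not possible.

    Returns "link" or "copy" to report which path was taken.
    """
    for attempt in range(2):
        try:
            os.link(src, dst)
            return "link"
        except OSError as e:
            if e.errno == errno.EEXIST and attempt == 0:
                # Remove the stale destination, then retry the link once.
                os.unlink(dst)
                continue
            if e.errno in LINK_UNSUPPORTED or e.errno == errno.EEXIST:
                # Linking is not going to work; fall back on copying.
                break
            # Anything else is unexpected and should surface to the caller.
            raise
    shutil.copyfile(src, dst)
    return "copy"
```

The point of the loop is that the retry after an `EEXIST` goes through the same error handling as the first attempt, so a permission error on the second `os.link` call also ends in a copy instead of a crash.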
I've been running the precompiled Cactus binary on an HPC and running into the same os.link error described above.
Batch system: slurm 22.05.8
Filesystem: BeeGFS 7.2.9
Python 3.9.12
Toil 5.9.2
Cactus 2.5.0
BeeGFS still doesn't support hardlinking across different directories, and attempting it raises a PermissionError (an OSError with errno EPERM). This error should be caught by the snippet from toil.jobStores.fileJobStore read_file linked above, but it is not properly handled if the code first encounters a temp file that already exists. In that case the first EEXIST error is caught, but the next os.link(jobStoreFilePath, localFilePath) line can still raise a fatal EPERM error if hardlinking across directories is not allowed.
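For anyone wanting to check whether their own filesystem has this limitation before starting a long run, a small probe along these lines might help. It is only a sketch: `can_hardlink_across_dirs` is a hypothetical helper, not part of Toil, and it only tests two sibling directories under the path you give it.

```python
import os
import tempfile


def can_hardlink_across_dirs(base: str) -> bool:
    """Return True if a file in one directory under base can be
    hard-linked into a sibling directory, False if os.link fails."""
    # The scratch directory and everything in it is removed on exit.
    with tempfile.TemporaryDirectory(dir=base) as root:
        src_dir = os.path.join(root, "a")
        dst_dir = os.path.join(root, "b")
        os.mkdir(src_dir)
        os.mkdir(dst_dir)
        src = os.path.join(src_dir, "probe")
        with open(src, "w") as f:
            f.write("probe")
        try:
            os.link(src, os.path.join(dst_dir, "probe"))
        except OSError:
            # EPERM on BeeGFS, or any other refusal to link.
            return False
        return True
```

Calling it as `can_hardlink_across_dirs("/scratch2")` would exercise the filesystem from the log below. If the job store and the work directory live on different filesystems, the link can still fail with EXDEV even when this probe returns True.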
Example error log:
Traceback (most recent call last):
  File "~/cactus_env/lib/python3.9/site-packages/toil/jobStores/fileJobStore.py", line 533, in read_file
    os.link(jobStoreFilePath, local_path)
FileExistsError: [Errno 17] File exists: '/scratch2/cactus_jobstore/files/for-job/kind-LastzRepeatMaskJob/instance-4r7rrzq9/file-661334c1003a400d88b74450f4dfd269/Saccharina_latissima_0.maskedQeury' -> '/scratch2/cactus_tmp/76f125d1e7f95a3497d84aebb1968795/7e01/f1ea/tmp958sbruu.tmp'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "~/cactus_env/lib/python3.9/site-packages/toil/worker.py", line 403, in workerScript
    job._runner(jobGraph=None, jobStore=jobStore, fileStore=fileStore, defer=defer)
  File "~/cactus_env/lib/python3.9/site-packages/cactus/shared/common.py", line 908, in _runner
    super(RoundedJob, self)._runner(*args, jobStore=jobStore,
  File "~/cactus_env/lib/python3.9/site-packages/toil/job.py", line 2743, in _runner
    returnValues = self._run(jobGraph=None, fileStore=fileStore)
  File "~/cactus_env/lib/python3.9/site-packages/toil/job.py", line 2660, in _run
    return self.run(fileStore)
  File "~/cactus_env/lib/python3.9/site-packages/cactus/preprocessor/cactus_preprocessor.py", line 117, in run
    chunkList = [readGlobalFileWithoutCache(fileStore, fileID) for fileID in self.chunkIDList]
  File "~/cactus_env/lib/python3.9/site-packages/cactus/preprocessor/cactus_preprocessor.py", line 117, in <listcomp>
    chunkList = [readGlobalFileWithoutCache(fileStore, fileID) for fileID in self.chunkIDList]
  File "~/cactus_env/lib/python3.9/site-packages/cactus/shared/common.py", line 918, in readGlobalFileWithoutCache
    fileStore.jobStore.readFile(jobStoreID, f)
  File "~/cactus_env/lib/python3.9/site-packages/toil/lib/compatibility.py", line 12, in call
    return func(*args, **kwargs)
  File "~/cactus_env/lib/python3.9/site-packages/toil/jobStores/abstractJobStore.py", line 1273, in readFile
    return self.read_file(jobStoreFileID, localFilePath, symlink)
  File "~/cactus_env/lib/python3.9/site-packages/toil/jobStores/fileJobStore.py", line 543, in read_file
    os.link(jobStoreFilePath, local_path)
PermissionError: [Errno 1] Operation not permitted: '/scratch2/cactus_jobstore/files/for-job/kind-LastzRepeatMaskJob/instance-4r7rrzq9/file-661334c1003a400d88b74450f4dfd269/Saccharina_latissima_0.maskedQeury' -> '/scratch2/cactus_tmp/76f125d1e7f95a3497d84aebb1968795/7e01/f1ea/tmp958sbruu.tmp'
[2023-04-04T20:15:46-0700] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host ...
In case anyone else is still dealing with this issue in 2023, this is the edit to fileJobStore.py (starting at line 532 in the read_file definition) that worked for me:
try:
    os.link(jobStoreFilePath, local_path)
    # It worked!
    return
except OSError as e:
    # For the list of the possible errno codes, see: https://linux.die.net/man/2/link
    if e.errno == errno.EEXIST:
        # Overwrite existing file, emulating shutil.copyfile().
        os.unlink(local_path)
        # It would be very unlikely to fail again for same reason but possible
        # nonetheless, in which case we give up on linking and fall through to the copy.
        try:
            os.link(jobStoreFilePath, local_path)
            # Now we succeeded and don't need to copy
            return
        except OSError:
            # Handles BeeGFS error where hardlinking between directories is not permitted
            pass
    elif e.errno == errno.EXDEV:
        ...