CWL: Too many bind mounts in Singularity.
Discussion and original issue is here: https://cwl.discourse.group/t/too-many-arguments-on-the-command-line/248/2
Short summary: if a step takes a directory array (Directory[]) as input and the workflow is run with Singularity, a sufficiently long file list produces a "Too many arguments on the command line" error. I am currently running the workflow with Toil.
[2020-12-02T11:21:48+0100] [MainThread] [W] [toil.leader] The job seems to have left a log file, indicating failure: 'file:///project/astroneosc/Software/prefactor3-cwl/lofar-cwl/steps/check_ateam_separation.cwl#check_ateam_separation' python3 /usr/local/bin/check_Ateam_separation.py kind-file_project_astroneosc_Software_prefactor3-cwl_lofar-cwl_steps_check_ateam_separation.cwl_check_ateam_separation/instance-r0brc7sq
[2020-12-02T11:21:48+0100] [MainThread] [W] [toil.leader] Log from job kind-file_project_astroneosc_Software_prefactor3-cwl_lofar-cwl_steps_check_ateam_separation.cwl_check_ateam_separation/instance-r0brc7sq follows:
=========>
/table.dat:ro \
--bind \
/project/astroneosc/Data/tmp/node-70e26f65-197b-49f9-90aa-52b42e8d7822-4b184c8e-e9fd-4784-92c4-5ace3fd7ef2c/tmp074diqpq/31cdf995-536c-4d07-9b48-c72e6df42315/tmpsvnitjnw.tmp:/var/lib/cwl/stga6073e5c-0ba5-472f-9790-6480440e0258/L755125_SB222_uv.MS/table.f4_TSM0:ro \
--bind \
/project/astroneosc/Data/tmp/node-70e26f65-197b-49f9-90aa-52b42e8d7822-4b184c8e-e9fd-4784-92c4-5ace3fd7ef2c/tmp074diqpq/31cdf995-536c-4d07-9b48-c72e6df42315/tmpfw7whow7.tmp:/var/lib/cwl/stga6073e5c-0ba5-472f-9790-6480440e0258/L755125_SB222_uv.MS/DATA_DESCRIPTION/table.info:ro \
--bind \
/project/astroneosc/Data/tmp/node-70e26f65-197b-49f9-90aa-52b42e8d7822-4b184c8e-e9fd-4784-92c4-5ace3fd7ef2c/tmp074diqpq/31cdf995-536c-4d07-9b48-c72e6df42315/tmp1o7zp770.tmp:/var/lib/cwl/stga6073e5c-0ba5-472f-9790-6480440e0258/L755125_SB222_uv.MS/DATA_DESCRIPTION/table.f0:ro \
--bind \
/project/astroneosc/Data/tmp/node-70e26f65-197b-49f9-90aa-52b42e8d7822-4b184c8e-e9fd-4784-92c4-5ace3fd7ef2c/tmp074diqpq/31cdf995-536c-4d07-9b48-c72e6df42315/tmp8clmv0ww.tmp:/var/lib/cwl/stga6073e5c-0ba5-472f-9790-6480440e0258/L755125_SB222_uv.MS/DATA_DESCRIPTION/table.dat:ro \
--bind \
/project/astroneosc/Data/tmp/node-70e26f65-197b-49f9-90aa-52b42e8d7822-4b184c8e-e9fd-4784-92c4-5ace3fd7ef2c/tmp074diqpq/31cdf995-536c-4d07-9b48-c72e6df42315/tmpj09nien5.tmp:/var/lib/cwl/stga6073e5c-0ba5-472f-9790-6480440e0258/L755125_SB222_uv.MS/table.f0:ro \
--bind \
/project/astroneosc/Data/tmp/node-70e26f65-197b-49f9-90aa-52b42e8d7822-4b184c8e-e9fd-4784-92c4-5ace3fd7ef2c/tmp074diqpq/31cdf995-536c-4d07-9b48-c72e6df42315/tmpnmj0wmwj.tmp:/var/lib/cwl/stga6073e5c-0ba5-472f-9790-6480440e0258/L755125_SB222_uv.MS/QUALITY_FREQUENCY_STATISTIC/table.info:ro \
--bind \
/project/astroneosc/Data/tmp/node-70e26f65-197b-49f9-90aa-52b42e8d7822-4b184c8e-e9fd-4784-92c4-5ace3fd7ef2c/tmp074diqpq/31cdf995-536c-4d07-9b48-c72e6df42315/tmp3xtw2kh7.tmp
[...]
--pwd \
/vWWYEQ \
/project/astroneosc/Software/prefactor3.simg \
python3 \
/usr/local/bin/check_Ateam_separation.py \
/var/lib/cwl/stg7dbf09a8-5fa9-48ed-b4c2-fe2eb29f266a/L755125_SB000_uv.MS \
/var/lib/cwl/stgf721eb4f-99dd-4b87-8fe6-9e250a7317ce/L755125_SB002_uv.MS \
/var/lib/cwl/stg5b01d833-66e4-4dcd-9877-d33c7b7cd5b9/L755125_SB005_uv.MS \
/var/lib/cwl/stg55dc76f5-ce12-462c-bffd-1f6d2e4d66bb/L755125_SB004_uv.MS \
/var/lib/cwl/stgf8d851aa-9737-4120-af95-241e53ef984b/L755125_SB006_uv.MS \
/var/lib/cwl/stg2d1ebe28-a26f-4442-a4f7-e9fb177653fe/L755125_SB007_uv.MS \
/var/lib/cwl/stgf4f9f1b2-855c-433f-87fc-f9b57e69f060/L755125_SB013_uv.MS \
/var/lib/cwl/stgfbef4e7b-b90e-49eb-bb24-bc9074dec3ef/L755125_SB010_uv.MS \
[...]
--min_separation \
30 \
--outputimage \
Ateam_separation.png > /project/astroneosc/Data/tmp/node-70e26f65-197b-49f9-90aa-52b42e8d7822-4b184c8e-e9fd-4784-92c4-5ace3fd7ef2c/tmp074diqpq/31cdf995-536c-4d07-9b48-c72e6df42315/tu3w97mbq/tmp-outcd3vx2cf/Ateam_separation.log
[2020-12-02T11:21:43+0100] [MainThread] [E] [cwltool] Exception while running job
Traceback (most recent call last):
File "/home/astroneosc-mmancini/.local/lib/python3.6/site-packages/cwltool/job.py", line 394, in _execute
default_stderr=runtimeContext.default_stderr,
File "/home/astroneosc-mmancini/.local/lib/python3.6/site-packages/cwltool/job.py", line 955, in _job_popen
universal_newlines=True,
File "/usr/lib64/python3.6/subprocess.py", line 729, in __init__
restore_signals, start_new_session)
File "/usr/lib64/python3.6/subprocess.py", line 1364, in _execute_child
raise child_exception_type(errno_num, err_msg, err_filename)
OSError: [Errno 7] Argument list too long: 'singularity'
[2020-12-02T11:21:43+0100] [MainThread] [W] [cwltool] [job check_ateam_separation] completed permanentFail
[2020-12-02T11:21:45+0100] [MainThread] [W] [toil.fileStores.abstractFileStore] LOG-TO-MASTER: Job used more disk than requested. Consider modifying the user script to avoid the chance of failure due to incorrectly requested resources. Job files/for-job/kind-CWLWorkflow/instance-rcfqyxlv/cleanup/file-5wk8511s/stream used 2725.25% (81.8 GB [87786401792B] used, 3.0 GB [3221225472B] requested) at the end of its run.
Traceback (most recent call last):
File "/home/astroneosc-mmancini/.local/lib/python3.6/site-packages/toil/worker.py", line 368, in workerScript
job._runner(jobGraph=jobGraph, jobStore=jobStore, fileStore=fileStore, defer=defer)
File "/home/astroneosc-mmancini/.local/lib/python3.6/site-packages/toil/job.py", line 1424, in _runner
returnValues = self._run(jobGraph, fileStore)
File "/home/astroneosc-mmancini/.local/lib/python3.6/site-packages/toil/job.py", line 1361, in _run
return self.run(fileStore)
File "/home/astroneosc-mmancini/.local/lib/python3.6/site-packages/toil/cwl/cwltoil.py", line 988, in run
raise cwltool.errors.WorkflowException(status)
cwltool.errors.WorkflowException: permanentFail
[2020-12-02T11:21:45+0100] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host wn-db-02.novalocal
What is needed is some way of invoking Singularity with an arbitrary number of bind mounts.
I think the SINGULARITY_BIND option recommended by @matmanc should work. Here's a first attempt at that: https://github.com/common-workflow-language/cwltool/pull/1386
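For illustration only (this is not cwltool's actual implementation, and the function name is made up): the idea is to hand Singularity the bind specs through the SINGULARITY_BIND environment variable rather than as repeated --bind flags.

import os
import subprocess

def run_singularity_with_env_binds(image, command, bind_specs):
    # bind_specs: list of 'host:container:opts' strings, joined into one
    # comma-separated environment variable instead of one --bind per file.
    env = dict(os.environ)
    env["SINGULARITY_BIND"] = ",".join(bind_specs)
    return subprocess.run(["singularity", "exec", image] + list(command),
                          env=env, check=True)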
@DailyDreaming it was pointed out by @tetron that my approach won't work; it will run into the same E2BIG: https://github.com/common-workflow-language/cwltool/pull/1386#issuecomment-739333597
A better solution overall would be to create a hardlink tree so that only the base directory needs to be mounted into the container.
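A minimal sketch of that hardlink-tree idea, assuming every staged file maps under a common in-container prefix (the /var/lib/cwl prefix seen in the log above); the helper name and the single read-only bind are illustrative, not the eventual implementation:

import os

def build_hardlink_tree(pairs, staging_root, container_prefix="/var/lib/cwl"):
    # pairs: iterable of (host_path, container_path) for every staged file.
    # Mirror the in-container layout under one staging directory so a single
    # bind of staging_root -> container_prefix replaces one --bind per file.
    for host_path, container_path in pairs:
        rel = os.path.relpath(container_path, container_prefix)
        staged = os.path.join(staging_root, rel)
        os.makedirs(os.path.dirname(staged), exist_ok=True)
        # hard links need the same filesystem; fall back to a copy otherwise
        os.link(host_path, staged)
    return ["--bind", f"{staging_root}:{container_prefix}:ro"]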
@mr-c @tetron I'm testing on a scaled-down version of matmanc's example test https://cwl.discourse.group/t/too-many-arguments-on-the-command-line/248/20 (changing for i in {1..2024} -> for i in {1..20} in create_file.cwl).
cwltool seems to run this successfully in about a minute, while toil takes 24 minutes to fail.
Adding batching by directory in Toil may help, and makes sense to me. I brought this up with @adamnovak, and he said that they already batch directories for Toil in vg and sent the following link: https://github.com/vgteam/toil-vg/blob/295ea704cf64e8673a21a04fcf063ce0ee08d29f/src/toil_vg/iostore.py#L79
I think we should try to skip the compression, but implementing directory batching may help solve both of the following (a rough sketch follows the list):
- Speeding up toil's import of overly populous directories.
- Submitting a less verbose arg set of bind mounts.
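A rough sketch of the directory-batching idea, using Toil's fileStore API (getLocalTempDir, writeGlobalFile, readGlobalFile) and skipping compression; this is only an illustration, not a claim about how toil-vg's iostore actually does it:

import os
import tarfile

def import_directory_batched(file_store, directory):
    # Pack the directory into a single uncompressed tar (mode 'w', i.e. no
    # gzip, per "skip the compression") and store it as one jobstore file
    # instead of sizing and writing every member individually.
    archive = os.path.join(file_store.getLocalTempDir(), "batched_dir.tar")
    with tarfile.open(archive, mode="w") as tar:
        tar.add(directory, arcname=os.path.basename(directory))
    return file_store.writeGlobalFile(archive)  # a single FileID

def export_directory_batched(file_store, file_id, dest_dir):
    # Recreate the directory next to the job by unpacking the one archive.
    local_tar = file_store.readGlobalFile(file_id)
    with tarfile.open(local_tar, mode="r") as tar:
        tar.extractall(dest_dir)
    return dest_dir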
I think on both the toil and cwltool sides we should attempt to lift the default 8 MB stack limit when allowed to do so, and with it the derived 1/4-of-stack cap (about 2 MB) on command-line arguments, especially since @tetron mentioned that his research led him to believe the env vars share the same memory space: https://github.com/common-workflow-language/cwltool/pull/1386#issuecomment-739333597 .
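To make the limit concrete: on Linux the space execve() grants to argv plus the environment is tied to the stack rlimit (roughly a quarter of it), so the default 8 MB stack yields the ~2 MB budget being exceeded here. A hedged sketch of raising that limit for the child process, assuming the hard limit allows it (the function name is hypothetical):

import resource
import subprocess

def spawn_with_larger_arg_budget(argv, stack_bytes=64 * 1024 * 1024):
    # Raise RLIMIT_STACK in the child between fork and exec; the argv +
    # environment budget checked by execve() scales with this rlimit.
    def raise_stack_limit():
        soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
        if hard == resource.RLIM_INFINITY:
            new_soft = stack_bytes
        else:
            # an unprivileged process cannot raise the soft limit past the hard limit
            new_soft = min(stack_bytes, hard)
        resource.setrlimit(resource.RLIMIT_STACK, (new_soft, hard))

    return subprocess.run(argv, preexec_fn=raise_stack_limit, check=True)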
Seems like tarring it up and then untarring it later amounts to the same amount of I/O as copying all the files out of the file store to reconstruct the directory. I don't think using a tar file is a good general solution. Copying into a temporary directory tree is easy, and it can use hard links or symlinks if we want to get a bit more clever.
@tetron Hmmm... I'll defer to your intuition, and I agree that if it were simply writing the tar vs. writing the recursive directory, it would be roughly the same I/O. I was just thinking that in Toil/Python we size each file and convert it to a FileID object before writing each file to the jobstore individually, and I was hoping that dropping that overhead (especially sizing the individual files) might improve things.
If I understand correctly, the temporary directory tree would only need to be implemented on the cwltool side? I assume this would look something like (very roughly):
import os
import tempfile

# locations: iterable of (src, dst, read_write) tuples describing the binds
associations = {}
associated_tmp_dirs = {}
already_existing_bind_mount_args = set()

for src, dst, read_write in locations:
    # make a unique tmpdir for each source directory being mounted
    src_dir = os.path.dirname(src)
    if src_dir not in associated_tmp_dirs:
        associated_tmp_dirs[src_dir] = tempfile.mkdtemp()
    temp_dir = associated_tmp_dirs[src_dir]
    associations[src] = {'src_dir': temp_dir, 'dst': dst}

    bind_arg = f'{src_dir}:{temp_dir}:{read_write}'
    if bind_arg not in already_existing_bind_mount_args:
        already_existing_bind_mount_args.add(bind_arg)
        add_bind_mount(bind_arg)

run_hard_link_from_tmp_dir_to_real_locations_inside_of_container(associations)
I'll try to open a PR to this effect.
@tetron Will try to push the PR sometime tomorrow. Right now I'm attempting to group files with a common basedir together, create a tempdir, hardlink the files into the tempdir, and then bind mount a minimal set of tempdirs (and original dirs if they weren't files) to the file directories that Singularity originally wanted to find the files in.
We have hit this issue as well, and I was wondering if it might be easier to bind the whole jobstore and workdir folders, instead of binding each file inside of them individually.
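To illustrate that suggestion (names are hypothetical): collapse the per-file bind specs into one bind per enclosing root, such as the jobstore and work directories, and keep per-file binds only for paths outside those roots. The wrinkle, visible in the log above, is that the per-file binds remap host paths to different container paths, so the command line passed to the tool would still have to be rewritten to use the host-style paths.

import os

def collapse_binds_to_parents(bind_specs, roots):
    # bind_specs: ['host_path:container_path:opts', ...]
    # roots: directories (e.g. the jobstore dir and the work dir) to bind wholesale
    collapsed = [f"{root}:{root}:rw" for root in roots]  # one bind per root tree
    kept = []
    for spec in bind_specs:
        host_path = spec.split(":", 1)[0]
        inside = any(host_path == root or host_path.startswith(root + os.sep)
                     for root in roots)
        if not inside:
            kept.append(spec)  # anything outside the roots stays per-file
    return collapsed + kept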