GCS doesn't support symlinks, so Google Batch executor should use hardlinks
Bug report
Expected behavior and actual behavior
When running a pipeline (in this instance nf-core/quantms) with input files hosted on GCS and a workdir on GCS, Nextflow attempts to stage the files using symlinks, but GCS does not appear to support them.
Steps to reproduce the problem
Using the nf-core/quantms pipeline and test_dia profile, tasks 1 and 2 work fine.
Now host the seed file from the test_dia profile on GCS and run the pipeline with it as the input: task 1 completes, but task 2 fails. The error report states that the input file for task 2 is missing.
Adding stageInMode = 'link' fixes the issue.
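For reference, a stage-in mode override like the one above typically lives in the pipeline's nextflow.config; this is a sketch of the global form (it can also be scoped to individual processes):

```groovy
// nextflow.config — stage inputs with hard links (or copies) instead of
// the default symlinks when the work directory is a GCS-backed mount
process {
    stageInMode = 'link'   // note: later in this thread, 'copy' is what was actually verified
}
```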
Program output
Error executing process > 'NFCORE_QUANTMS:QUANTMS:CREATE_INPUT_CHANNEL:SDRFPARSING (PXD026600.sdrf.tsv)'
Caused by:
Process `NFCORE_QUANTMS:QUANTMS:CREATE_INPUT_CHANNEL:SDRFPARSING (PXD026600.sdrf.tsv)` terminated with an error exit status (1)
Command executed:
## -t2 since the one-table format parser is broken in OpenMS2.5
## -l for legacy behavior to always add sample columns
parse_sdrf convert-openms \
-t2 -l \
--extension_convert raw:mzML,.gz:,.tar.gz:,.tar:,.zip: \
-s PXD026600.sdrf.tsv \
\
2>&1 | tee PXD026600.sdrf_parsing.log
mv openms.tsv PXD026600.sdrf_config.tsv
mv experimental_design.tsv PXD026600.sdrf_openms_design.tsv
cat <<-END_VERSIONS > versions.yml
"NFCORE_QUANTMS:QUANTMS:CREATE_INPUT_CHANNEL:SDRFPARSING":
sdrf-pipelines: $(parse_sdrf --version 2>&1 | awk -F ' ' '{print $2}')
END_VERSIONS
Command exit status:
1
Command output:
OpenMS().openms_convert(sdrf, onetable, legacy, verbose, conditionsfromcolumns, extension_convert)
File "/usr/local/lib/python3.11/site-packages/sdrf_pipelines/openms/openms.py", line 242, in openms_convert
sdrf = pd.read_table(sdrf_file)
^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/pandas/io/parsers/readers.py", line 1282, in read_table
return _read(filepath_or_buffer, kwds)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/pandas/io/parsers/readers.py", line 611, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/pandas/io/parsers/readers.py", line 1448, in __init__
self._engine = self._make_engine(f, self.engine)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/pandas/io/parsers/readers.py", line 1705, in _make_engine
self.handles = get_handle(
^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/pandas/io/common.py", line 863, in get_handle
handle = open(
^^^^^
FileNotFoundError: [Errno 2] No such file or directory: 'PXD026600.sdrf.tsv'
Environment
- Nextflow version: 23.10.1 build 5891
- Operating system: Linux
Launched from Tower to Google Batch
Additional context
(Add any other context about the problem here)
Update: It looks like I forgot to push the stageInMode = 'link' commit before I ran my final test, so what I actually verified is that stageInMode = 'copy' fixes the issue.
It looks like gcsfuse has supported symlinks for a while: https://github.com/GoogleCloudPlatform/gcsfuse/issues/12 . Maybe there was a regression?
The staging command for google batch is defined here:
https://github.com/nextflow-io/nextflow/blob/82de4bfe726da274999cb6a5e666320df2a6f18d/modules/nextflow/src/main/groovy/nextflow/executor/SimpleFileCopyStrategy.groovy#L216-L231
@harper357 can you give me a directory listing for a task directory when using symlink vs link? Just do an ls -al in the task script. I'm guessing that symlink'ed files are just not showing up for some reason.
Sorry, I am a little confused by your ask. I am using a GCS bucket as the workdir (as per the documentation), so I am not sure how I would catch the worker node and ssh in before it crashes.
Are you asking for the GCS directory listing?
I was thinking to just add ls -al to the process script; then you should see the directory listing in the error message you showed.
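A minimal sketch of that suggestion, using a hypothetical debug process (the ls -al line is the only addition to a normal script block):

```groovy
// Hypothetical process for debugging staged inputs: the ls -al output
// ends up in .command.out and is included in the task error report
process DEBUG_STAGING {
    input:
    path sdrf

    script:
    """
    ls -al                     # show what was actually staged into the task dir
    cat ${sdrf} > checked.tsv  # placeholder for the real command
    """
}
```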
Task 1 (the task that works):
total 20
drwx------ 2 root root 4096 Mar 25 17:26 .
drwxrwxrwt 1 root root 4096 Mar 25 17:26 ..
-rw-r--r-- 1 root root 0 Mar 25 17:26 .command.err
-rw-r--r-- 1 root root 0 Mar 25 17:26 .command.out
lrwxrwxrwx 1 root root 87 Mar 25 17:26 .command.run -> /mnt/disks/[private_bucket]/nextflow/work/7a/2c6aede9e1e5e94c59563adce61992/.command.run
lrwxrwxrwx 1 root root 86 Mar 25 17:26 .command.sh -> /mnt/disks/[private_bucket]/nextflow/work/7a/2c6aede9e1e5e94c59563adce61992/.command.sh
-rw-r--r-- 1 root root 0 Mar 25 17:26 .command.trace
lrwxrwxrwx 1 root root 93 Mar 25 17:26 PXD026600.sdrf.tsv -> /mnt/disks/[private_bucket]/[input_path]/PXD026600.sdrf.tsv
Task 2 (the task that crashes):
total 20
drwx------ 2 root root 4096 Mar 25 17:28 .
drwxrwxrwt 1 root root 4096 Mar 25 17:28 ..
-rw-r--r-- 1 root root 0 Mar 25 17:28 .command.err
-rw-r--r-- 1 root root 0 Mar 25 17:28 .command.out
lrwxrwxrwx 1 root root 87 Mar 25 17:28 .command.run -> /mnt/disks/[private_bucket]/nextflow/work/ca/919bfbc47b8a94c479636857ed9b60/.command.run
lrwxrwxrwx 1 root root 86 Mar 25 17:28 .command.sh -> /mnt/disks/[private_bucket]/nextflow/work/ca/919bfbc47b8a94c479636857ed9b60/.command.sh
-rw-r--r-- 1 root root 0 Mar 25 17:28 .command.trace
lrwxrwxrwx 1 root root 93 Mar 25 17:28 PXD026600.sdrf.tsv -> /mnt/disks/[private_bucket]/nextflow/work/7a/2c6aede9e1e5e94c59563adce61992/PXD026600.sdrf.tsv
How is the [input_path] different in the first example?
It is just a couple of subfolders on GCS. I double-checked that it is actually the correct path. In other words, in Task 1 it points to the file on GCS; in Task 2 it points to the symlink from Task 1.
Small correction: I never pushed the stageInMode = 'link' change, so the fix that worked for me was stageInMode = 'copy'.
@hnawar @soj-hub Do either of you know anything about symlinks not working with gcsfuse? Could there be a regression?
@bentsherman - I'm not aware of a regression. This seems like something the GCS team should chime in on, so we'll try and loop them in.
We've confirmed that symlinks still work. Does this issue persist and can exact steps to reproduce the issue be shared?
Sorry, I have been very busy this week.
Like I said in the OP, I am using nf-core/quantms. If you run it with the test.config, the input file (PXD026600.sdrf.tsv) is remotely hosted (not on GCS) and tasks 1 and 2 work just fine.
If you instead use test.config and override the input file with a copy saved on GCS, task 1 completes just fine, but task 2 fails.
I believe what is happening is that the output of task 1 is an unmodified version of PXD026600.sdrf.tsv, so in task 2 the symlink just points to the symlink from task 1.
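The symlink chain described above can be sketched with plain shell (hypothetical local paths standing in for the bucket mount). On a local filesystem the second hop resolves fine; on gcsfuse it apparently does not, which would produce the FileNotFoundError shown earlier:

```shell
# Reproduce the staging pattern: task 2's input is a symlink to task 1's
# symlink, which in turn points at the original file on the mount.
tmp=$(mktemp -d)
mkdir -p "$tmp/bucket" "$tmp/task1" "$tmp/task2"
echo "sample data" > "$tmp/bucket/PXD026600.sdrf.tsv"

# Task 1 stages the input directly from the bucket path
ln -s "$tmp/bucket/PXD026600.sdrf.tsv" "$tmp/task1/PXD026600.sdrf.tsv"

# Task 2 stages task 1's "output", which is itself a symlink
ln -s "$tmp/task1/PXD026600.sdrf.tsv" "$tmp/task2/PXD026600.sdrf.tsv"

# Locally this resolves the full chain and prints the file contents
cat "$tmp/task2/PXD026600.sdrf.tsv"
```

With stageInMode = 'copy' on task 1, the middle link becomes a real file, so task 2's symlink has only one hop to resolve.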
From nf-core/quantms task 1 (INPUT_CHECK):
input:
path input_file
val is_sdrf
output:
path "*.log", emit: log
path "${input_file}", emit: checked_file
path "versions.yml", emit: versions
From nf-core/quantms/quantms.nf:
INPUT_CHECK (
file(params.input)
)
ch_versions = ch_versions.mix(INPUT_CHECK.out.versions)
// TODO: OPTIONAL, you can use nf-validation plugin to create an input channel from the samplesheet with Channel.fromSamplesheet("input")
// See the documentation https://nextflow-io.github.io/nf-validation/samplesheets/fromSamplesheet/
// ! There is currently no tooling to help you write a sample sheet schema
//
// SUBWORKFLOW: Create input channel
//
CREATE_INPUT_CHANNEL (
INPUT_CHECK.out.ch_input_file,
INPUT_CHECK.out.is_sdrf
)
I'm also seeing this problem in GCS when a process depends on a file from a previous process which in turn made a symlink to the file on a mounted drive. The workaround of setting stageInMode to copy on the earlier process worked as a fix for me, but it would be nice not to have to do this.
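That per-process workaround can be expressed with a withName selector in nextflow.config; this is a sketch using the INPUT_CHECK process name from the quantms example above (substitute your own process name):

```groovy
// nextflow.config — only the upstream process that passes a file through
// needs to copy, so downstream tasks symlink to a real file rather than
// to another symlink
process {
    withName: 'INPUT_CHECK' {
        stageInMode = 'copy'
    }
}
```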
Also happening for me, just as @archmageirvine describes. I will try stageInMode = 'copy'.
I tried stageInMode = 'copy' and it worked for me as well. Another workaround that doesn't require copying all the data is to reintroduce the GCS paths in each process that uses them, instead of piping them from one process to the next. For example, to join back in reference genome data I did this instead of using the existing reference symlinks from the previous process:
// Adding reference sequences again for Nextflow GCS symlink bug
ch_ref_by_genus = channel.of(
["Arabidopsis", "5", "arabidopsis_genome_id", "gs://genome/path/arabidopsis_genome_id.fasta"],
["Solanum", "12", "solanum_genome_id", "gs://genome/path/solanum_genome_id.fasta"]
)
.map {
tuple(
it[0], it[3], it[3] + ".amb", it[3] + ".ann", it[3] + ".bwt", it[3] + ".pac", it[3] + ".sa", it[3] + ".fai"
)
}
ch_bwa_mem_with_ref = ch_bwa_mem.combine(ch_ref_by_genus, by: 0)
// Running modules that need the ref (and can't use the ref from the first module because of the bug)
ch_bedtools_coverage = run_bedtools_coverage(ch_bwa_mem_with_ref)