fetchngs
Wrong use of --progress in the prefetch command leads to pipeline failure
Description of the bug
I am currently trying to retrieve 15k samples from SRA. Since FTP download was painfully slow due to an apparent connection limit, I fell back to using the sra-toolkit via --force_sratools_download. However, this led to an immediate failure of the pipeline at the prefetch process stage: it simply produced retries and ultimately aborted execution once the retry limit was hit. After a bit of digging through the pipeline's code I found that the prefetch process uses the following command
prefetch --progress <sra_accession>
Although --progress is a valid command line parameter, its use in this case is wrong. --progress expects an integer value denoting the time scale to use for displaying the progress of the download. Since no such value is given in this command, the flag consumes the actual accession, which is then no longer available for the tool to fetch the data, leading to failure. Here is the relevant part of the help message
-p|--progress <value> time period in minutes to display download
progress (0: no progress), default: 1
I would thus recommend either removing the --progress argument or adding a value.
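For illustration, either of the following avoids the accession being consumed (a sketch based on the v2.9.x help shown above; note that v2.11 drops the value entirely, as the output further down in this thread shows):

# remove the flag entirely
prefetch DRR078852

# or supply the integer --progress expects in v2.9.x (minutes between progress updates)
prefetch --progress 1 DRR078852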
Command used and terminal output
I ran the pipeline with `--force_sratools_download` and a simple modification where I added `echo $output` to the `retry_with_backoff.sh` script in the `retry_with_backoff` function to see what was wrong, and got the following output in each of the .command.log files
prefetch --progress DRR078852
Usage: prefetch [options] <SRA accession | kart file> [...] Download SRA or dbGaP files and their dependencies prefetch [options] <SRA file> [...] Check SRA file for missed dependencies and download them prefetch --list <kart file> [...] List the content of a kart file
Failed attempt 1 of 5. Retrying in 1 s.
prefetch --progress DRR078852
Usage: prefetch [options] <SRA accession | kart file> [...] Download SRA or dbGaP files and their dependencies prefetch [options] <SRA file> [...] Check SRA file for missed dependencies and download them prefetch --list <kart file> [...] List the content of a kart file
Failed attempt 2 of 5. Retrying in 2 s.
prefetch --progress DRR078852
Usage: prefetch [options] <SRA accession | kart file> [...] Download SRA or dbGaP files and their dependencies prefetch [options] <SRA file> [...] Check SRA file for missed dependencies and download them prefetch --list <kart file> [...] List the content of a kart file
Failed attempt 3 of 5. Retrying in 4 s.
prefetch --progress DRR078852
Usage: prefetch [options] <SRA accession | kart file> [...] Download SRA or dbGaP files and their dependencies prefetch [options] <SRA file> [...] Check SRA file for missed dependencies and download them prefetch --list <kart file> [...] List the content of a kart file
Failed attempt 4 of 5. Retrying in 8 s.
prefetch --progress DRR078852
Usage: prefetch [options] <SRA accession | kart file> [...] Download SRA or dbGaP files and their dependencies prefetch [options] <SRA file> [...] Check SRA file for missed dependencies and download them prefetch --list <kart file> [...] List the content of a kart file
Failed after 5 attempts.
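For context, the doubling delays (1 s, 2 s, 4 s, 8 s) come from the pipeline's retry wrapper. Below is a minimal sketch of what such an exponential-backoff wrapper might look like; it is illustrative only, not the pipeline's actual `retry_with_backoff.sh`, and the names and defaults are assumptions:

#!/usr/bin/env bash
# Illustrative sketch: run a command up to a fixed number of attempts,
# doubling the sleep between failures (defaults are assumptions).
retry_with_backoff() {
    local max_attempts=5
    local delay=1
    local attempt=1
    local output
    while true; do
        if output=$("$@" 2>&1); then
            echo "$output"
            return 0
        fi
        echo "$output"
        if (( attempt >= max_attempts )); then
            echo "Failed after ${max_attempts} attempts." >&2
            return 1
        fi
        echo "Failed attempt ${attempt} of ${max_attempts}. Retrying in ${delay} s." >&2
        sleep "$delay"
        delay=$(( delay * 2 ))
        attempt=$(( attempt + 1 ))
    done
}

retry_with_backoff "$@"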
Relevant files
No response
System information
No response
Thanks for the report @dmalzl, will add a fix immediately. I will just remove the flag, since most of the time nobody follows the log output anyway.
On the same issue, there seems to be a problem with the configuration of the sra-toolkit, as I get this message after resolving the command line parameter issue
This sra toolkit installation has not been configured.
Before continuing, please run: vdb-config --interactive
For more information, see https://www.ncbi.nlm.nih.gov/sra/docs/sra-cloud/
after which it retries and fails
I had a look and I don't see the same thing when running prefetch from the container. What version of sra-tools are you running?
Output that I see:
prefetch --help
-p|--progress Show progress
"prefetch" version 2.11.0
I'm happy to remove the hardcoded option, though, since it doesn't really serve a purpose.
On the configuration: there is code in the module that generates a configuration for you. Why is that not working for you? Please open a separate issue on that or join us on Slack to discuss.
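For reference, a minimal non-interactive setup would look something like the sketch below. This is an assumption based on the VDB user-settings format and the two keys mentioned later in this thread, not the module's exact implementation:

# Hypothetical sketch: write a minimal NCBI VDB user-settings file so prefetch
# does not demand `vdb-config --interactive`. File location and keys are
# assumptions taken from this thread.
mkdir -p "$HOME/.ncbi"
cat > "$HOME/.ncbi/user-settings.mkfg" <<EOF
/LIBS/GUID = "$(uuidgen)"
/libs/cloud/report_instance_identity = "true"
EOF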
Ah sorry, this is my fault. I had the configuration problem before and therefore used the preinstalled module I have on our cluster, which is v2.9.6.1. I didn't anticipate the version difference, so I think this part was just a problem on my side. However, the problem with the configuration is unchanged. It now runs with my version, but I am really curious why the configuration is not working for me. Would you suggest opening another issue here or going directly to Slack? Which one would be better for discussion?
I think Slack will be easier, if you don't mind; there's a fetchngs channel.
Sorry for being annoying, but I joined and can't find the fetchngs channel. I only see a pipelines channel; is this the one you were referring to?
No worries, found it.
Last issue connected to this, I promise, because I fixed it this way. After fixing my NCBI config with the two missing lines
/LIBS/GUID = "bd94a196-a984-494f-909e-14e1ffb250e4"
/libs/cloud/report_instance_identity = "true"
I retried and it worked. To reset the pipeline to the state it was in before, I reverted all my changes and tried again, just to be sure the issue wasn't fixed only because of the voodoo I had worked on the code. Unfortunately, that turned out to be the case: prefetch does not write the file to the cwd but rather to some other directory specified in the NCBI config. Simply adding -o ./$id to the command fixed this. So the complete change to the module should be
retry_with_backoff.sh prefetch \\
    $args \\
    $id \\
    -o ./$id
This will also prevent others from running into the same issues I experienced today.
That said, I don't know whether prepending $id with ./ is necessary; I had to do this with v2.9.6.1, but v2.11 does not complain about it.
Since this seems specific to your case, i.e., the content of your NCBI configuration, I suggest you make use of ext.args in a local configuration:
process {
    withName: SRATOOLS_PREFETCH {
        ext.args = { "-o ./$id" }
    }
}
Ah that's clever. Thanks for the suggestion
I have identified yet another possible problem. Using -o ./$id forces the pipeline, at least in my case, to save the data into the cwd of the current process. However, the file is then named $id, which is subsequently passed on to fasterq-dump. But passing just a plain SRA id to fasterq-dump will result in fasterq-dump downloading the SRA file again instead of processing the already existing one. I tried suffixing the additional arg with .sra, but then the pipeline tells me it could not find the expected output, since the output it expects from prefetch is a file named just with an SRA id.
I fixed this by changing the output to path("${id}.sra"), which seems to work fine until the process completes, where it starts to throw an error, but I didn't have time to look into this yet. In any case, I think this should be changed to incorporate the -o argument and force the output to have the .sra suffix in order to avoid futile downloads and data accumulation.
The default behavior is that prefetch creates a directory which contains the SRA file, so $id/$id.sra if you will. This directory is then supplied to fasterq-dump, which will look for such a directory. I hadn't used the -o flag before and didn't look at it specifically.
I suggest you use either -O . / --output-directory . to force output into the current directory, or -o $id/$id.sra / --output-file $id/$id.sra. Then everything should work as expected.
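For illustration, the two alternatives would look roughly like this (accession taken from the log above; the layout follows the default behavior described in the previous comment):

# Option 1: keep prefetch's default $id/$id.sra layout, but force it into the cwd
prefetch -O . DRR078852            # yields ./DRR078852/DRR078852.sra

# Option 2: name the output file explicitly, preserving the layout fasterq-dump expects
prefetch -o DRR078852/DRR078852.sra DRR078852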
Okay, again something I didn't know. I have now just deleted my NCBI config file to let the pipeline take care of it. Hope that settles it now. Thanks
I'm working on some improvements for the pipeline based on your troubles but it will still take a while until they are released.
No worries. Since I was not really dependent on the configuration in my file, simply deleting it seems to have done the trick for now. But I guess at least some future users will profit from my experiences. Thanks for taking care of it.
Looks like this has been resolved. Will close for now but feel free to re-open if the problem persists.