fetchngs Deal with accessions with non-existing files

Description of feature

Hi,

In SRA, some run accessions have no associated files. For example, BioProject PRJEB18755 has several runs that are total ghosts (ERR2013571, ERR2013572, ERR2013573, ...), while others are fine. When these ghost accessions are provided in the input, the pipeline first retries:

[60/81e7b9] NOTE: Process `NFCORE_FETCHNGS:SRA:SRA_IDS_TO_RUNINFO (ERR2013613)` failed -- Execution is retried (2)

and then terminate with errors:

Command error:
  [ERROR] There is no content for id ERR2013581. Maybe you lack the right permissions?

Of course, one option is to filter out these entries before feeding them to the pipeline, but it would be great if these errors could be ignored. Or maybe there is already an option for this that I am missing? Thanks for any info on that; it would be extremely helpful to be able to deal with this easily!

Mar 24 '23 07:03 bmlab-sg

If you just want to ignore the errors, you can create a local Nextflow configuration file and pass it to the run with `-c`:

process {
  withName: SRA_IDS_TO_RUNINFO {
    errorStrategy = 'ignore'
  }
}

Mar 24 '23 09:03 Midnighter

Did this solution work for you @bmlab-sg ? We could try to incorporate ignoring these sorts of ids via the pipeline but we would need some sort of way to detect this via the metadata or otherwise.

Apr 25 '23 10:04 drpatelh

@drpatelh - yes, that solution mostly solves this issue. After looking at a few datasets, it seems that AvgSpotLen and/or Bases being >0 can be a good filtering marker for these ghosts.

Apr 26 '23 03:04 bmlab-sg

Cool. Thanks for the update. We can check whether these metadata fields are exposed; if they are, we can add conditional filtering to the pipeline so it doesn't hard fail in these scenarios.

Apr 26 '23 11:04 drpatelh

I am unable to reproduce this issue anymore. This could be due to the changes made to the ENA API recently as fixed in https://github.com/nf-core/fetchngs/pull/148

I am now getting [ERROR] No matches found for database id ERR2013613! and we are unable to retrieve any metadata via the API URL below which means we can't explicitly filter by Bases or otherwise: https://www.ebi.ac.uk/ena/portal/api/filereport?accession=ERR2013613&result=read_run&fields=run_accession%2Cexperiment_accession
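
For anyone who wants to pre-check ids against the same endpoint, here is a minimal sketch. It builds the filereport URL quoted above and decides whether a response contains any data rows. The function names are hypothetical, and treating an empty or header-only body (or an HTTP error) as "no metadata" is an assumption about how the API answers for these accessions, not confirmed behaviour.

```python
import urllib.error
import urllib.parse
import urllib.request

ENA_FILEREPORT = "https://www.ebi.ac.uk/ena/portal/api/filereport"


def filereport_url(accession):
    """Build the filereport URL with the same query as the one quoted above."""
    query = urllib.parse.urlencode({
        "accession": accession,
        "result": "read_run",
        "fields": "run_accession,experiment_accession",
    })
    return f"{ENA_FILEREPORT}?{query}"


def has_metadata(tsv_text):
    """True if the TSV text has at least one row besides the header.

    Assumption: a missing accession yields an empty or header-only body.
    """
    lines = [line for line in tsv_text.splitlines() if line.strip()]
    return len(lines) > 1


def fetch_filereport(accession, timeout=30):
    """Fetch the report text; HTTP errors are mapped to an empty string."""
    try:
        with urllib.request.urlopen(filereport_url(accession), timeout=timeout) as resp:
            return resp.read().decode("utf-8")
    except urllib.error.HTTPError:
        return ""


if __name__ == "__main__":
    # Requires network access.
    for acc in ["ERR2013613", "ERR2013581"]:
        print(acc, has_metadata(fetch_filereport(acc)))
```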

`ERR2013613`

ERROR ~ Error executing process > 'NFCORE_FETCHNGS:SRA:SRA_IDS_TO_RUNINFO (ERR2013613)'

Caused by:
  Process `NFCORE_FETCHNGS:SRA:SRA_IDS_TO_RUNINFO (ERR2013613)` terminated with an error exit status (1)

Command executed:

  echo ERR2013613 > id.txt
  sra_ids_to_runinfo.py \
      id.txt \
      ERR2013613.runinfo.tsv \
  
  
  cat <<-END_VERSIONS > versions.yml
  "NFCORE_FETCHNGS:SRA:SRA_IDS_TO_RUNINFO":
      python: $(python --version | sed 's/Python //g')
  END_VERSIONS

Command exit status:
  1

Command output:
  (empty)

Command error:
  [ERROR] No matches found for database id ERR2013613!
  Line: 'ERR2013613'

`ERR2013581`

ERROR ~ Error executing process > 'NFCORE_FETCHNGS:SRA:SRA_IDS_TO_RUNINFO (ERR2013581)'

Caused by:
  Process `NFCORE_FETCHNGS:SRA:SRA_IDS_TO_RUNINFO (ERR2013581)` terminated with an error exit status (1)

Command executed:

  echo ERR2013581 > id.txt
  sra_ids_to_runinfo.py \
      id.txt \
      ERR2013581.runinfo.tsv \
  
  
  cat <<-END_VERSIONS > versions.yml
  "NFCORE_FETCHNGS:SRA:SRA_IDS_TO_RUNINFO":
      python: $(python --version | sed 's/Python //g')
  END_VERSIONS

Command exit status:
  1

Command output:
  (empty)

Command error:
  [ERROR] No matches found for database id ERR2013581!
  Line: 'ERR2013581'

I'll close this issue for now, but please feel free to re-open it if you encounter the problem again, and provide the appropriate ids we can use to fix it.

May 06 '23 14:05 drpatelh

Hello @drpatelh, I recently encountered this issue while working on PRJNA1079722. Multiple runs in this project (SRR29688921, SRR29688964, SRR29688955, SRR29688939, SRR29688945, SRR29688933) seem to cause the same error. However, these don't seem to be "ghosts" as you found previously: each of these runs seems to host data for the associated sample. Below is the error for one of them:

ERROR ~ Error executing process > 'NFCORE_FETCHNGS:SRA:SRA_IDS_TO_RUNINFO (SRR29688955)'

Caused by:
  Process `NFCORE_FETCHNGS:SRA:SRA_IDS_TO_RUNINFO (SRR29688955)` terminated with an error exit status (1)

Command executed:

  echo SRR29688955 > id.txt
  sra_ids_to_runinfo.py \
      id.txt \
      SRR29688955.runinfo.tsv \

  cat <<-END_VERSIONS > versions.yml
  "NFCORE_FETCHNGS:SRA:SRA_IDS_TO_RUNINFO":
      python: $(python --version | sed 's/Python //g')
  END_VERSIONS

Command exit status:
  1

Command output:
  (empty)

Command error:
  [ERROR] No matches found for database id SRR29688955!
  Line: 'SRR29688955'

Jul 19 '24 07:07 rohitrrj

I have encountered this same issue for the dataset PRJNA898600. I have also tried running just one sample from the project in multiple ways (different identifiers: SRR22198886, SRS15675991, SRX18177158), and tried running the pipeline with both ftp and sratools for --download_method.

The only variation I find is that when I run with the ftp method, it technically completes the SRA_IDS_TO_RUNINFO process and fails at SRA_RUNINFO_TO_FTP instead. It still seems to have the underlying issue of not finding the dataset: when exploring the "/work/" directory, the .runinfo.tsv is empty regardless of how I run the pipeline.
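
In case it helps others check the same thing, a small sketch for listing empty .runinfo.tsv files under a Nextflow work directory could look like this. The function name and the example path are hypothetical. Because Nextflow stages inputs as symlinks, the same empty file may be reported once per task directory that references it.

```python
from pathlib import Path


def empty_runinfo_files(work_dir):
    """Return sorted paths of zero-byte *.runinfo.tsv files under work_dir.

    Symlinks are followed, so a staged input pointing at an empty
    upstream output is reported as well.
    """
    return sorted(
        p for p in Path(work_dir).rglob("*.runinfo.tsv")
        if p.is_file() and p.stat().st_size == 0
    )


if __name__ == "__main__":
    # "work" is a placeholder; point this at the actual work directory.
    for path in empty_runinfo_files("work"):
        print(path)
```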

If it's helpful, here is the slight variation that I get with the FTP download method:

Error executing process > 'NFCORE_FETCHNGS:SRA:SRA_RUNINFO_TO_FTP (1)'

Caused by:
  Missing output file(s) `*.tsv` expected by process `NFCORE_FETCHNGS:SRA:SRA_RUNINFO_TO_FTP (1)` (note: input files are not included in the default matching set)


Command executed:

  sra_runinfo_to_ftp.py \
      SRX18177158.runinfo.tsv \
      SRX18177158.runinfo_ftp.tsv
  
  cat <<-END_VERSIONS > versions.yml
  "NFCORE_FETCHNGS:SRA:SRA_RUNINFO_TO_FTP":
      python: $(python --version | sed 's/Python //g')
  END_VERSIONS

Command exit status:
  0

Command output:
  (empty)

Work dir:
  /data/user/lizzyr/setbp1_hd/src/work/5d/eb2b5e5c24efac0162b2aad382a315

Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out`

This is with nf-core/fetchngs v1.12.0 and Nextflow version 24.04.3.

Jan 15 '25 20:01 lizzyjoan