rnaseq

salmon fails to create index if reference fasta file contains comments in the header

Open paoloAngelino opened this issue 2 years ago • 0 comments

Description of the bug

A fasta header can contain comments together with the name of the contig. Example: >HLA-DRB1*16:02:01 HLA00878

The corresponding line in the decoy.txt file to be passed to salmon index would be HLA-DRB1*16:02:01 HLA00878

The problem is that salmon interprets the comment as the name of an extra decoy while creating the index, and it stops with an error. In my case:

[2022-09-05 14:58:36.028] [puff::index::jointLog] [critical] The decoy file contained the names of 3892 decoy sequences, but 3367 were matched by sequences in the reference file provided. To prevent unintentional errors downstream, please ensure that the decoy file exactly matches with the fasta file that is being indexed.
[2022-09-05 14:58:36.424] [puff::index::jointLog] [error] The fixFasta phase failed with exit code 1

An additional cleaning step in rnaseq/modules/nf-core/modules/salmon/index/main.nf would fix the issue. What I propose is to replace line 31: sed -i.bak -e 's/>//g' decoys.txt with

mv decoys.txt decoys.txt.bak
awk '{print $1}' decoys.txt.bak > decoys.txt
sed -i -e 's/>//g' decoys.txt
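
For illustration, here is a minimal, self-contained sketch of the same cleaning idea on a toy input. The file name `genome.fa` and its two records are made up for the example, and the pipeline below is only one possible way to write the step, not necessarily what the module should contain: it keeps the header lines, strips the leading `>`, and keeps only the first whitespace-delimited field so that any comment is dropped.

```shell
# Hypothetical two-record FASTA; the first header carries a comment after the contig name.
printf '>HLA-DRB1*16:02:01 HLA00878\nACGT\n>chr1\nTTGA\n' > genome.fa

# Select header lines, remove the leading '>', and keep only the first field
# (the contig name), discarding anything after the first whitespace.
grep '^>' genome.fa | sed -e 's/^>//' | awk '{print $1}' > decoys.txt

# Show the resulting decoy names, one per line.
cat decoys.txt
```

With this input, `decoys.txt` should contain one name per contig (`HLA-DRB1*16:02:01` and `chr1`), so the number of decoy names matches the number of sequences in the FASTA file.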

Command used and terminal output

No response

Relevant files

No response

System information

No response

paoloAngelino, Sep 30 '22 12:09