rnaseq
salmon fails to create index if reference fasta file contains comments in the header
Description of the bug
A fasta header can contain comments together with the name of the contig. Example:
>HLA-DRB1*16:02:01 HLA00878
The corresponding line in the decoys.txt file to be passed to salmon index would be
HLA-DRB1*16:02:01 HLA00878
The problem is that, while creating the index, salmon interprets the comment as an extra decoy and stops with an error. In my case:
[2022-09-05 14:58:36.028] [puff::index::jointLog] [critical] The decoy file contained the names of 3892 decoy sequences, but 3367 were matched by sequences in the reference file provided. To prevent unintentional errors downstream, please ensure that the decoy file exactly matches with the fasta file that is being indexed.
[2022-09-05 14:58:36.424] [puff::index::jointLog] [error] The fixFasta phase failed with exit code 1
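A quick way to check whether a decoy list is affected is to compare its line count with its count of whitespace-separated tokens. The sketch below uses a toy decoys.txt in a scratch directory (the contents and the `scratch`, `n_lines`, `n_tokens`, and `offending` names are made up for illustration), and it assumes that salmon counts every whitespace-separated token as a decoy name, which is what the error above suggests:

```shell
# Toy decoy list (hypothetical contents) in a scratch directory.
scratch="$(mktemp -d)"
printf 'HLA-DRB1*16:02:01 HLA00878\nchr1\n' > "$scratch/decoys.txt"

# Lines vs. whitespace-separated tokens: a mismatch means some lines carry comments.
n_lines="$(awk 'END {print NR}' "$scratch/decoys.txt")"
n_tokens="$(awk '{n += NF} END {print n + 0}' "$scratch/decoys.txt")"
echo "lines=$n_lines tokens=$n_tokens"

# Show the offending lines (those with more than one field).
offending="$(awk 'NF > 1' "$scratch/decoys.txt")"
echo "$offending"
```

If the two counts differ, the lines printed at the end are the ones whose comments would be read as extra decoy names.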
An additional cleaning step in rnaseq/modules/nf-core/modules/salmon/index/main.nf would fix the issue. What I propose is to replace line 31:
sed -i.bak -e 's/>//g' decoys.txt
with
mv decoys.txt decoys.txt.bak
awk '{print $1}' decoys.txt.bak > decoys.txt
sed -i -e 's/>//g' decoys.txt
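To illustrate the effect of the proposed cleaning step, here is a minimal, self-contained sketch on a toy FASTA file. The file genome.fa, its contents, and the `grep` used to collect the headers are assumptions for the demo, not copied from the module; only the awk/sed cleaning mirrors the proposal above.

```shell
# Work in a scratch directory so nothing real is overwritten.
workdir="$(mktemp -d)"
cd "$workdir"

# Toy reference (hypothetical contents): one header with a comment, one without.
printf '>HLA-DRB1*16:02:01 HLA00878\nACGT\n>chr1\nGGCC\n' > genome.fa

# Collect the header lines as the starting point for the decoy list.
grep '^>' genome.fa > decoys.txt

# Keep only the first whitespace-delimited field, then strip the leading '>'.
mv decoys.txt decoys.txt.bak
awk '{print $1}' decoys.txt.bak > decoys.txt
sed -i.tmp -e 's/^>//' decoys.txt   # -i with an attached suffix works on GNU and BSD sed
rm -f decoys.txt.tmp

cat decoys.txt
```

After this step each line of decoys.txt holds a single name with no comment, so the number of decoy names equals the number of header lines in the FASTA.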
Command used and terminal output
No response
Relevant files
No response
System information
No response