Mismatch between segment sequence and contig sequence
Description of bug
From contigs.path, I have: NODE_10581_length_541_cov_2.942953 503304+
From contigs.fasta, I have:
NODE_10581_length_541_cov_2.942953 TAATATGAGTACCGCTCTTTATTTTGTAGTAAACGATGCTTTTTATCGTAAGGCTTTGCC GAAGGCACAATTGCCGGAAGGAGTGTTGCTTGGCAGCTTGAAAGAGCTGTCTGAACAATA TCCGGCTTTGGTCAAGCAGTATTATGGCAAGTTGGCAGATACTTCCAAGGATGGGGTGAC CGCCTTCAATAATACTTTTGCCCAGGATGGCTTTATGCTGTATGTGCCGAAAGGCGTGGT GGTGGACAAACCCATTCAACTGGTGAACATATTGCGTGCTGATGTTAATTTTATGGTGAA CCGCCGTGTGCTGGTTGTGCTGGAAGAAGGTGCGCAGGCTCGTCTGTTGATTTGTGATCA TGCCATGGATAATGTAAATTTCCTTTCTACTCAGGTTATTGAGGTCTTTGCAAAAGAAAA TGCTACTTTCGATCTTTATGAACTGGAAGAAACCCATACCAGCACAGTGCGTTTCAGTAA CCTCTATGTGAACCAGGAGGCAGACAGTAATGTGCTTTTGAATGGTATGACTTTGCATAA C
From assembly_graph_after_simplification.gfa, I have: S 503304 TGATCTCAGCTCCACGTCCGGCCAAAGTTACTTCAGTTGTATTACGCGTAGTACCTAATATGAGTACCGCTCTTTATTTTGTAGTAAACGATGCTTTTTATCGTAAGGCTTTGCCGAAGGCACAATTGCCGGAAGGAGTGTTGCTTGGCAGCTTGAAAGAGCTGTCTGAACAATATCCGGCTTTGGTCAAGCAGTATTATGGCAAGTTGGCAGATACTTCCAAGGATGGGGTGACCGCCTTCAATAATACTTTTGCCCAGGATGGCTTTATGCTGTATGTGCCGAAAGGCGTGGTGGTGGACAAACCCATTCAACTGGTGAACATATTGCGTGCTGATGTTAATTTTATGGTGAACCGCCGTGTGCTGGTTGTGCTGGAAGAAGGTGCGCAGGCTCGTCTGTTGATTTGTGATCATGCCATGGATAATGTAAATTTCCTTTCTACTCAGGTTATTGAGGTCTTTGCAAAAGAAAATGCTACTTTCGATCTTTATGAACTGGAAGAAACCCATACCAGCACAGTGCGTTTCAGTAACCTCTATGTGAACCAGGAGGCAGACAGTAATGTGCTTTTGAATGGTATGACTTTGCATAACGGTACTACGCGTAATACAACTGAAGTAACTTTGGCCGGACGTGGAGCTGAGATCA DP:f:2.94295 KC:i:1754
As you can see, the first 55 bp of segment 503304 are omitted from contig 10581, which consists solely of segment 503304.
Is this a bug? This can be a significant issue for some files, affecting ~2% of the contigs (>500 bp) produced.
spades.log
params.txt
SPAdes version
SPAdes v4.0.0
Operating System
Red Hat Enterprise Linux 9.0
Python Version
Python 3.13.0
Method of SPAdes installation
conda
No errors reported in spades.log
- [x] Yes
assembly_graph_after_simplification.gfa is a de Bruijn graph, so adjacent edges overlap by k = 55 bp. However, these overlaps cannot all be emitted, as this would cause significant sequence duplication (e.g. for a typical fork A -> B, A -> C there are only two overlaps).
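To illustrate the duplication argument, here is a minimal toy sketch of such a fork. The sequences, the value K = 5, and the helper `shared_overlap` are made up for illustration; this is not SPAdes code and does not reproduce how SPAdes stores its graph.

```python
def shared_overlap(left: str, right: str, k: int) -> str:
    """Return the k-base junction if `left` ends with the first k bases of `right`."""
    if len(left) < k or len(right) < k:
        raise ValueError("both edges must be at least k bases long")
    return right[:k] if left[-k:] == right[:k] else ""


K = 5                      # toy value; the assembly in this issue used k = 55
edge_a = "ACGTTGCAAT"      # hypothetical edge A
edge_b = "GCAATCCGTA"      # hypothetical edge B (A -> B)
edge_c = "GCAATGGTTC"      # hypothetical edge C (A -> C)

overlap_ab = shared_overlap(edge_a, edge_b, K)
overlap_ac = shared_overlap(edge_a, edge_c, K)

# Writing out every edge in full repeats the junction in all three sequences.
full_total = len(edge_a) + len(edge_b) + len(edge_c)

# Keeping the junction in only one sequence removes one copy per overlap.
trimmed_total = full_total - 2 * K
```

In this toy fork the junction `GCAAT` would appear three times if all edges were written out in full, so two of the copies are redundant.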
Contig emission takes this into account and tries to resolve these overlaps, thus reducing sequence duplication. To put it simply: this 55 bp piece is not missing, it is just part of another contig (adjacent to this one).
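For anyone who wants to check how many contigs are trimmed this way, a small script along the following lines may help. It is only a sketch: `revcomp` and `locate` are hypothetical helper names, the sequences are toy data, and parsing of contigs.fasta, contigs.paths and the GFA file is left out. It assumes the contig is built from a single segment and appears in it as an exact substring on one of the two strands.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def revcomp(seq: str) -> str:
    """Reverse complement of a DNA string (A/C/G/T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def locate(contig: str, segment: str):
    """Find `contig` inside `segment` on either strand.

    Returns (strand, bases_before, bases_after), or None if the contig
    is not an exact substring of the segment or its reverse complement.
    """
    for strand, seq in (("+", segment), ("-", revcomp(segment))):
        start = seq.find(contig)
        if start != -1:
            return strand, start, len(seq) - start - len(contig)
    return None


# Toy data: a 12 bp "contig" flanked by 5 bp on each side in the "segment".
toy_contig = "ACGGTACCTTAG"
toy_segment = "TTGCA" + toy_contig + "CCATG"

forward_hit = locate(toy_contig, toy_segment)
reverse_hit = locate(revcomp(toy_contig), toy_segment)
```

Applied to the sequences in this report, a check like this should place contig 10581 on the `+` strand of segment 503304 with 55 bases before it, matching the k = 55 overlap described above.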