Mismatch between segment sequence and contig sequence

Open. Martin-lc opened this issue 5 months ago • 1 comment

Description of bug

From contigs.paths, I have: NODE_10581_length_541_cov_2.942953 503304+

From contigs.fasta, I have:

NODE_10581_length_541_cov_2.942953 TAATATGAGTACCGCTCTTTATTTTGTAGTAAACGATGCTTTTTATCGTAAGGCTTTGCC GAAGGCACAATTGCCGGAAGGAGTGTTGCTTGGCAGCTTGAAAGAGCTGTCTGAACAATA TCCGGCTTTGGTCAAGCAGTATTATGGCAAGTTGGCAGATACTTCCAAGGATGGGGTGAC CGCCTTCAATAATACTTTTGCCCAGGATGGCTTTATGCTGTATGTGCCGAAAGGCGTGGT GGTGGACAAACCCATTCAACTGGTGAACATATTGCGTGCTGATGTTAATTTTATGGTGAA CCGCCGTGTGCTGGTTGTGCTGGAAGAAGGTGCGCAGGCTCGTCTGTTGATTTGTGATCA TGCCATGGATAATGTAAATTTCCTTTCTACTCAGGTTATTGAGGTCTTTGCAAAAGAAAA TGCTACTTTCGATCTTTATGAACTGGAAGAAACCCATACCAGCACAGTGCGTTTCAGTAA CCTCTATGTGAACCAGGAGGCAGACAGTAATGTGCTTTTGAATGGTATGACTTTGCATAA C

From assembly_graph_after_simplification.gfa, I have: S 503304 TGATCTCAGCTCCACGTCCGGCCAAAGTTACTTCAGTTGTATTACGCGTAGTACCTAATATGAGTACCGCTCTTTATTTTGTAGTAAACGATGCTTTTTATCGTAAGGCTTTGCCGAAGGCACAATTGCCGGAAGGAGTGTTGCTTGGCAGCTTGAAAGAGCTGTCTGAACAATATCCGGCTTTGGTCAAGCAGTATTATGGCAAGTTGGCAGATACTTCCAAGGATGGGGTGACCGCCTTCAATAATACTTTTGCCCAGGATGGCTTTATGCTGTATGTGCCGAAAGGCGTGGTGGTGGACAAACCCATTCAACTGGTGAACATATTGCGTGCTGATGTTAATTTTATGGTGAACCGCCGTGTGCTGGTTGTGCTGGAAGAAGGTGCGCAGGCTCGTCTGTTGATTTGTGATCATGCCATGGATAATGTAAATTTCCTTTCTACTCAGGTTATTGAGGTCTTTGCAAAAGAAAATGCTACTTTCGATCTTTATGAACTGGAAGAAACCCATACCAGCACAGTGCGTTTCAGTAACCTCTATGTGAACCAGGAGGCAGACAGTAATGTGCTTTTGAATGGTATGACTTTGCATAACGGTACTACGCGTAATACAACTGAAGTAACTTTGGCCGGACGTGGAGCTGAGATCA DP:f:2.94295 KC:i:1754

As you can see, the first 55 bp of segment 503304 are omitted from contig 10581, which is made solely from segment 503304.

Is this a bug? This can be a significant issue for some files, affecting ~2% of the contigs (>500 bp) produced.
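The comparison described above can be repeated for other contigs with a short script. This is only a minimal sketch: `locate_contig` and `reverse_complement` are hypothetical helper names, the sequences are made-up toy strings rather than the ones from this issue, and it assumes the contig comes from a single segment, as in the `contigs.paths` entry shown.

```python
# Sketch: find where a contig sits inside a single GFA segment and
# report how many bases of the segment are absent at each end.
# Helper names and toy sequences are illustrative assumptions.

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def locate_contig(segment: str, contig: str):
    """Return (orientation, left_trim, right_trim), or None if not found.

    orientation is '+' if the contig matches the segment as written and
    '-' if its reverse complement matches.
    """
    for orientation, candidate in (("+", contig), ("-", reverse_complement(contig))):
        start = segment.find(candidate)
        if start != -1:
            right_trim = len(segment) - start - len(candidate)
            return orientation, start, right_trim
    return None


# Toy example: 4 bases are missing on each side of the contig.
toy_segment = "AAAC" + "GGGTTTACGT" + "CCAT"
toy_contig = "GGGTTTACGT"
print(locate_contig(toy_segment, toy_contig))
```

A trim equal to k on either side would match the observation reported here (55 bp with k=55).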

spades.log

params.txt

SPAdes version

SPAdes v4.0.0

Operating System

Red Hat Enterprise Linux 9.0

Python Version

Python 3.13.0

Method of SPAdes installation

conda

No errors reported in spades.log

  • [x] Yes

Martin-lc, Jul 09 '25 19:07

assembly_graph_after_simplification.gfa is a de Bruijn graph, so adjacent edges overlap by k=55 bp. However, you cannot emit all of these overlaps, as this would cause significant sequence duplication (e.g. a typical fork A -> B, A -> C already has two overlaps).

Contig emission takes this into account and tries to resolve these overlaps, thus reducing sequence duplication. To put it simply: this 55 bp piece is not missing, it is just part of another contig (adjacent to this one).
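The duplication argument can be illustrated with a toy fork. This is a sketch under stated assumptions, not SPAdes's actual contig emission logic (which is not described in detail here): the edge sequences are invented, `K = 4` stands in for k=55, and the rule of keeping the overlap only in the upstream edge is an arbitrary choice made for the example.

```python
# Toy fork A -> B, A -> C in a de Bruijn graph whose edges overlap by K bases.
# All sequences and the trimming rule are illustrative assumptions.

K = 4  # toy overlap length; the graph discussed above uses k=55

edge_a = "TTGACCGTA"        # ends with the shared overlap "CGTA"
edge_b = "CGTA" + "GGAT"    # starts with the same overlap
edge_c = "CGTA" + "TTCC"    # starts with the same overlap

overlap = edge_a[-K:]
# In a de Bruijn graph, the end of A equals the start of each successor.
assert edge_b[:K] == overlap and edge_c[:K] == overlap


def count_occurrences(sequences, kmer):
    """Count how many times kmer appears across all emitted sequences."""
    return sum(seq.count(kmer) for seq in sequences)


# Emitting every edge in full repeats the overlap once per edge.
emitted_full = [edge_a, edge_b, edge_c]

# Keeping the overlap only in A and trimming it from B and C removes the repeats.
emitted_trimmed = [edge_a, edge_b[K:], edge_c[K:]]

print(count_occurrences(emitted_full, overlap))
print(count_occurrences(emitted_trimmed, overlap))
```

In this toy case the trimmed sequences for B and C are each K bases shorter than their graph edges, and those bases are still present in the sequence emitted for A.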

asl, Jul 09 '25 19:07