Unicycler
Unicycler copied to clipboard
sequence present in 001_best_spades_graph through 004_bridges_applied lost in 005_final_clean
Hello-
We are using Unicycler (v0.4.4, mode = normal) on a project where so far out of roughly one hundred hybrid (Illumina paired end + ONT (some samples with 1D kit, others from rapid)) bacterial (~5Mb) genomes we find that the final Unicycler assembly is missing sequence that was present in earlier Unicycler stages. This has occurred in roughly 10% of the ~100 assemblies.
Here are some of the details of one such case.
The final Unicycler assembly is a single contig but not circular:
1 length=4797718 depth=1.00x
Prior to polishing this sequence is length 4797778 at the 005_final_clean stage.
When converting the 001_best_spades_graph to fasta format and then aligning the spades sequence to the final Unicycler contig we see that one segment (43, which will call "S_43" length 7563 bases is essentially missing. Below is the segment line (excluding sequence) from the 001_best_spades_graph
S 43 LN:i:7563 dp:f:0.9080597747657686 LB:z:0.908 CL:z:forestgreen
and the link line:
L 43 + 74 + 127M
Also in the spade .gfa no path line contains segment 43 but there is one that contains segment 74:
P NODE_2 40+,63-,49+,63-,3+,74-,66- 127M,127M,127M,127M,127M,127M
The only portion of S_43 aligning to the final large contig (in 005_final_clean ) is the very end in a few hundred bases in what appears to be a small tandem repeat:
7396 7563 | 4797663 4797496 | 168 168 | 95.83 | S_43 S_1
7437 7560 | 4797778 4797655 | 124 124 | 93.55 | S_43 S_1
Going through the stages of Unicycler we see that 001_best_spades_graph S_43 is present in 002_overlaps_removed also as segment 43 (but now in 002_overlaps_removed the length is 7373 rather than 7563)
64 7436 | 1 7373 | 7373 7373 | 100.00 | S_43 S_43
And then when aligning the 001_best_spades_graph S_43 to the 003_long_read_assembly sequence it appears to span the end of segment 4 (375510 bases) and beginning of segment 3:
3 2933 | 372591 375510 | 2931 2920 | 92.01 | S_43 S_4
4407 7553 | 1 3125 | 3147 3125 | 92.96 | S_43 S_3
Then aligning 001_best_spades_graph S_43 to the 004_bridges_applied sequence we see:
64 7436 | 1 7373 | 7373 7373 | 100.00 | S_43 S_43
In the log file I do not see segment 43 showing up in the Building long read bridges / Applying bridges stage. And then it ends up in the "Removed unbridging segments" section
Segments eligible for deletion: 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117 Removed unbridging segments: 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77
So it seems that the final assembly is missing ~7.5kb and that if present this would make the chromosome circular or closer to circular. Here is another look where a very close reference from nt is found and how the ends of segment 1 (005_final_clean) aligns on each end:
1 375497 | 4218492 4593957 | 375497 375466 | 99.96 | S_1 gi_ref
4469970 4797778 | 3883064 4210867 | 327809 327804 | 99.89 | S_1 gi_ref
and then how the 001_best_spades_graph segment 43 would nearly fill in that gap in the reference (from ~4.211 to 4.218 Mb):
1 7563 | 4218429 4210867 | 7563 7563 | 97.54 | S_43 gi_ref
I am re-running this example with --keep 3 in case any intermediate files are needed to further troubleshoot.
Dependencies:
Program Version Status
spades.py 3.12.0 good
racon - good
makeblastdb 2.5.0+ good
tblastn 2.5.0+ good
bowtie2-build 2.3.4.3 good
bowtie2 2.3.4.3 good
samtools 1.9 good
java 1.8.0_92 good
pilon 1.23 good
bcftools not used
I am not sure why racon comes up as " - " as it should be v1.3.1.
I can also provide other example details (other samples / assemblies with missing sequence that was present in spades step) if needed. Thank you for any help you can give with this issue.
Updates:
In the section below segment 43 from spades was not part of the anchor contigs:
Unicycler now selects a set of anchor contigs from the single-copy contigs. These are the contigs which will be connected via bridges to form the final assembly.
41 anchor segments (4,757,501 bp) out of 77 total segments (4,796,509 bp)
Anchor segments: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41
Looking now at read alignments I do see a number of reads that appear to link all_segments 43 to two other segments (namely 7 and 3, which will ultimately become the two ends of the final contig). The following are just a few example reads from each end of segment 43
34769/278344: c6de08f7-402d-4af9-b6e7-e6be1d1a0954 (6217 bp) Ref name Ref start Ref end Read start Read end Strand Raw score Scaled score Identity 43 0 601 0 605 - 1369 90.73 88.61% 7 0 5507 717 6217 + 13157 92.20 90.32%
1125/278344: 099ede44-e5ca-4021-a1f5-028943251b20 (3345 bp) Ref name Ref start Ref end Read start Read end Strand Raw score Scaled score Identity 43 0 985 0 958 - 2208 90.91 88.93% 7 0 2239 1085 3345 + 5680 93.88 92.37%
2516/278344: 2fc811cc-45c1-423a-93c2-a30091708239 (4043 bp) Ref name Ref start Ref end Read start Read end Strand Raw score Scaled score Identity 43 0 3269 0 3328 - 6931 88.59 85.36% 7 0 583 3459 4043 + 1074 85.79 82.37%
14613/278344: a849e752-15d2-4964-9171-93323831255d (2555 bp) Ref name Ref start Ref end Read start Read end Strand Raw score Scaled score Identity 43 0 556 0 577 - 1268 90.19 87.15% 7 0 1858 700 2555 + 4795 94.68 93.69%
40080/278344: de33987e-054d-4327-98e8-3a763c6d2a38 (4442 bp) Ref name Ref start Ref end Read start Read end Strand Raw score Scaled score Identity 43 4919 7373 0 2439 + 6291 94.54 93.26% 75 0 126 2439 2572 + 289 90.28 87.50% 3 286816 288541 2699 4442 - 4453 94.54 93.41%
79948/278344: 58ad788e-abd6-405b-96d0-158c12ef47bd (4235 bp) Ref name Ref start Ref end Read start Read end Strand Raw score Scaled score Identity 43 0 1108 0 1094 - 2305 88.41 84.72% 7 0 3041 1222 4235 + 5505 85.29 80.67%
23099/278344: 9f370aee-c081-4868-b81a-534e74878c33 (1934 bp) Ref name Ref start Ref end Read start Read end Strand Raw score Scaled score Identity 43 6105 7373 0 1255 + 3015 92.16 90.11% 75 0 126 1257 1384 + 312 93.13 91.60% 75 0 126 1384 1515 + 340 95.29 93.94% 3 288129 288541 1515 1934 - 1049 93.96 92.74%
18228/278344: a7bb00d1-de20-4fd6-af6e-05d70ecc133b (1857 bp) Ref name Ref start Ref end Read start Read end Strand Raw score Scaled score Identity 43 6193 7373 0 1170 + 2723 91.32 88.92% 75 0 126 1299 1424 + 351 97.38 96.85% 3 288115 288541 1424 1857 - 969 90.65 88.64%
And here is how all_segments fasta 7, 3, and 75 will ultimately end up in final assembly:
1 221657 | 1 221657 | 221657 221657 | 100.00 | 7 1
1 126 | 4797562 4797437 | 126 126 | 97.62 | 75 1
1 124 | 4797718 4797595 | 124 124 | 93.55 | 75 1
1 288541 | 4509178 4797718 | 288541 288541 | 100.00 | 3 1
288258 288385 | 4797591 4797718 | 128 128 | 95.31 | 3 1 288414 288541 | 4797435 4797562 | 128 128 | 95.31 | 3 1
Would be curious to know if anyone else is having this issue of Unicycler dropping sequence during the assembly process. We've implemented a work around to find sequence in the spades assembly that are missing in the final assembly & manual review to add those contigs back in. This is recovering up to 55kb of sequence.
We're still seeing Unicycler dropping sequence that is precent in the spades assembly. Would be great if someone could take a look at this.
Hi,
I just met this exact problem for a bacterial assembly, with illumina reads only, but for a much longer sequence of about 110kb. It is present in every single spades assembly as either one or multiple nodes, it is not selected as anchor and ends up being removed at the 005 'Remove unbridging segments' step.
In the .gfa files for all the steps before, and including in the spades graph, it has the particularity to be only connected on one end to a single node that is not connected to any other node in the graph. It is also removed at the 'Removed unbridging segments' step.
I don't really see which unicycler option I could eventually twick to keep those sequences from being removed. If anyone has a clear solution on how to keep those sequences automatically, especially the long 110kb one, I'd gladly take it.
Adelme