racon
racon copied to clipboard
Illumina paired end reads worsens results
We try to polish a rather fragmented genome assembly with racon. So far we are testing thing on Yeast S288C, so we have a decent reference to check for performance. We use very low coverage nanopore reads to create unpolished contigs. We attempted to polish the data with 150bp paired-end Illumina-Miseq reads.
We run the following pipeline at standard options:
minimap2 -ax sr [unpolished contigs] [Illumina 1] [Illumina 2] > [sam]
cat [Illumina 1] [Illumina 2] > [Illumina Joined]
racon [Illumina Joined] [sam] [unpolished contigs] > [polished contigs]
Using this setup, we see noticeable worsening of the results. We map against the reference using Quast.
Before Racon:
After Racon:
Please notice how especially local mis-assemblies especially increase, but also translocations and relocations.
We would greatly appreciate your opinion on this effect.
Hello,
could you please verify if your paired end reads have equal names up to the first white space?
Best regards, Robert
Yes, this is the case.
File 1
[thomas@vodka experimental]$ head -n 12 /scr/k70san2/thomas/yeast_nanopore/YeastStrainsStudy/fastqs/miseq/s288c_1.fastq
@ERR1938683.1 MS5_18429:1:2113:11939:14368#1/1
GTGGTGTGTGGTGTGGTGTGTGGAGTGGTGGGGGGGTGGGGTGGGTGGGTGGGGAGTGGGGTGGTGAGGGGGTGTGGTGTGTGGGGGGGTGGGTGGGGAGGGGGGGGGGTGGGGGGGTGGGGGGGGGGGGTGGGGGGTGGGGGGGTGGGG
+
AAAAAAAA1CA1FFEEFFAEFGG101BF100A//<<[email protected]?99--.;-=;---9-9---9:-/--9---9--;--;9-9---;-;-----;----;--------9-@->-;---;----9-;----9-;9=---9-----------
@ERR1938683.2 MS5_18429:1:2106:27932:17792#1/1
GTGTGTGTGTGGGTGTGGTGTGGTGTGGGGGGTGGGGGGTGGGGGTGGGGGGGGGGGGCGGGGGGGATGTGGGGGCGGGTGGTGGTGGGGTGTGGGGGGGGGGGAGGGGGGGCGCGGGGATGCGCGAGGTGCGGTCGTGGGGTATGTTGT
+
AAAAAFFBCF>AEEEEEE0AAAE0AAE0B0//<E////9-;-A--;@--9@-9----;-9@@------:/9/--9-->-------99----9-----------;-9-------;-------9-9-9--9--/-----9----;-///9/-
@ERR1938683.3 MS5_18429:1:2104:19853:5516#1/1
CACCACACCCACACACCCACACACCACACCCACACCACACCCACACACCCACACACCACACCACACACCACACCACACCCCCACACACACATCCTAACACTACCCTAACACAGCCCTAATCTAACCCTCCCCAACCTGTCTCTCAACTTA
+
BBB@BBBBBBBBGGGGFGGGGGEGGD222EEAECE2FGA1AEFGAE?FEGGCEAEGGGEEGDHAEEEE1GHHGGH11EE?/>EE/EEA/</BFCB2222FB??FC22GH2F/CA<C2<0?F1?1G0??///<0.<<001>1F=<G0<=0D
File 2:
[thomas@vodka experimental]$ head -n 12 /scr/k70san2/thomas/yeast_nanopore/YeastStrainsStudy/fastqs/miseq/s288c_2.fastq
@ERR1938683.1 MS5_18429:1:2113:11939:14368#1/2
ACACACCACACCCACACCCACACACCACAACCACGCCCCACACCCAAAAACAGACCAACACCACGCCCGCCGAACCACTACACAAAACACAAACCCCCACACCAACACAGCACCCCAACCATCAACAACACACCCCCACCCTCCTAAACC
+
AAAAAAAAAAAAAAEECEGGGG?E0EA0F00/AF//////AE///AFF////000?0?//B//?///>/////>>/0/10010B/F00/0/<0//?/////<///////0/00?/><....1>><01....<...---..:..:/;;0/9
@ERR1938683.2 MS5_18429:1:2106:27932:17792#1/2
CACACACCACACCCAACCCCAAACACCACCAATCCCCACAGAGGCACGAAGAGCAAAAGGGCCAGACACCGCGCACACCCCAACACAAAAAAACTCAGAAACAAGGTATCGAACAAAATAATTAGCAACATCTCGCAAATACGACACCAC
+
1>AA?1AAAAAAG1110A0A000000B0B00AB00B/B/A010/B/0//A//00AB100//////0/0BF/>///>/0?/>//>F///00?///B111B1100/0/112>//00/0?011111?1011000<110>-.-000...<-.;.
@ERR1938683.3 MS5_18429:1:2104:19853:5516#1/2
GTGGTTCGGAGTGGTATGGTTGAATGGGACAGGGTAACGAGTGGAGGCAGGGTAATGGAGGGTAAGTTGAGAGACAGGTTGGCCAGGGTTAGATTAGGGCTGTGTTAGGGTAGTGTTAGGATGTGTGTGTGTGGGTGTGGTGTGGTGGGG
+
ABABAABA?ADBCFGGGFFFGEFHHHCGHHGGGGFEGFAFAEFGEAEDFEGCFEBGHFGGGE?CFGGFH5CGHGGFHG3@GGF3FGHEGH1FBGHDGFHFAFHGFFFHGFB3?BFBB?EGFDG2BFGGFCGBGGG/?/?/0???/<?/A<
Well that is probably the problem. Racon expects that all reads have unique identifiers. In your case, only half of your Illumina reads will be used in polishing. You should add something to the identifiers to distinguish them and then run minimap2 and racon again.
On the other hand, what coverage of long reads do you have? I would advise one iteration of long read polishing before short read polishing.
Thank you. I indeed did not think about this problem. As it seems, minimap2 and racon both remove the /[0-9]
suffix, as they consider only the id to the first space, and minimap2 will remove this in any case. However, this means it is not possible to run minimap2 in paired-end mode, but rather needs to be run as single-end on the joined file with changed ids. Is this correct or is there we workaround I do not see?
Yes, you will have to run single-end mode. Not sure if you will lose much accuracy in comparison to paired-end mode.
I now joined both Illumina files and assigned running ids for each read. The results now look better, but are overall still worse. This is consistent, independent of whether running a correction step with nanopore reads before or not.
Hmm, quite odd that it decreases the accuracy. Would you mind sharing the data so I can investigate locally what is happening? Did you maybe try any other polisher in the same setting?
I have tested the same setup using PILON, seeing only improvements. I tested even the same alignment files for both Racon and PILON, Racon worsening results but PILON improving them.
The assembled contigs ("contigs.fa"), as well as the true reference of Yeast S288C can be found in the attached archive. sequences.tar.gz
The Illumina reads used can be downloaded here: ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR193/003/ERR1938683/ERR1938683_1.fastq.gz ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR193/003/ERR1938683/ERR1938683_2.fastq.gz
Thank you for your help.
Thanks, will check it out!
I have run your data with both Racon and Pilon, and evaluated the results with dnadiff (from the mummer package). The assembly is quite fragmented, but I am not sure what is causing the increase in breakpoints of Racon polished contigs even thought the number of SNPs and indels is smaller (and higher average identity). I tried different alignment parameters but the result is almost the same. I'll try inspecting the poa graphs to see what is happening.
Length | 1-to-1 alignments | Breakpoints | SNPs | Indels | Avg Identity (%) | Complete BUSCOs (%) | Fragmented BUSCOs (%) | Missing BUSCOs (%) | |
---|---|---|---|---|---|---|---|---|---|
Layout | 14859762 | 659 | 2090 | 55076 | 216744 | 96.51 | |||
Pilon | 14998813 | 661 | 2130 | 21416 | 42462 | 98.54 | 4.2 | 3.8 | 92.0 |
Racon | 14790137 | 851 | 3588 | 13644 | 35218 | 98.99 | 5.6 | 2.9 | 91.5 |
Were the long reads and the Illumina reads obtained from the same strain?
Hi. Yes. The organism/strain is exactly the same (Saccharomyces cerevisiae s288c). The fragmentation of the contigs is due to our method in development that is still missing a few components. However, I do not understand how this could break racon. Thank your very much for spending time on this problem.
I have updated the table above with BUSCO scores (found with saccharomyceta_odb9). It seems that the breakpoints do not have that much of an impact I guess, not sure though.
Hello. I've recently ran racon/illumina polishing and seems like the latest racon version I have installed (1.3.3) is doing something strange with the assemblies. My bacterial long-read (flye) assembly got 25 kb shorter on the first round of polishing. I've done similar things before and changes are usually in the range of several hundred nt. And a bigger (600 Mb) genome assembled by flye or canu has number of BUSCOs go down substantially after polishing.
Shorter assemblies after polishing are not that uncommon. Although, I am not sure why the BUSCO scores would drop. Did you try Pilon in the same setting?
Yes, and the scores have increased. That's why I became fairly certain something is wrong.
No idea why. Nothing special changed between 1.3.2
and 1.3.3
. If you had PE reads, did you combine them as described above?
Yes I did forget to replace the space in the concatenated file :( very embarrassing. Sorry for bothering you over nothing.
Haha, no problem! :)