
Contigs not found.

Open deeptir-unimelb opened this issue 4 years ago • 1 comments

Hi All,

I am new to bioinformatics, self-learning as I go, and have limited knowledge of it. I am trying to assemble an E. coli genome using Illumina reads and ONT reads. I was able to assemble 2 out of 3 samples using Unicycler, and it worked without a glitch. With the 3rd sample, I am unable to get a circularised genome. After the Racon polishing step, this is the message I am getting in the Unicycler log file:

Searching for contigs using 5,000 bp of contig ends.

Contig   Result              Start pos   End pos   Strand
1        not found
2        not found
3        not found
4        not found
5        not found
6        not found
7        found in unitig 5   0           206240    +
8        not found
9        not found
10       not found

It is able to find only 15 contigs to place in the unitig graph. Hence my final assembly has 72 nodes and is not circularised. Please help.

deeptir-unimelb avatar Mar 30 '20 01:03 deeptir-unimelb

Lots to unpack here. I have an idea of what’s going on from your reply to #122, but I’ll detail the whole process in case it helps someone else.

In general, when I have a poor assembly, the first thing I do is look at the different graphs. There’s some guidance on this in the manual, but the idea is to open the .gfa outputs in Bandage. Is the SPAdes assembly well bridged? Is the long read assembly the right size, and does it have nice long contigs?
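If you want numbers to go with the Bandage picture, here is a minimal sketch (not part of Unicycler) that totals the segment lengths in a .gfa file and reports an N50. It assumes standard GFA "S" lines, where the third tab-separated field is the sequence, or "*" with an LN:i: length tag. The function name and the toy graph are made up for illustration.

```python
def gfa_summary(gfa_text):
    """Return (segment_count, total_bp, n50) for the S lines of a GFA graph."""
    lengths = []
    for line in gfa_text.splitlines():
        fields = line.split("\t")
        # Only segment lines carry sequence; skip headers, links, paths, etc.
        if fields[0] != "S" or len(fields) < 3:
            continue
        seq = fields[2]
        if seq != "*":
            lengths.append(len(seq))
            continue
        # Sequence omitted: fall back to an LN:i:<length> tag if present.
        for tag in fields[3:]:
            if tag.startswith("LN:i:"):
                lengths.append(int(tag[5:]))
                break

    total = sum(lengths)
    running, n50 = 0, 0
    # N50: length of the segment at which the running sum (longest first)
    # reaches half of the total assembly size.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            n50 = length
            break
    return len(lengths), total, n50


# Toy graph (hypothetical): three segments of 10, 4 and 6 bp and one link.
example = (
    "H\tVN:Z:1.0\n"
    "S\t1\tACGTACGTAC\n"
    "S\t2\tACGT\n"
    "S\t3\t*\tLN:i:6\n"
    "L\t1\t+\t2\t+\t0M\n"
)
print(gfa_summary(example))
```

For a real graph you would pass in the contents of the file, e.g. open(path).read(). A total far from the expected genome size, or a very small N50, is a hint that something went wrong before the bridging step.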

If you installed Unicycler recently via conda, you might have run into issue #218 - I certainly noticed a lot of unusually poor assemblies when I updated SPAdes. As mentioned there, reverting SPAdes fixed that. I have Unicycler in its own environment, so reverting doesn’t mess anything else up.

If everything’s good to this point, I’d then check whether the read sets are from the same isolate. There’s also a line in Unicycler’s log stating the alignment identity, which can give you a clue. I see that yours is low (75%) in the log you attached in #122.

One quick way to check is to assemble your short reads with SPAdes/shovill, assemble your long reads with Flye and polish with Racon, then generate a phylogeny using something like Parsnp. You could also run your reads through mash, but depending on your coverage it might wind up taking longer.

You’d expect to have six assemblies in three clades, with the long and short read assemblies of each isolate sharing the same clade. You can also run some really quick tools like mlst and abricate prior to Parsnp as a sanity check. In the past I’ve assigned a barcode to the wrong sample ID, for example, which was clear from mlst alone.
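The pairing check can be sketched in a few lines. This is only an illustration: the sample names and sequence types below are hypothetical, and you would fill the dictionary in by hand (or from your own parsing) from whatever typing tool you ran.

```python
def mismatched_pairs(st_by_assembly):
    """Return samples whose short and long read assemblies disagree.

    st_by_assembly maps (sample_id, read_type) -> sequence type string,
    where read_type is "short" or "long".
    """
    mismatches = []
    samples = sorted({sample for sample, _ in st_by_assembly})
    for sample in samples:
        short_st = st_by_assembly.get((sample, "short"))
        long_st = st_by_assembly.get((sample, "long"))
        # Flag missing calls as well as outright disagreements.
        if short_st is None or long_st is None or short_st != long_st:
            mismatches.append((sample, short_st, long_st))
    return mismatches


# Hypothetical typing results for three isolates.
calls = {
    ("sample1", "short"): "131", ("sample1", "long"): "131",
    ("sample2", "short"): "10",  ("sample2", "long"): "10",
    ("sample3", "short"): "69",  ("sample3", "long"): "73",
}
print(mismatched_pairs(calls))
```

A flagged sample is a prompt to look closer rather than proof of a swap, since residual errors in a long read only assembly can also upset allele calls.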

If they group logically, we can then take a look at the Flye assembly and compare it to Unicycler’s long read assembly (by loading the .gfa files into Bandage). If the Flye assembly is better than Unicycler’s miniasm attempt, you can pass the Flye assembly graph to Unicycler using the --existing_long_read_assembly flag.
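As a rough sketch of what that call looks like, the snippet below just builds and prints the command line rather than running anything. All file and directory names are placeholders; flye_out/assembly_graph.gfa assumes Flye was run with flye_out as its output directory.

```python
import shlex


def unicycler_command(short_1, short_2, long_reads, flye_gfa, out_dir):
    """Build a Unicycler hybrid assembly command that reuses a Flye graph."""
    return [
        "unicycler",
        "-1", short_1,       # Illumina forward reads
        "-2", short_2,       # Illumina reverse reads
        "-l", long_reads,    # ONT reads
        "--existing_long_read_assembly", flye_gfa,
        "-o", out_dir,
    ]


# Placeholder paths: substitute your own files.
cmd = unicycler_command(
    "sample3_R1.fastq.gz",
    "sample3_R2.fastq.gz",
    "sample3_ont.fastq.gz",
    "flye_out/assembly_graph.gfa",
    "sample3_unicycler",
)
print(shlex.join(cmd))
```

Printing the command first makes it easy to eyeball the paths before committing to a long run; subprocess.run(cmd, check=True) would then execute it.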

I’ve actually made this part of my standard workflow with Unicycler in our most recent E. coli sequencing set - although Flye takes a little longer to assemble, it’s made a difference in maybe 30% of the isolates. Nothing groundbreaking, but enough to complete some regions that miniasm struggled with.

If this is the same read set as #122, then I think you have reads belonging to different isolates. You should still get a good assembly with Flye, given you have ~85X coverage. If you don’t, then you’d need to look at your QC data (e.g. read length and quality distributions). If your data looks fine but your assembly is still poor, maybe you have some contamination. You could then, for example, run your assembly through kraken to make sure you’re only assembling E. coli, and if so, through CheckM to see if there are gene duplications.

Hopefully this gives you some stuff to try. Here’s my workflow for hybrid assemblies, assuming you’ve already QCed and are happy with the Illumina data:

Basecalling - guppy HAC
Adaptor removal - qcat (can also --trim in guppy)
QC - NanoStat (report on all read sets, looking at length, read N50/mean/median)
Filtering - filtlong (depending on the issue to address; typically we have too much data, so I reduce to ~100X coverage with -t)
Chimeric discard - Unicycler_scrub (we use a one pot protocol, and discard any chimeric reads)
Long read assembly - Flye
Long read polishing - racon x 5 (Unicycler_polish can handle this, I use a script. If you only have long reads, use Medaka and the recommended Racon config)
Hybrid assembly - Unicycler (using the Flye gfa file as input)
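For the filtering step, the number given to filtlong's -t (target bases) is just genome size times the coverage you want to keep. A small sketch of that arithmetic follows; the 5 Mbp genome size is an assumed round figure for E. coli, so substitute your own estimate.

```python
def target_bases(genome_size_bp, coverage):
    """Number of read bases to keep for a desired depth of coverage."""
    return int(genome_size_bp * coverage)


def estimated_coverage(total_read_bases, genome_size_bp):
    """Depth of coverage implied by a read set's total base count."""
    return total_read_bases / genome_size_bp


GENOME_SIZE = 5_000_000  # assumption: roughly 5 Mbp for E. coli

# Bases to keep for ~100X, i.e. the value to hand to filtlong's -t option.
print(target_bases(GENOME_SIZE, 100))

# Going the other way: 425 Mbp of reads (hypothetical total) over 5 Mbp.
print(estimated_coverage(425_000_000, GENOME_SIZE))
```

The total base count for a read set can be taken from the NanoStat report from the QC step.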

stevenjdunn avatar Mar 31 '20 08:03 stevenjdunn