Fragmented and large assembly
Could you please share your interpretation of this log file from an algal assembly attempt, and suggest how to improve assembly contiguity for this highly heterozygous algal genome?
It is canu 2.2.
canu -assemble -p algae -d ./ genomeSize=1.2g -pacbio-hifi ../01_Data/hifi_decontamianted.fq useGrid=true gridOptions="--time=02-00:00:00 "
--
-- G=60000011670 sum of || length num
-- NG length index lengths || range seqs
-- ----- ------------ --------- ------------ || ------------------- -------
-- 00010 25230 216225 6000008603 || 1012-2263 909|-
-- 00020 22744 467717 12000021196 || 2264-3515 18289|---
-- 00030 21088 742150 18000004959 || 3516-4767 51620|-------
-- 00040 19816 1035933 24000015574 || 4768-6019 53288|-------
-- 00050 18758 1347318 30000018333 || 6020-7271 47811|------
-- 00060 17829 1675521 36000008872 || 7272-8523 41756|------
-- 00070 16973 2020455 42000022999 || 8524-9775 36502|-----
-- 00080 16110 2383132 48000017181 || 9776-11027 32306|----
-- 00090 14977 2768059 54000013441 || 11028-12279 31468|----
-- 00100 1012 3347938 60000011670 || 12280-13531 52652|-------
-- 001.000x 3347939 60000011670 || 13532-14783 167900|---------------------
-- || 14784-16035 400316|-------------------------------------------------
-- || 16036-17287 522679|---------------------------------------------------------------
-- || 17288-18539 470618|---------------------------------------------------------
-- || 18540-19791 377504|----------------------------------------------
-- || 19792-21043 291199|------------------------------------
-- || 21044-22295 219015|---------------------------
-- || 22296-23547 163507|--------------------
-- || 23548-24799 119442|---------------
-- || 24800-26051 85834|-----------
-- || 26052-27303 60782|--------
-- || 27304-28555 40895|-----
-- || 28556-29807 26685|----
-- || 29808-31059 16445|--
-- || 31060-32311 9319|--
-- || 32312-33563 4961|-
-- || 33564-34815 2455|-
-- || 34816-36067 1007|-
-- || 36068-37319 365|-
-- || 37320-38571 129|-
-- || 38572-39823 70|-
-- || 39824-41075 43|-
-- || 41076-42327 27|-
-- || 42328-43579 35|-
-- || 43580-44831 13|-
-- || 44832-46083 22|-
-- || 46084-47335 13|-
-- || 47336-48587 16|-
-- || 48588-49839 10|-
-- || 49840-51091 9|-
-- || 51092-52343 8|-
-- || 52344-53595 7|-
-- || 53596-54847 2|-
-- || 54848-56099 3|-
-- || 56100-57351 1|-
-- || 57352-58603 0|
-- || 58604-59855 0|
-- || 59856-61107 0|
-- || 61108-62359 1|-
-- || 62360-63611 1|-
--
[UNITIGGING/MERS]
--
-- 22-mers Fraction
-- Occurrences NumMers Unique Total
-- 1- 1 0 0.0000 0.0000
-- 2- 2 70353412 ********************* 0.0786 0.0032
-- 3- 5 163067656 ************************************************** 0.1444 0.0073
-- 6- 10 226452057 ********************************************************************** 0.3153 0.0246
-- 11- 17 184614141 ********************************************************* 0.5559 0.0683
-- 18- 26 63289857 ******************* 0.7338 0.1205
-- 27- 37 28132636 ******** 0.7944 0.1479
-- 38- 50 20924224 ****** 0.8242 0.1676
-- 51- 65 16517590 ***** 0.8468 0.1882
-- 66- 82 12640673 *** 0.8648 0.2097
-- 83- 101 8333925 ** 0.8786 0.2307
-- 102- 122 8099458 ** 0.8876 0.2477
-- 123- 145 11588961 *** 0.8968 0.2690
-- 146- 170 11098585 *** 0.9098 0.3051
-- 171- 197 7138387 ** 0.9220 0.3443
-- 198- 226 6849776 ** 0.9299 0.3741
-- 227- 257 3515630 * 0.9374 0.4068
-- 258- 290 3859564 * 0.9413 0.4258
-- 291- 325 7078389 ** 0.9457 0.4505
-- 326- 362 7034613 ** 0.9536 0.5008
-- 363- 401 7501105 ** 0.9615 0.5564
-- 402- 442 3378618 * 0.9698 0.6210
-- 443- 485 5011975 * 0.9735 0.6529
-- 486- 530 7194526 ** 0.9792 0.7076
-- 531- 577 4579124 * 0.9872 0.7905
-- 578- 626 2679194 0.9922 0.8475
-- 627- 677 986289 0.9952 0.8837
-- 678- 730 331835 0.9963 0.8980
-- 731- 785 146144 0.9966 0.9032
-- 786- 842 109799 0.9968 0.9057
-- 843- 901 102121 0.9969 0.9077
-- 902- 962 89584 0.9970 0.9098
-- 963- 1025 84341 0.9971 0.9117
-- 1026- 1090 103332 0.9972 0.9136
-- 1091- 1157 149332 0.9973 0.9161
-- 1158- 1226 457665 0.9975 0.9200
-- 1227- 1297 856871 0.9980 0.9327
-- 1298- 1370 495135 0.9990 0.9575
-- 1371- 1445 126710 0.9995 0.9722
-- 1446- 1522 65082 0.9997 0.9762
-- 1523- 1601 41337 0.9997 0.9784
--
-- 0 (max occurrences)
-- 43805436174 (total mers, non-unique)
-- 895285547 (distinct mers, non-unique)
-- 0 (unique mers)
[UNITIGGING/OVERLAPS]
-- category reads % read length feature size or coverage analysis
-- ---------------- ------- ------- ---------------------- ------------------------ --------------------
-- middle-missing 24795 0.74 13476.41 +- 3388.92 2322.58 +- 2244.95 (bad trimming)
-- middle-hump 739 0.02 14355.48 +- 3584.40 5780.24 +- 3541.19 (bad trimming)
-- no-5-prime 18712 0.56 12681.51 +- 3525.81 3113.69 +- 3131.66 (bad trimming)
-- no-3-prime 18201 0.54 12411.73 +- 3722.96 3129.88 +- 3171.72 (bad trimming)
--
-- low-coverage 520646 15.55 12063.76 +- 3598.21 6.73 +- 3.40 (easy to assemble, potential for lower quality consensus)
-- unique 233029 6.96 13212.73 +- 3948.96 47.24 +- 15.63 (easy to assemble, perfect, yay)
-- repeat-cont 2241936 66.96 13203.57 +- 3469.65 438.75 +- 273.38 (potential for consensus errors, no impact on assembly)
-- repeat-dove 58328 1.74 21293.34 +- 2212.20 333.09 +- 212.73 (hard to assemble, likely won't assemble correctly or even at all)
--
-- span-repeat 79343 2.37 13634.71 +- 3466.81 4671.17 +- 4283.54 (read spans a large repeat, usually easy to assemble)
-- uniq-repeat-cont 86825 2.59 12438.43 +- 3173.49 (should be uniquely placed, low potential for consensus errors, no impact on assembly)
-- uniq-repeat-dove 36814 1.10 15585.68 +- 3226.67 (will end contigs, potential to misassemble)
-- uniq-anchor 8571 0.26 14485.44 +- 3635.17 3980.64 +- 3565.95 (repeat read, with unique section, probable bad read)
[UNITIGGING/ADJUSTMENT]
-- No report available.
[UNITIGGING/ERROR RATES]
--
-- ERROR RATES
-- -----------
-- --------threshold------
-- 3764374 fraction error fraction percent
-- samples (1e-5) error error
-- -------------------------- -------- --------
-- command line (-eg) -> 30.00 0.0300%
-- command line (-eM) -> 1000.00 1.0000%
-- mean + std.dev 0.47 +- 4 * 2.98 -> 12.41 0.0124%
-- median + mad 0.00 +- 4 * 0.00 -> 0.00 0.0000%
-- 90th percentile -> 1.00 0.0010% (enabled)
--
-- BEST EDGE FILTERING
-- -------------------
-- At graph threshold 0.0300%, reads:
-- available to have edges: 593234
-- with at least one edge: 558720
--
-- At max threshold 1.0000%, reads: (not computed)
-- available to have edges: 0
-- with at least one edge: 0
--
-- At tight threshold 0.0010%, reads with:
-- both edges below threshold: 470068
-- one edge above threshold: 70669
-- both edges above threshold: 17983
-- at least one edge: 558720
--
-- At loose threshold 0.0124%, reads with:
-- both edges below threshold: 501295
-- one edge above threshold: 49081
-- both edges above threshold: 8344
-- at least one edge: 558720
--
--
-- INITIAL EDGES
-- -------- ----------------------------------------
-- 2529404 reads are contained
-- 1150834 reads have no best edges (singleton)
-- 59063 reads have only one best edge (spur)
-- 53786 are mutual best
-- 383296 reads have two best edges
-- 38960 have one mutual best edge
-- 336273 have two mutual best edges
--
--
-- FINAL EDGES
-- -------- ----------------------------------------
-- 2529404 reads are contained
-- 1166426 reads have no best edges (singleton)
-- 57991 reads have only one best edge (spur)
-- 55393 are mutual best
-- 368776 reads have two best edges
-- 31203 have one mutual best edge
-- 331931 have two mutual best edges
--
--
-- EDGE FILTERING
-- -------- ------------------------------------------
-- 0 reads are ignored
-- 332467 reads have a gap in overlap coverage
-- 7840 reads have lopsided best edges
[UNITIGGING/CONTIGS]
-- Found, in version 1, after unitig construction:
-- contigs: 15569 sequences, total length 1113172293 bp (including 1969 repeats of total length 40251135 bp).
-- bubbles: 12105 sequences, total length 339428280 bp.
-- unassembled: 453881 sequences, total length 6160515805 bp.
--
-- Contig sizes based on genome size 1.2gbp:
--
-- NG (bp) LG (contigs) sum (bp)
-- ---------- ------------ ----------
-- 10 3986030 23 120863694
-- 20 2371456 62 241139173
-- 30 179641 407 360086301
-- 40 112244 1281 480111774
-- 50 82207 2551 600054836
-- 60 62827 4230 720052631
-- 70 47673 6430 840029890
-- 80 34867 9377 960029583
-- 90 21305 13696 1080014296
--
[UNITIGGING/CONSENSUS]
-- Found, in version 2, after consensus generation:
-- contigs: 15569 sequences, total length 1532185200 bp (including 1969 repeats of total length 54224892 bp).
-- bubbles: 12105 sequences, total length 465073080 bp.
-- unassembled: 453881 sequences, total length 8359912934 bp.
--
-- Contig sizes based on genome size 1.2gbp:
--
-- NG (bp) LG (contigs) sum (bp)
-- ---------- ------------ ----------
-- 10 6333588 16 123644693
-- 20 4280720 39 241558585
-- 30 1881180 76 360592077
-- 40 261411 368 480254527
-- 50 176426 937 600054053
-- 60 135459 1717 720026039
-- 70 110542 2702 840026471
-- 80 90976 3902 960052925
-- 90 74724 5358 1080073502
-- 100 60575 7141 1200005941
-- 110 48147 9358 1320015376
-- 120 35183 12255 1440004833
--
The assembly stats are below for reference. Note that the assembly is much larger than the expected genome size of around 1.2 Gbp.
sum = 1997258280, n = 27674, ave = 72170.93, largest = 12573277
N50 = 91503, n = 4209
N60 = 69640, n = 6718
N70 = 53286, n = 10008
N80 = 40965, n = 14298
N90 = 31636, n = 19849
N100 = 4488, n = 27674
N_count = 0
Gaps = 0
Thanks a lot in advance.
The larger size is expected: it is likely both haplotypes of a diploid genome (see https://canu.readthedocs.io/en/latest/faq.html#my-genome-size-and-assembly-size-are-different-help). You can see that about 500 Mbp are already flagged as bubbles (alternate haplotype). The rest is likely too diverged to be flagged automatically, so you'd need to rely on a tool like purge_dups. As for the fragmentation, the coverage looks really low from the k-mer histogram. The primary peak is between 6-10x, which is too low for a good assembly. What coverage were you inputting? Is this a clonal sample or a collection of individuals?
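As a rough check of the two observations in this reply, the numbers can be read straight from the report pasted above. This is only a sketch: the values are copied by hand from the `[UNITIGGING/CONSENSUS]` and `[UNITIGGING/MERS]` sections, and only the first few histogram rows are included. Note that the histogram bins are not equal in width, so picking the largest bin compares raw bin totals, the same way the report draws its bars.

```python
# Figures copied by hand from the Canu report above.
contigs_bp = 1_532_185_200    # [UNITIGGING/CONSENSUS] contigs, total length
bubbles_bp = 465_073_080      # [UNITIGGING/CONSENSUS] bubbles, total length
expected_bp = 1_200_000_000   # genomeSize=1.2g on the command line

# Contigs plus bubbles should account for the whole reported assembly.
total_bp = contigs_bp + bubbles_bp

# First rows of the 22-mer histogram: (low, high, number of distinct mers).
# Bins are unequal in width; this compares raw bin totals only.
mer_bins = [
    (2, 2, 70_353_412),
    (3, 5, 163_067_656),
    (6, 10, 226_452_057),
    (11, 17, 184_614_141),
    (18, 26, 63_289_857),
    (27, 37, 28_132_636),
]
peak_low, peak_high, peak_count = max(mer_bins, key=lambda b: b[2])

print(f"contigs + bubbles: {total_bp:,} bp")
print(f"bubbles alone: {bubbles_bp / 1e6:.0f} Mbp")
print(f"contigs without bubbles vs expected size: {contigs_bp / expected_bp:.2f}x")
print(f"largest k-mer bin: {peak_low}-{peak_high} occurrences")
```

The contigs-plus-bubbles total equals the `sum` in the assembly stats, and even with the bubbles set aside the contigs remain larger than the 1.2 Gbp estimate, which is why a further pass with a tool like purge_dups is suggested.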
Thanks for the prompt reply, Sergey.
This genome has puzzled me quite a bit. Total input HiFi data is ~60X (assuming a ~1.2 Gbp genome size, although the true size could be around 2 Gbp).
The GenomeScope profile of the same organism from the short-read data is here: https://github.com/schatzlab/genomescope/issues/142
file format type num_seqs sum_len min_len avg_len max_len
../01_Data/hifi_dedup_decontamianted.fq FASTQ DNA 4,122,639 73,888,444,677 90 17,922.6 63,566
Note this is a Cladocopium spp. where the ploidy and duplication levels are not clear: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9412976/
Any thoughts on how to proceed would be very useful to me.
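For reference, the ~60X figure follows from the `sum_len` column of the stats table above. A minimal sketch of the calculation under the two genome sizes mentioned in this post (the `coverage` helper is just an illustrative name, and the 2 Gbp value is the poster's rough guess, not a measurement):

```python
total_bases = 73_888_444_677  # sum_len from the stats table above

def coverage(bases: int, genome_size_bp: int) -> float:
    """Nominal read depth: total bases divided by assumed genome size."""
    return bases / genome_size_bp

# Genome-size hypotheses taken from the post; neither is confirmed.
hypotheses = {"1.2 Gbp": 1_200_000_000, "2 Gbp": 2_000_000_000}

for label, size_bp in hypotheses.items():
    print(f"assumed genome size {label}: {coverage(total_bases, size_bp):.1f}x")
```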
The genomescope results imply a larger genome than 1.2 Gbp but also that the haplotypes are extremely similar (if it is diploid) as there are very few single-copy k-mers. You'd probably benefit from a larger k-mer size like k=31 instead of 19 for genomescope.
The HiFi assembly implies an even larger genome size. The k-mer coverage is somewhere around 8x, and Canu used 50x * 1.2 Gbp = 60 Gbp of reads, so the total sequence content would be about 7 Gbp, which would imply a 3.5 Gbp genome if diploid. HiFi assembly is going to be very sensitive to variation, though, so it makes me wonder whether the inputs for the Illumina and HiFi data are the same. Is it possible the Illumina sample is more clonal than the HiFi sample? Either way, I'd increase either the genome size or maxInputCoverage, since right now it's only using 50x * 1.2 Gbp, so you have more data that was not used in the assembly. After that, your best option is probably to rely on core genes/purge_dups to determine whether there is haplotype duplication in the assembly. You could also try verkko and look at the resulting assembly graphs to see if there is diploid structure (though it would likely be less contiguous, as it only produces phased outputs while canu can produce a pseudo-haplotype).
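The arithmetic behind that estimate, written out as a sketch. The 8x peak is the approximate value read off the k-mer histogram, so the results are rough (they come out slightly above the rounded 7 Gbp and 3.5 Gbp quoted here). The "unused" figure assumes the stats table posted earlier describes the same reads given to Canu, which may not hold exactly because the file names differ.

```python
genome_size_setting = 1_200_000_000  # genomeSize=1.2g on the command line
max_input_coverage = 50              # coverage cap quoted in this reply
kmer_peak_coverage = 8               # approximate peak of the 22-mer histogram

# Bases Canu kept after subsampling; the report header shows G=60000011670.
bases_used = max_input_coverage * genome_size_setting

# If 60 Gbp of reads only gives ~8x depth, this much sequence is being covered.
implied_total_bp = bases_used / kmer_peak_coverage

# Under a diploid assumption with both haplotypes assembled separately.
implied_haploid_bp = implied_total_bp / 2

# Assumption: the posted stats table describes the same input reads.
input_bases = 73_888_444_677
reported_used = 60_000_011_670       # G= value from the report header
unused_bases = input_bases - reported_used

print(f"bases used: {bases_used / 1e9:.1f} Gbp")
print(f"implied total sequence: {implied_total_bp / 1e9:.2f} Gbp")
print(f"implied haploid size if diploid: {implied_haploid_bp / 1e9:.2f} Gbp")
print(f"input not used: {unused_bases / 1e9:.1f} Gbp")
```

Raising `genomeSize` or `maxInputCoverage` changes `bases_used`, which is how the remaining reads would be brought into the assembly.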
I've found issues with Canu when trying to carry out population genome sequencing; there's just too much population variation to construct long consensus contigs.
Idle. The comments above provide suggestions, and I agree that HiFi and canu are not well suited to producing a single genome from population sequencing.