FALCON
Assembled genome size underestimated
Hello folks @pb-cdunn @mseetin @pb-jchin, our genome is about 800 Mbp and appears to be highly repetitive. I ran Falcon twice with different options, but both runs produced an assembly much smaller than the expected genome size. The problems are that there is almost no overlap between reads and that the total p_ctg size is far too small.
Can you please help me tune the assembly config?
Thanks in advance.
First run
fc.cfg
[General]
input_fofn = input.fofn
input_type = raw
length_cutoff = 5000
length_cutoff_pr = 1
genome_size = 750000000
pa_HPCdaligner_option = -v -B128 -e0.70 -M24 -l1000 -k18 -h1250 -w8 -s100
ovlp_HPCdaligner_option = -v -B128 -M24 -k24 -h1250 -e.96 -l500 -s100
pa_DBsplit_option = -a -x500 -s200
ovlp_DBsplit_option = -a -x500 -s200
falcon_sense_option = --output_multi --min_idt 0.70 --min_cov 3 --max_n_read 20000 --n_core 0
falcon_sense_skip_contained = False
overlap_filtering_setting = --max_diff 500 --max_cov 120 --min_cov 2 --bestn 100 --n_core 0
raw_reads_statistics
Statistics for all wells of length 500 bases or more
12,404,171 reads out of 13,066,465 ( 94.9%)
67,280,909,359 base pairs out of 67,464,504,646 ( 99.7%)
5,424 average read length
4,851 standard deviation
Base composition: 0.311(A) 0.182(C) 0.192(G) 0.315(T)
Distribution of Read Lengths (Bin size = 1,000)
Bin: Count % Reads % Bases Average
66,000: 2 0.0 0.0 66692
65,000: 2 0.0 0.0 66042
64,000: 3 0.0 0.0 65426
63,000: 5 0.0 0.0 64623
62,000: 1 0.0 0.0 64457
61,000: 6 0.0 0.0 63500
60,000: 3 0.0 0.0 63058
59,000: 6 0.0 0.0 62259
58,000: 7 0.0 0.0 61487
57,000: 7 0.0 0.0 60784
56,000: 10 0.0 0.0 59974
55,000: 26 0.0 0.0 58477
54,000: 20 0.0 0.0 57690
53,000: 21 0.0 0.0 56949
52,000: 39 0.0 0.0 55848
51,000: 41 0.0 0.0 54953
50,000: 51 0.0 0.0 54031
49,000: 73 0.0 0.0 52996
48,000: 83 0.0 0.0 52076
47,000: 64 0.0 0.0 51453
46,000: 100 0.0 0.0 50578
45,000: 97 0.0 0.0 49841
44,000: 143 0.0 0.1 48899
43,000: 193 0.0 0.1 47859
42,000: 229 0.0 0.1 46859
41,000: 277 0.0 0.1 45867
40,000: 319 0.0 0.1 44928
39,000: 364 0.0 0.1 44028
38,000: 471 0.0 0.2 43046
37,000: 552 0.0 0.2 42092
36,000: 725 0.0 0.2 41058
35,000: 938 0.0 0.3 39985
34,000: 1,172 0.0 0.3 38920
33,000: 1,399 0.1 0.4 37899
32,000: 1,796 0.1 0.5 36846
31,000: 2,266 0.1 0.6 35790
30,000: 2,886 0.1 0.7 34726
29,000: 3,652 0.1 0.9 33661
28,000: 4,768 0.2 1.1 32578
27,000: 6,093 0.2 1.4 31503
26,000: 7,767 0.3 1.7 30438
25,000: 9,885 0.4 2.0 29385
24,000: 12,761 0.5 2.5 28330
23,000: 16,400 0.6 3.1 27279
22,000: 20,988 0.8 3.8 26237
21,000: 27,198 1.0 4.6 25193
20,000: 34,791 1.3 5.7 24159
19,000: 45,414 1.6 7.0 23118
18,000: 59,216 2.1 8.6 22075
17,000: 77,280 2.7 10.6 21031
16,000: 102,043 3.6 13.1 19981
15,000: 135,447 4.7 16.3 18925
14,000: 181,480 6.1 20.2 17862
13,000: 242,889 8.1 25.0 16799
12,000: 315,298 10.6 30.9 15765
11,000: 378,731 13.7 37.3 14810
10,000: 444,865 17.3 44.3 13912
9,000: 520,277 21.5 51.6 13047
8,000: 542,020 25.8 58.5 12278
7,000: 556,461 30.3 64.7 11570
6,000: 611,893 35.2 70.6 10859
5,000: 694,694 40.8 76.2 10123
4,000: 823,748 47.5 81.7 9334
3,000: 1,061,032 56.0 87.2 8439
2,000: 1,560,111 68.6 92.9 7344
1,000: 2,574,667 89.4 98.5 5976
0: 1,317,905 100.0 100.0 5424
preassembly_stat
{
"genome_length": 750000000,
"length_cutoff": 5000,
"preassembled_bases": 3045888537,
"preassembled_coverage": 4.061,
"preassembled_esize": 11444.962,
"preassembled_mean": 10229.684,
"preassembled_n50": 10946,
"preassembled_p95": 16456,
"preassembled_reads": 297750,
"preassembled_seed_fragmentation": 1.053,
"preassembled_seed_truncation": 1299.576,
"preassembled_yield": 0.059,
"raw_bases": 67280909359,
"raw_coverage": 89.708,
"raw_esize": 9764.231,
"raw_mean": 5424.055,
"raw_n50": 9220,
"raw_p95": 14741,
"raw_reads": 12404171,
"seed_bases": 51290352072,
"seed_coverage": 68.387,
"seed_esize": 11919.481,
"seed_mean": 10123.013,
"seed_n50": 10886,
"seed_p95": 18151,
"seed_reads": 5066708
}
overlap_histogram
p_ctg_statistics
Total length of sequence: 130237263 bp
Total number of sequences: 8887
N25 stats: 25% of total sequence length is contained in the 547 sequences >= 41164 bp
N50 stats: 50% of total sequence length is contained in the 1595 sequences >= 24387 bp
N75 stats: 75% of total sequence length is contained in the 3373 sequences >= 13617 bp
Total GC count: 47945359 bp
GC %: 36.81 %
Second run
fc.cfg
[General]
input_fofn = input.fofn
input_type = raw
length_cutoff = -1
length_cutoff_pr = 1
genome_size = 750000000
pa_HPCdaligner_option = -v -B128 -M32 -e.70 -l4800 -s100 -k18 -h480 -w8
ovlp_HPCdaligner_option = -v -B128 -M32 -h1024 -e.96 -l2400 -s100 -k18
pa_DBsplit_option = -a -x500 -s400
ovlp_DBsplit_option = -s400
falcon_sense_option = --output_multi --min_idt 0.70 --min_cov 2 --max_n_read 200 --n_core 0
falcon_sense_skip_contained = True
overlap_filtering_setting = --max_diff 100 --max_cov 100 --min_cov 2 --n_core 0
raw_reads_stat
Statistics for all wells of length 500 bases or more
12,404,171 reads out of 13,066,465 ( 94.9%)
67,280,909,359 base pairs out of 67,464,504,646 ( 99.7%)
5,424 average read length
4,851 standard deviation
Base composition: 0.311(A) 0.182(C) 0.192(G) 0.315(T)
Distribution of Read Lengths (Bin size = 1,000)
Bin: Count % Reads % Bases Average
66,000: 2 0.0 0.0 66692
65,000: 2 0.0 0.0 66042
64,000: 3 0.0 0.0 65426
63,000: 5 0.0 0.0 64623
62,000: 1 0.0 0.0 64457
61,000: 6 0.0 0.0 63500
60,000: 3 0.0 0.0 63058
59,000: 6 0.0 0.0 62259
58,000: 7 0.0 0.0 61487
57,000: 7 0.0 0.0 60784
56,000: 10 0.0 0.0 59974
55,000: 26 0.0 0.0 58477
54,000: 20 0.0 0.0 57690
53,000: 21 0.0 0.0 56949
52,000: 39 0.0 0.0 55848
51,000: 41 0.0 0.0 54953
50,000: 51 0.0 0.0 54031
49,000: 73 0.0 0.0 52996
48,000: 83 0.0 0.0 52076
47,000: 64 0.0 0.0 51453
46,000: 100 0.0 0.0 50578
45,000: 97 0.0 0.0 49841
44,000: 143 0.0 0.1 48899
43,000: 193 0.0 0.1 47859
42,000: 229 0.0 0.1 46859
41,000: 277 0.0 0.1 45867
40,000: 319 0.0 0.1 44928
39,000: 364 0.0 0.1 44028
38,000: 471 0.0 0.2 43046
37,000: 552 0.0 0.2 42092
36,000: 725 0.0 0.2 41058
35,000: 938 0.0 0.3 39985
34,000: 1,172 0.0 0.3 38920
33,000: 1,399 0.1 0.4 37899
32,000: 1,796 0.1 0.5 36846
31,000: 2,266 0.1 0.6 35790
30,000: 2,886 0.1 0.7 34726
29,000: 3,652 0.1 0.9 33661
28,000: 4,768 0.2 1.1 32578
27,000: 6,093 0.2 1.4 31503
26,000: 7,767 0.3 1.7 30438
25,000: 9,885 0.4 2.0 29385
24,000: 12,761 0.5 2.5 28330
23,000: 16,400 0.6 3.1 27279
22,000: 20,988 0.8 3.8 26237
21,000: 27,198 1.0 4.6 25193
20,000: 34,791 1.3 5.7 24159
19,000: 45,414 1.6 7.0 23118
18,000: 59,216 2.1 8.6 22075
17,000: 77,280 2.7 10.6 21031
16,000: 102,043 3.6 13.1 19981
15,000: 135,447 4.7 16.3 18925
14,000: 181,480 6.1 20.2 17862
13,000: 242,889 8.1 25.0 16799
12,000: 315,298 10.6 30.9 15765
11,000: 378,731 13.7 37.3 14810
10,000: 444,865 17.3 44.3 13912
9,000: 520,277 21.5 51.6 13047
8,000: 542,020 25.8 58.5 12278
7,000: 556,461 30.3 64.7 11570
6,000: 611,893 35.2 70.6 10859
5,000: 694,694 40.8 76.2 10123
4,000: 823,748 47.5 81.7 9334
3,000: 1,061,032 56.0 87.2 8439
2,000: 1,560,111 68.6 92.9 7344
1,000: 2,574,667 89.4 98.5 5976
0: 1,317,905 100.0 100.0 5424
preassemble_stat
{
"genome_length": 750000000,
"length_cutoff": 13537,
"preassembled_bases": 4214378579,
"preassembled_coverage": 5.619,
"preassembled_esize": 13574.367,
"preassembled_mean": 12046.382,
"preassembled_n50": 13708,
"preassembled_p95": 19004,
"preassembled_reads": 349846,
"preassembled_seed_fragmentation": 1.22,
"preassembled_seed_truncation": 2348.162,
"preassembled_yield": 0.281,
"raw_bases": 67280909359,
"raw_coverage": 89.708,
"raw_esize": 9764.231,
"raw_mean": 5424.055,
"raw_n50": 9220,
"raw_p95": 14741,
"raw_reads": 12404171,
"seed_bases": 15000768227,
"seed_coverage": 20.001,
"seed_esize": 18308.29,
"seed_mean": 17368.493,
"seed_n50": 16782,
"seed_p95": 25316,
"seed_reads": 863677
}
p_ctg_stat
Total length of sequence: 400909193 bp
Total number of sequences: 9739
N25 stats: 25% of total sequence length is contained in the 627 sequences >= 113996 bp
N50 stats: 50% of total sequence length is contained in the 1783 sequences >= 68837 bp
N75 stats: 75% of total sequence length is contained in the 3728 sequences >= 37641 bp
Total GC count: 147692278 bp
GC %: 36.84 %
Hi,
You have sufficient raw coverage (almost 90-fold), so there should be plenty of raw data to get a decent draft assembly. What I see from your pre-assembly statistics, however, is that you only have 5- to 6-fold coverage of pre-assembled reads: "preassembled_coverage": 5.619.
You will never achieve a contiguous or complete assembly with only 5-fold preads. You need closer to 15- to 25-fold pread coverage above a certain length threshold if you want a highly contiguous assembly.
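As a quick sanity check on those numbers, here is a minimal sketch that recomputes the coverage and yield figures from the values posted in the second run's pre-assembly stats. It assumes that coverage is simply bases divided by genome_length, and that preassembled_yield is preassembled_bases divided by seed_bases; both assumptions reproduce the reported values, but the function names are only illustrative and not part of FALCON.

```python
# Values copied from the second run's pre-assembly stats in this thread.
GENOME_LENGTH = 750_000_000
SEED_BASES = 15_000_768_227
PREASSEMBLED_BASES = 4_214_378_579


def fold_coverage(bases: int, genome_length: int = GENOME_LENGTH) -> float:
    """Fold coverage, assumed here to be total bases / genome length."""
    return bases / genome_length


def pread_yield(preassembled_bases: int, seed_bases: int) -> float:
    """Fraction of seed bases that survive error correction (assumed definition)."""
    return preassembled_bases / seed_bases


def bases_missing_for(target_fold: float, have_bases: int,
                      genome_length: int = GENOME_LENGTH) -> int:
    """How many more pread bases are needed to reach a target fold coverage."""
    return max(0, int(target_fold * genome_length) - have_bases)


pread_cov = fold_coverage(PREASSEMBLED_BASES)
run_yield = pread_yield(PREASSEMBLED_BASES, SEED_BASES)
shortfall = bases_missing_for(15, PREASSEMBLED_BASES)

print(f"pread coverage: {pread_cov:.3f}X")
print(f"pread yield:    {run_yield:.3f}")
print(f"bases still needed for 15X: {shortfall:,}")
```

Under these assumptions, reaching even the low end of the suggested 15- to 25-fold range would need roughly 7 Gbp more corrected sequence than this run produced.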
You need to start by troubleshooting your pre-assembly. It may be beneficial to raise -e.70 to as much as -e.75 depending on the quality of your data.
Also, we generally don't recommend using the -a option in DBsplit: pa_DBsplit_option = -a -x500 -s400
Including the -a option will result in all subreads from all ZMWs being used. Excluding the -a option will result in only the best subread from a particular ZMW being used, which may have a large effect in a highly repetitive genome.
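For anyone who wants to script these two changes rather than edit fc.cfg by hand, the following is a rough sketch using Python's standard configparser. The helper names drop_flag and set_flag_value are hypothetical, and the config text is just the relevant two lines from the second run; it assumes each option is a single whitespace-separated token such as -a or -e.70.

```python
import configparser
import shlex

# Relevant lines from the second run's fc.cfg, as posted above.
CFG_TEXT = """\
[General]
pa_DBsplit_option = -a -x500 -s400
pa_HPCdaligner_option = -v -B128 -M32 -e.70 -l4800 -s100 -k18 -h480 -w8
"""


def drop_flag(options: str, flag: str) -> str:
    """Remove a standalone flag (e.g. '-a') from an option string."""
    return " ".join(tok for tok in shlex.split(options) if tok != flag)


def set_flag_value(options: str, flag: str, value: str) -> str:
    """Replace the value glued to a flag, e.g. '-e.70' -> '-e.75'."""
    return " ".join(
        flag + value if tok.startswith(flag) else tok
        for tok in shlex.split(options)
    )


parser = configparser.ConfigParser()
parser.optionxform = str  # keep key case; the default lowercases option names
parser.read_string(CFG_TEXT)
general = parser["General"]

general["pa_DBsplit_option"] = drop_flag(general["pa_DBsplit_option"], "-a")
general["pa_HPCdaligner_option"] = set_flag_value(
    general["pa_HPCdaligner_option"], "-e", ".75"
)

for key, value in general.items():
    print(f"{key} = {value}")
```

Whether -e.75 is the right value still depends on the quality of the data, so treat the number here as a placeholder for whatever you decide to try.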
Hope this helps
I will restart with your recommendations and post the updated results.
Thanks.
Here is the follow-up result.
I removed the -a option, changed -e.70 to -e.75, and changed length_cutoff from -1 to 3000. It still does not look like enough. Any other recommendations?
preads_stat
cat 0-rawreads/report/pre_assembly_stats.json
{
"genome_length": 750000000,
"length_cutoff": 3000,
"preassembled_bases": 6175542129,
"preassembled_coverage": 8.234,
"preassembled_esize": 11399.998,
"preassembled_mean": 10113.361,
"preassembled_n50": 10956,
"preassembled_p95": 16653,
"preassembled_reads": 610632,
"preassembled_seed_fragmentation": 1.181,
"preassembled_seed_truncation": 1101.777,
"preassembled_yield": 0.13,
"raw_bases": 50058837997,
"raw_coverage": 66.745,
"raw_esize": 11049.715,
"raw_mean": 7445.514,
"raw_n50": 10321,
"raw_p95": 16912,
"raw_reads": 6723356,
"seed_bases": 47453855282,
"seed_coverage": 63.272,
"seed_esize": 11542.476,
"seed_mean": 9011.791,
"seed_n50": 10669,
"seed_p95": 17884,
"seed_reads": 5265752
}
p_ctg_stat
Total length of sequence: 439610496 bp
Total number of sequences: 10732
N25 stats: 25% of total sequence length is contained in the 628 sequences >= 123886 bp
N50 stats: 50% of total sequence length is contained in the 1805 sequences >= 72968 bp
N75 stats: 75% of total sequence length is contained in the 3855 sequences >= 39183 bp
Total GC count: 162129330 bp
GC %: 36.88 %
I agree that pread correction doesn't appear to be proceeding as efficiently as it should, which is still what's limiting the assembly. However, I also notice that your latest assembly is not using the same full dataset as the previous two, so it's not a fair apples-to-apples comparison. (Look at the raw read stats: the latest version appears to start with roughly 23-fold less coverage, ~67X versus ~90X.)
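The coverage gap can be checked directly from the raw_bases values in the posted stats. This is only a small sketch of the arithmetic, assuming raw coverage is raw_bases divided by genome_length; the variable names are illustrative.

```python
# raw_bases and genome_length as reported in the pre-assembly stats above.
GENOME_LENGTH = 750_000_000
RAW_BASES_EARLIER = 67_280_909_359  # first and second runs
RAW_BASES_LATEST = 50_058_837_997   # follow-up run


def fold_coverage(bases: int, genome_length: int = GENOME_LENGTH) -> float:
    """Fold coverage, assumed to be total bases / genome length."""
    return bases / genome_length


earlier = fold_coverage(RAW_BASES_EARLIER)
latest = fold_coverage(RAW_BASES_LATEST)

print(f"earlier runs: {earlier:.3f}X")
print(f"latest run:   {latest:.3f}X")
print(f"difference:   {earlier - latest:.1f}X")
```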
One thing I notice is that between your initial assembly and your second version, you raised the raw read overlap length parameter from -l1000 to -l4800. Though there is slightly more corrected pread coverage in the second assembly, I have a feeling the very high minimum overlap length is restricting the amount of data that is being corrected. With a raw n50 of only 9220 (in your first assemblies), restricting the raw read overlaps to -l4800 would exclude a significant portion of the data. Maybe you should drop this cutoff to -l1500 or -l2000 or so.
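To put a rough number on "a significant portion", the read length histogram posted with the first run can be used. The sketch below assumes the "% Reads" and "% Bases" columns are cumulative percentages for reads at least as long as the bin start, and that a read shorter than the -l value cannot produce a qualifying overlap at all. It therefore gives only a lower bound on what is lost, since longer reads also need an overlap of at least that length.

```python
# Cumulative (% reads, % bases) for reads >= bin start, copied from the
# read length distribution posted in this thread (selected bins only).
CUMULATIVE = {
    5000: (40.8, 76.2),
    4000: (47.5, 81.7),
    2000: (68.6, 92.9),
    1000: (89.4, 98.5),
}


def excluded(bin_start: int) -> tuple[float, float]:
    """Percent of reads and bases shorter than bin_start."""
    pct_reads, pct_bases = CUMULATIVE[bin_start]
    return round(100 - pct_reads, 1), round(100 - pct_bases, 1)


# -l4800 falls between the 4,000 and 5,000 bins, so those two bracket it.
for cutoff in (1000, 2000, 4000, 5000):
    reads, bases = excluded(cutoff)
    print(f"shorter than {cutoff:>5} bp: {reads:4.1f}% of reads, {bases:4.1f}% of bases")
```

On this reading, -l4800 rules out somewhere between 52.5% and 59.2% of reads (18.3% to 23.8% of bases) before any overlapping is attempted, whereas -l2000 would rule out about 31.4% of reads and 7.1% of bases.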
Greg,
I used the same input, but I don't know why that happened. I just checked input.fofn and there are no differences. Anyway, I will reduce the overlap length parameter.
Thank you for your help.
Won