FALCON: assembled genome size underestimated

Hello folks @pb-cdunn @mseetin @pb-jchin, our genome is about 800 Mbp and appears to be highly repetitive. I ran FALCON twice with different options, but both runs produced an assembly much smaller than the expected genome size. The problems are that there is almost no overlap between reads and that the total p_ctg size is far too small.

Can you please help to tune the assembly config?

Thanks in advance.

First run

fc.cfg

[General]
input_fofn = input.fofn
input_type = raw

length_cutoff = 5000
length_cutoff_pr = 1
genome_size = 750000000

pa_HPCdaligner_option =  -v -B128 -e0.70 -M24 -l1000 -k18 -h1250 -w8 -s100

ovlp_HPCdaligner_option = -v -B128 -M24 -k24 -h1250 -e.96 -l500 -s100

pa_DBsplit_option = -a -x500 -s200
ovlp_DBsplit_option = -a -x500  -s200

falcon_sense_option = --output_multi --min_idt 0.70 --min_cov 3 --max_n_read 20000 --n_core 0
falcon_sense_skip_contained = False

overlap_filtering_setting = --max_diff 500  --max_cov 120 --min_cov 2 --bestn 100 --n_core 0

raw_reads_statistics


Statistics for all wells of length 500 bases or more

     12,404,171 reads        out of      13,066,465  ( 94.9%)
 67,280,909,359 base pairs   out of  67,464,504,646  ( 99.7%)

          5,424 average read length
          4,851 standard deviation

  Base composition: 0.311(A) 0.182(C) 0.192(G) 0.315(T)

  Distribution of Read Lengths (Bin size = 1,000)

        Bin:      Count  % Reads  % Bases     Average
     66,000:          2      0.0      0.0       66692
     65,000:          2      0.0      0.0       66042
     64,000:          3      0.0      0.0       65426
     63,000:          5      0.0      0.0       64623
     62,000:          1      0.0      0.0       64457
     61,000:          6      0.0      0.0       63500
     60,000:          3      0.0      0.0       63058
     59,000:          6      0.0      0.0       62259
     58,000:          7      0.0      0.0       61487
     57,000:          7      0.0      0.0       60784
     56,000:         10      0.0      0.0       59974
     55,000:         26      0.0      0.0       58477
     54,000:         20      0.0      0.0       57690
     53,000:         21      0.0      0.0       56949
     52,000:         39      0.0      0.0       55848
     51,000:         41      0.0      0.0       54953
     50,000:         51      0.0      0.0       54031
     49,000:         73      0.0      0.0       52996
     48,000:         83      0.0      0.0       52076
     47,000:         64      0.0      0.0       51453
     46,000:        100      0.0      0.0       50578
     45,000:         97      0.0      0.0       49841
     44,000:        143      0.0      0.1       48899
     43,000:        193      0.0      0.1       47859
     42,000:        229      0.0      0.1       46859
     41,000:        277      0.0      0.1       45867
     40,000:        319      0.0      0.1       44928
     39,000:        364      0.0      0.1       44028
     38,000:        471      0.0      0.2       43046
     37,000:        552      0.0      0.2       42092
     36,000:        725      0.0      0.2       41058
     35,000:        938      0.0      0.3       39985
     34,000:      1,172      0.0      0.3       38920
     33,000:      1,399      0.1      0.4       37899
     32,000:      1,796      0.1      0.5       36846
     31,000:      2,266      0.1      0.6       35790
     30,000:      2,886      0.1      0.7       34726
     29,000:      3,652      0.1      0.9       33661
     28,000:      4,768      0.2      1.1       32578
     27,000:      6,093      0.2      1.4       31503
     26,000:      7,767      0.3      1.7       30438
     25,000:      9,885      0.4      2.0       29385
     24,000:     12,761      0.5      2.5       28330
     23,000:     16,400      0.6      3.1       27279
     22,000:     20,988      0.8      3.8       26237
     21,000:     27,198      1.0      4.6       25193
     20,000:     34,791      1.3      5.7       24159
     19,000:     45,414      1.6      7.0       23118
     18,000:     59,216      2.1      8.6       22075
     17,000:     77,280      2.7     10.6       21031
     16,000:    102,043      3.6     13.1       19981
     15,000:    135,447      4.7     16.3       18925
     14,000:    181,480      6.1     20.2       17862
     13,000:    242,889      8.1     25.0       16799
     12,000:    315,298     10.6     30.9       15765
     11,000:    378,731     13.7     37.3       14810
     10,000:    444,865     17.3     44.3       13912
      9,000:    520,277     21.5     51.6       13047
      8,000:    542,020     25.8     58.5       12278
      7,000:    556,461     30.3     64.7       11570
      6,000:    611,893     35.2     70.6       10859
      5,000:    694,694     40.8     76.2       10123
      4,000:    823,748     47.5     81.7        9334
      3,000:  1,061,032     56.0     87.2        8439
      2,000:  1,560,111     68.6     92.9        7344
      1,000:  2,574,667     89.4     98.5        5976
          0:  1,317,905    100.0    100.0        5424

preassembly_stat

{
    "genome_length": 750000000,
    "length_cutoff": 5000,
    "preassembled_bases": 3045888537,
    "preassembled_coverage": 4.061,
    "preassembled_esize": 11444.962,
    "preassembled_mean": 10229.684,
    "preassembled_n50": 10946,
    "preassembled_p95": 16456,
    "preassembled_reads": 297750,
    "preassembled_seed_fragmentation": 1.053,
    "preassembled_seed_truncation": 1299.576,
    "preassembled_yield": 0.059,
    "raw_bases": 67280909359,
    "raw_coverage": 89.708,
    "raw_esize": 9764.231,
    "raw_mean": 5424.055,
    "raw_n50": 9220,
    "raw_p95": 14741,
    "raw_reads": 12404171,
    "seed_bases": 51290352072,
    "seed_coverage": 68.387,
    "seed_esize": 11919.481,
    "seed_mean": 10123.013,
    "seed_n50": 10886,
    "seed_p95": 18151,
    "seed_reads": 5066708
}

overlap_histogram

OvlpHist_1.pdf

p_ctg_statistics

Total length of sequence:       130237263 bp
Total number of sequences:      8887
N25 stats:                      25% of total sequence length is contained in the 547 sequences >= 41164 bp
N50 stats:                      50% of total sequence length is contained in the 1595 sequences >= 24387 bp
N75 stats:                      75% of total sequence length is contained in the 3373 sequences >= 13617 bp
Total GC count:                 47945359 bp
GC %:                           36.81 %

Second run fc.cfg

[General]
input_fofn = input.fofn
input_type = raw
length_cutoff = -1
length_cutoff_pr = 1
genome_size = 750000000

pa_HPCdaligner_option =  -v -B128 -M32 -e.70 -l4800 -s100 -k18 -h480 -w8
ovlp_HPCdaligner_option = -v -B128 -M32 -h1024 -e.96 -l2400 -s100 -k18

pa_DBsplit_option = -a -x500 -s400
ovlp_DBsplit_option = -s400

falcon_sense_option = --output_multi --min_idt 0.70 --min_cov 2 --max_n_read 200 --n_core 0
falcon_sense_skip_contained = True

overlap_filtering_setting = --max_diff 100 --max_cov 100 --min_cov 2 --n_core 0

raw_reads_stat

Statistics for all wells of length 500 bases or more

     12,404,171 reads        out of      13,066,465  ( 94.9%)
 67,280,909,359 base pairs   out of  67,464,504,646  ( 99.7%)

          5,424 average read length
          4,851 standard deviation

  Base composition: 0.311(A) 0.182(C) 0.192(G) 0.315(T)

  Distribution of Read Lengths (Bin size = 1,000)

        Bin:      Count  % Reads  % Bases     Average
     66,000:          2      0.0      0.0       66692
     65,000:          2      0.0      0.0       66042
     64,000:          3      0.0      0.0       65426
     63,000:          5      0.0      0.0       64623
     62,000:          1      0.0      0.0       64457
     61,000:          6      0.0      0.0       63500
     60,000:          3      0.0      0.0       63058
     59,000:          6      0.0      0.0       62259
     58,000:          7      0.0      0.0       61487
     57,000:          7      0.0      0.0       60784
     56,000:         10      0.0      0.0       59974
     55,000:         26      0.0      0.0       58477
     54,000:         20      0.0      0.0       57690
     53,000:         21      0.0      0.0       56949
     52,000:         39      0.0      0.0       55848
     51,000:         41      0.0      0.0       54953
     50,000:         51      0.0      0.0       54031
     49,000:         73      0.0      0.0       52996
     48,000:         83      0.0      0.0       52076
     47,000:         64      0.0      0.0       51453
     46,000:        100      0.0      0.0       50578
     45,000:         97      0.0      0.0       49841
     44,000:        143      0.0      0.1       48899
     43,000:        193      0.0      0.1       47859
     42,000:        229      0.0      0.1       46859
     41,000:        277      0.0      0.1       45867
     40,000:        319      0.0      0.1       44928
     39,000:        364      0.0      0.1       44028
     38,000:        471      0.0      0.2       43046
     37,000:        552      0.0      0.2       42092
     36,000:        725      0.0      0.2       41058
     35,000:        938      0.0      0.3       39985
     34,000:      1,172      0.0      0.3       38920
     33,000:      1,399      0.1      0.4       37899
     32,000:      1,796      0.1      0.5       36846
     31,000:      2,266      0.1      0.6       35790
     30,000:      2,886      0.1      0.7       34726
     29,000:      3,652      0.1      0.9       33661
     28,000:      4,768      0.2      1.1       32578
     27,000:      6,093      0.2      1.4       31503
     26,000:      7,767      0.3      1.7       30438
     25,000:      9,885      0.4      2.0       29385
     24,000:     12,761      0.5      2.5       28330
     23,000:     16,400      0.6      3.1       27279
     22,000:     20,988      0.8      3.8       26237
     21,000:     27,198      1.0      4.6       25193
     20,000:     34,791      1.3      5.7       24159
     19,000:     45,414      1.6      7.0       23118
     18,000:     59,216      2.1      8.6       22075
     17,000:     77,280      2.7     10.6       21031
     16,000:    102,043      3.6     13.1       19981
     15,000:    135,447      4.7     16.3       18925
     14,000:    181,480      6.1     20.2       17862
     13,000:    242,889      8.1     25.0       16799
     12,000:    315,298     10.6     30.9       15765
     11,000:    378,731     13.7     37.3       14810
     10,000:    444,865     17.3     44.3       13912
      9,000:    520,277     21.5     51.6       13047
      8,000:    542,020     25.8     58.5       12278
      7,000:    556,461     30.3     64.7       11570
      6,000:    611,893     35.2     70.6       10859
      5,000:    694,694     40.8     76.2       10123
      4,000:    823,748     47.5     81.7        9334
      3,000:  1,061,032     56.0     87.2        8439
      2,000:  1,560,111     68.6     92.9        7344
      1,000:  2,574,667     89.4     98.5        5976
          0:  1,317,905    100.0    100.0        5424

OvlpHist_2.pdf

preassemble_stat

{
    "genome_length": 750000000,
    "length_cutoff": 13537,
    "preassembled_bases": 4214378579,
    "preassembled_coverage": 5.619,
    "preassembled_esize": 13574.367,
    "preassembled_mean": 12046.382,
    "preassembled_n50": 13708,
    "preassembled_p95": 19004,
    "preassembled_reads": 349846,
    "preassembled_seed_fragmentation": 1.22,
    "preassembled_seed_truncation": 2348.162,
    "preassembled_yield": 0.281,
    "raw_bases": 67280909359,
    "raw_coverage": 89.708,
    "raw_esize": 9764.231,
    "raw_mean": 5424.055,
    "raw_n50": 9220,
    "raw_p95": 14741,
    "raw_reads": 12404171,
    "seed_bases": 15000768227,
    "seed_coverage": 20.001,
    "seed_esize": 18308.29,
    "seed_mean": 17368.493,
    "seed_n50": 16782,
    "seed_p95": 25316,
    "seed_reads": 863677
}

p_ctg_stat

Total length of sequence:       400909193 bp
Total number of sequences:      9739
N25 stats:                      25% of total sequence length is contained in the 627 sequences >= 113996 bp
N50 stats:                      50% of total sequence length is contained in the 1783 sequences >= 68837 bp
N75 stats:                      75% of total sequence length is contained in the 3728 sequences >= 37641 bp
Total GC count:                 147692278 bp
GC %:                           36.84 %

Sep 22 '17 17:09 wyim-pgl

Hi,

You have sufficient raw coverage (almost 90-fold), so there should be plenty of raw data to get a decent draft assembly. What I see from your pre-assembly statistics, however, is that you only have 5- to 6-fold coverage of pre-assembled reads: "preassembled_coverage": 5.619

You will never achieve a contiguous or complete assembly with only 5-fold preads. You need closer to 15- to 25-fold pread coverage above a certain length threshold if you want a highly contiguous assembly.
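As a rough check of the numbers above, the reported pread coverage is consistent with corrected bases divided by the configured genome size, and the reported yield with corrected bases divided by seed bases. The sketch below recomputes both from the values posted in this thread. The function names and the `TARGET_COVERAGE` constant are illustrative assumptions and are not part of FALCON.

```python
def pread_coverage(preassembled_bases: int, genome_length: int) -> float:
    """Fold coverage of corrected reads (preads) over the assumed genome size."""
    return preassembled_bases / genome_length


def preassembly_yield(preassembled_bases: int, seed_bases: int) -> float:
    """Fraction of seed-read bases that survived error correction."""
    return preassembled_bases / seed_bases


# Values copied from the two pre-assembly reports posted in this thread.
RUNS = {
    "first run": {"preassembled_bases": 3045888537, "seed_bases": 51290352072},
    "second run": {"preassembled_bases": 4214378579, "seed_bases": 15000768227},
}
GENOME_LENGTH = 750_000_000  # genome_size from fc.cfg
TARGET_COVERAGE = 15.0       # assumption: lower end of the suggested 15- to 25-fold range

for name, stats in RUNS.items():
    cov = pread_coverage(stats["preassembled_bases"], GENOME_LENGTH)
    yld = preassembly_yield(stats["preassembled_bases"], stats["seed_bases"])
    shortfall = TARGET_COVERAGE - cov
    print(f"{name}: {cov:.3f}-fold preads, yield {yld:.3f}, "
          f"{shortfall:.1f}-fold below target")
```

Both runs lose most of the seed bases during correction, which is why the troubleshooting starts at the pre-assembly stage.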

You need to start by troubleshooting your pre-assembly. It may be beneficial to raise -e.70 to as much as -e.75 depending on the quality of your data.

Also, we generally don't recommend using the -a option in DBsplit: pa_DBsplit_option = -a -x500 -s400

Including the -a option will result in all subreads from all ZMWs being used. Excluding the -a option will result in only the best subread from a particular ZMW being used, which may have a large effect in a highly repetitive genome.

Hope this helps

Sep 22 '17 18:09 gconcepcion

I will restart from your recommendation and I will update results.

Thanks.

Sep 22 '17 18:09 wyim-pgl

Here is the follow-up result.

I removed the -a option, changed -e.70 to -e.75, and changed length_cutoff from -1 to 3000. It still doesn't look like enough. Any other recommendations?

preads_stat

 cat 0-rawreads/report/pre_assembly_stats.json
{
    "genome_length": 750000000,
    "length_cutoff": 3000,
    "preassembled_bases": 6175542129,
    "preassembled_coverage": 8.234,
    "preassembled_esize": 11399.998,
    "preassembled_mean": 10113.361,
    "preassembled_n50": 10956,
    "preassembled_p95": 16653,
    "preassembled_reads": 610632,
    "preassembled_seed_fragmentation": 1.181,
    "preassembled_seed_truncation": 1101.777,
    "preassembled_yield": 0.13,
    "raw_bases": 50058837997,
    "raw_coverage": 66.745,
    "raw_esize": 11049.715,
    "raw_mean": 7445.514,
    "raw_n50": 10321,
    "raw_p95": 16912,
    "raw_reads": 6723356,
    "seed_bases": 47453855282,
    "seed_coverage": 63.272,
    "seed_esize": 11542.476,
    "seed_mean": 9011.791,
    "seed_n50": 10669,
    "seed_p95": 17884,
    "seed_reads": 5265752
}

p_ctg_stat

Total length of sequence:	439610496 bp
Total number of sequences:	10732
N25 stats:			25% of total sequence length is contained in the 628 sequences >= 123886 bp
N50 stats:			50% of total sequence length is contained in the 1805 sequences >= 72968 bp
N75 stats:			75% of total sequence length is contained in the 3855 sequences >= 39183 bp
Total GC count:			162129330 bp
GC %:				36.88 %

Sep 24 '17 17:09 wyim-pgl

I agree that pread correction doesn't appear to be proceeding as efficiently as it should, which is still what's limiting the assembly. However, I also notice that your latest assembly is not using the same full dataset as the previous two, so it's not a fair apples-to-apples comparison. (Look at the raw read stats: the latest version appears to start with about 23X less raw coverage, ~67X vs. ~90X.)
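To make the comparison concrete, here is a small sketch that recomputes raw coverage for the earlier and latest runs from the `raw_bases` and `raw_reads` values in the posted reports. The helper names are hypothetical, and the genome size is the 750 Mbp value from the config.

```python
GENOME_LENGTH = 750_000_000  # genome_size from fc.cfg

# raw_reads / raw_bases copied from the pre-assembly reports in this thread.
EARLIER = {"raw_reads": 12404171, "raw_bases": 67280909359}
LATEST = {"raw_reads": 6723356, "raw_bases": 50058837997}


def raw_coverage(stats: dict, genome_length: int = GENOME_LENGTH) -> float:
    """Fold coverage of raw reads over the assumed genome size."""
    return stats["raw_bases"] / genome_length


def coverage_drop(before: dict, after: dict) -> float:
    """Absolute loss of raw coverage (in fold) between two runs."""
    return raw_coverage(before) - raw_coverage(after)


def read_fraction_kept(before: dict, after: dict) -> float:
    """Fraction of raw reads still present in the later run."""
    return after["raw_reads"] / before["raw_reads"]


print(f"earlier runs: {raw_coverage(EARLIER):.3f}X")
print(f"latest run:   {raw_coverage(LATEST):.3f}X")
print(f"drop:         {coverage_drop(EARLIER, LATEST):.1f}X")
print(f"reads kept:   {read_fraction_kept(EARLIER, LATEST):.1%}")
```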

One thing I notice is that between your initial assembly and your second version, you raised the raw read minimum overlap length from -l1000 to -l4800. Though the second assembly has slightly more corrected pread coverage, I have a feeling the very high minimum overlap length is restricting the amount of data being corrected. With a raw N50 of only 9220 (in your first assemblies), restricting the raw read overlaps to -l4800 would exclude a significant portion of the data. Maybe you should drop this cutoff to -l1500 or -l2000 or so.
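The read length histogram posted earlier gives a feel for how much data a given -l value rules out, since a read shorter than the minimum overlap length cannot take part in any qualifying overlap. The sketch below uses the cumulative "% Reads" and "% Bases" columns for the lowest bins. It only gives a lower bound, because longer reads must also overlap by at least -l bases, and the table and function names are illustrative.

```python
# bin start -> (cumulative % reads, cumulative % bases) in reads at least that long,
# copied from the lowest bins of the posted read length histogram.
CUMULATIVE = {
    0: (100.0, 100.0),
    1000: (89.4, 98.5),
    2000: (68.6, 92.9),
    3000: (56.0, 87.2),
    4000: (47.5, 81.7),
    5000: (40.8, 76.2),
}


def excluded_share(min_overlap: int) -> tuple:
    """Lower bound on (% reads, % bases) in reads too short for the given -l value."""
    # Largest tabulated bin start not above the threshold: every read below
    # that bin start is certainly shorter than min_overlap.
    bin_start = max(b for b in CUMULATIVE if b <= min_overlap)
    pct_reads, pct_bases = CUMULATIVE[bin_start]
    return round(100 - pct_reads, 1), round(100 - pct_bases, 1)


for l_value in (1000, 2000, 4800):
    reads, bases = excluded_share(l_value)
    print(f"-l{l_value}: at least {reads}% of reads and {bases}% of bases excluded")
```

With -l4800, at least 52.5% of reads and 18.3% of bases cannot contribute, compared with 10.6% of reads and 1.5% of bases at -l1000.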

Sep 25 '17 21:09 gconcepcion

Greg,

I used the same input, so I don't know why that happened. I just checked input.fofn and there are no differences. Anyway, I will reduce the overlap length parameter.

Thank you for your help.

Won

Sep 26 '17 00:09 wyim-pgl
