short preads and bad assembly with ~10X data

Open zhjilin opened this issue 8 years ago • 3 comments

Hi, I'm assembling a vertebrate genome (~380 Mb) with ~10X raw data (from RSII), following this tutorial: http://pb-falcon.readthedocs.io/en/latest/tutorial.html. I ran FALCON with the config file pasted below the stats, but I only got 25,220,472 bp in p_ctg.fa and 427,152 bp in a_ctg.fa.

The length distribution of the raw reads looks good to me, while the length distribution of the preads doesn't. I assume the main reason is low coverage? I also tried several configs with low --min_cov (1, 3, 5) and length_cutoff (200, 500, 1000); none of them yielded a better result.

Some information about the genome: the estimated repeat content is ~18% (based on a very closely related species), and heterozygosity is low (according to 17-mer analysis with Illumina reads).

Can anyone suggest parameter settings to get a better assembly? Many thanks!

Some statistics:

#stats from raw_reads.db
Statistics for all reads of length 500 bases or more
445,369 reads out of 767,222 (58.0%)
3,915,098,483 base pairs out of 5,605,090,355 (69.8%)
8,790 average read length
5,066 standard deviation
Base composition: 0.278(A) 0.216(C) 0.232(G) 0.274(T)

#stats from preads.db
Statistics for all reads of length 500 bases or more
445,369 reads out of 767,222 (58.0%)
3,915,098,483 base pairs out of 5,605,090,355 (69.8%)
8,790 average read length
5,066 standard deviation
Base composition: 0.278(A) 0.216(C) 0.232(G) 0.274(T)

#length distribution of raw reads and preads.

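RawReadHist2017-07-10.pdf: https://github.com/PacificBiosciences/FALCON/files/1134840/RawReadHist2017-07-10.pdf
PreadHist.2017-07-10.pdf: https://github.com/PacificBiosciences/FALCON/files/1134841/PreadHist.2017-07-10.pdf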

#Config
[General]
job_type = local
input_fofn = input.fofn
input_type = raw
length_cutoff = 200
genome_size = 380000000
seed_coverage = 2

length_cutoff_pr = 10000

job_queue = cashewcore
sge_option_da = -pe smp 5 -q %(job_queue)s
sge_option_la = -pe smp 20 -q %(job_queue)s
sge_option_cns = -pe smp 12 -q %(job_queue)s
sge_option_pda = -pe smp 6 -q %(job_queue)s
sge_option_pla = -pe smp 16 -q %(job_queue)s
sge_option_fc = -pe smp 24 -q %(job_queue)s

default_concurrent_jobs = 20
da_concurrent_jobs = 20
la_concurrent_jobs = 20
cns_concurrent_jobs = 20
pda_concurrent_jobs = 20
pla_concurrent_jobs = 20
pa_HPCdaligner_option = -v -B4 -t16 -e.70 -k18 -l500 -s200
ovlp_HPCdaligner_option = -v -B4 -t32 -h60 -e.96 -k24 -l500 -s200
pa_DBsplit_option = -x500 -s200
ovlp_DBsplit_option = -x500 -s200
falcon_sense_option = --output_multi --min_idt 0.70 --min_cov 1 --max_n_read 200 --n_core 6
overlap_filtering_setting = --max_diff 100 --max_cov 100 --min_cov 1 --bestn 10 --n_core 24

#pre_assembly_stats.json

{
  "genome_length": 380000000,
  "length_cutoff": 200,
  "preassembled_bases": 998374961,
  "preassembled_coverage": 2.627,
  "preassembled_esize": 8943.158,
  "preassembled_mean": 5791.508,
  "preassembled_n50": 8665,
  "preassembled_p95": 13853,
  "preassembled_reads": 172386,
  "preassembled_seed_fragmentation": 1.3,
  "preassembled_seed_truncation": 1520.094,
  "preassembled_yield": 0.255,
  "raw_bases": 3915098483,
  "raw_coverage": 10.303,
  "raw_esize": 11711.279,
  "raw_mean": 8790.685,
  "raw_n50": 11360,
  "raw_p95": 17341,
  "raw_reads": 445369,
  "seed_bases": 3915098483,
  "seed_coverage": 10.303,
  "seed_esize": 11711.279,
  "seed_mean": 8790.685,
  "seed_n50": 11360,
  "seed_p95": 17341,
  "seed_reads": 445369
}

zhjilin avatar Jul 10 '17 09:07 zhjilin
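
For readers following the numbers, here is a minimal Python sketch that recomputes the coverage and yield figures in the pre_assembly_stats.json above. It assumes these fields are plain ratios (bases divided by genome length, and preassembled bases divided by seed bases). That assumption is consistent with the reported values, but it is not taken from the FALCON source. The helper names `coverage` and `preassembly_yield` are hypothetical.

```python
# Values copied from the pre_assembly_stats.json in the post above.
stats = {
    "genome_length": 380_000_000,
    "raw_bases": 3_915_098_483,
    "seed_bases": 3_915_098_483,
    "preassembled_bases": 998_374_961,
}


def coverage(bases: int, genome_length: int) -> float:
    """Fold coverage, assumed to be total bases / genome length."""
    return bases / genome_length


def preassembly_yield(preassembled_bases: int, seed_bases: int) -> float:
    """Fraction of seed bases that survive into preads (assumed definition)."""
    return preassembled_bases / seed_bases


raw_cov = coverage(stats["raw_bases"], stats["genome_length"])
pread_cov = coverage(stats["preassembled_bases"], stats["genome_length"])
pread_yield = preassembly_yield(stats["preassembled_bases"], stats["seed_bases"])

print(f"raw coverage:   {raw_cov:.3f}x")
print(f"pread coverage: {pread_cov:.3f}x")
print(f"pread yield:    {pread_yield:.3f}")
```

Under that assumption, only about a quarter of the seed bases end up in preads, leaving under 3-fold pread coverage for the overlap and layout stages.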

Hi zhjilin,

Yeah, you really need more coverage to get a successful assembly with FALCON. I would recommend 40-fold at the bare minimum, and the most successful customers typically use 50-fold or more. Tweaking parameters cannot make up for such a large shortfall in coverage.

mseetin avatar Jul 10 '17 16:07 mseetin
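
To put the 40-fold and 50-fold recommendations in concrete terms, the sketch below estimates how many raw bases they imply for the 380 Mb genome in this thread, and how far the current 3,915,098,483 bases fall short. It is simple arithmetic on the figures quoted above. The function names are hypothetical, and it treats coverage as total bases divided by genome size.

```python
GENOME_SIZE = 380_000_000        # genome_size from the posted config
CURRENT_BASES = 3_915_098_483    # raw_bases from pre_assembly_stats.json


def required_bases(genome_size: int, target_coverage: int) -> int:
    """Total bases needed to reach the target fold coverage."""
    return genome_size * target_coverage


def additional_bases_needed(genome_size: int, target_coverage: int,
                            current_bases: int) -> int:
    """Bases still missing for the target coverage (0 if already reached)."""
    return max(0, required_bases(genome_size, target_coverage) - current_bases)


for target in (40, 50):
    total = required_bases(GENOME_SIZE, target)
    missing = additional_bases_needed(GENOME_SIZE, target, CURRENT_BASES)
    print(f"{target}x: {total:,} bases total, {missing:,} more needed")
```

By this estimate, reaching 40-fold would take roughly four times the data currently available.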

@mseetin many thanks for the reply. I have one more question: in my case I don't have enough coverage to get a complete genome, but I assume the preads generated here are still high-quality reads (just shorter)? Thanks!

zhjilin avatar Jul 12 '17 06:07 zhjilin

Well, they'll likely have been partially polished, but I can't comment on their accuracy, as I'm not sure anyone has measured the accuracy of the preads for such low amounts of coverage. They will almost certainly be lower accuracy than the preads that come out from a Falcon run with recommended coverage.

mseetin avatar Jul 12 '17 15:07 mseetin