
Too many reads filtered out?

Open melop opened this issue 3 years ago • 3 comments

Hello, I previously used a set of mate-pair libraries to scaffold ALLPATHS-LG scaffolds and was quite successful. Now I have a new assembly built from nanopore contigs and am trying to apply the same scaffolding procedure. However, it looks like most of the reads were discarded. What could be the reason for this?

Thanks,
Ray

Statistics.txt

Initial number of contigs: 48716.
Number of contigs discarded from further analysis (with -filter_contigs set to 10): 1
Time elapsed for reading in contig sequences:7.42810487747

PASS 1

-T 7107.0 -t 5673.0
Contamine mean before filtering : 3169.82170245
Contamine stddev before filtering: 22984.3175557
Contamine mean converged: 677.939688523
Contamine std_est converged: 1121.3850919

LIBRARY STATISTICS
Mean of library set to: 2805.0
Standard deviation of library set to: 717.0
MP library PE contamination:
Contamine rate (rev comp oriented) estimated to: False
lib contamine mean (avg fragmentation size): 0
lib contamine stddev: 0
Number of contamined reads used for this calculation: 10081.0
-T (library insert size threshold) set to: 7107.0
-k set to (Scaffolding with contigs larger than): 5673.0
Number of links required to create an edge: None
Maximum identical contig-end overlap-length to merge of contigs that are adjacent in a scaffold: 200
Read length set to: 62.41

Time elapsed for getting libmetrics, iteration 0: 2.3990881443

Parsing BAM file...
L50: 2662 N50: 126232
Initial contig assembly length: 1505017490
Time initializing BESST objects: 0.231798887253
Total time elapsed for initializing Graph: 0.617565870285
Reading bam file and creating scaffold graph...
ELAPSED reading file: 6581.09070301
NR OF FISHY READ LINKS: 139654
Number of USEFUL READS (reads mapping to different contigs uniquly): 338778484
Number of non unique reads (at least one read non-unique in read pair) that maps to different contigs (filtered out from scaffolding): 478966897
Reads with too large insert size from "USEFUL READS" (filtered out): 304923304
Initial number of edges in G (the graph with large contigs): 858809
Initial number of edges in G_prime (the full graph of all contigs before removal of repats): 2204910
Number of duplicated reads indicated and removed: 26299652
Mean coverage before filtering out extreme observations = 150.34412238
Std dev of coverage before filtering out extreme observations= 888.146107052
Mean coverage after filtering = 0.0386320692138
Std coverage after filtering = 0.0212291479729
Length of longest contig in calc of coverage: 1578503
Length of shortest contig in calc of coverage: 5673
Detecting repeats..
Removed a total of: 43707 repeats. With coverage larger than 0.13706564204
Number of edges in G (after repeat removal): 1503
Number of edges in G_prime (after repeat removal): 5008
Number of BWA buggy edges removed: 0
Number of edges in G (after filtering for buggy flag stats reporting): 1503
Number of edges in G_prime (after filtering for buggy flag stats reporting): 5008
Letting filtering threshold in high complexity regions be 5 for this library.
Letting -e be 5 for this library.
Removed 0 edges from graph G of border contigs.
Remove edges in high complexity areas.
Removed total of 0 edges in high density areas.
Removed an additional of 0 edges with low support from full graph G_prime of all contigs.
Number of significantly spurious edges: 0
Number of edges in G_prime (after removing edges under -e threshold (if not specified, default is -e 3): 5008


Nr of contigs/scaffolds included in this pass: 5008 Out of which 1503 acts as border contigs. Total time for CreateGraph-module, iteration 0: 6599.10073209

0 link edges created. Perform inference on scaffold graph... Remove isolated nodes. 1503 isolated contigs removed from graph.

Searching for paths BETWEEN scaffolds

Entering ELS.BetweenScaffolds single core iterating until maximum of 0 extensions. Number of nodes:10016, Number of edges: 5008 Elapsed time single core pathfinder: 0.0146651268005 0 paths detected are with score greater or equal to 1.5 Nr of contigs left: 5008.0 Nr of linking edges left: 0.0 Number of gaps estimated by GapEst-LP module order_contigs in this step is: 0 Time elapsed for making scaffolds, iteration 0: 0.152600049973

(super)Contigs after scaffolding: 5008

param value detect_haplotype False hit_path_threshold False lognormal False orientation rf gap_estimations [] hapl_threshold 3 gff_file None lower_cov_cutoff 0 path_gaps_estimated 0 expected_links_over_mean_plus_stddev 5 read_len 62.41 pass_number 1 path_threshold 100000 std_dev_coverage 0.0212291479729 mean_coverage 0.0386320692138 detect_duplicate True FASTER_ILP False development False std_dev_ins_size 717.0 NO_ILP False current_N50 126232 print_scores False mean_ins_size 2805.0 multiprocess False scaffold_indexer 48716 hapl_ratio 1.3 no_score True first_lib True current_L50 2662 plots False contigfile None cov_cutoff None contamination_ratio False ins_size_threshold 7107.0 edgesupport 5 extend_paths True tot_assembly_length 1505017490 max_extensions None score_cutoff 1.5 min_mapq 20 information_file <open file 'scaffold//BESST_output/Statistics.txt', mode 'w' at 0x7f6549be1300> contamination_mean 0 max_contig_overlap 200 contig_threshold 6958 contamination_stddev 0 dfs_traversal True

PASS 2

-T 8421.0 -t 7243.0 Contamine mean before filtering : 644.91884719 Contamine stddev before filtering: 7618.78938119 Contamine mean converged: 323.448900031 Contamine std_est converged: 136.579587956

LIBRARY STATISTICS Mean of library set to: 4887.0 Standard deviation of library set to: 589.0 MP library PE contamination: Contamine rate (rev comp oriented) estimated to: 0.220716849845 lib contamine mean (avg fragmentation size): 323.448900031 lib contamine stddev: 136.579587956 Number of contamined reads used for this calculation: 97730.0 -T (library insert size threshold) set to: 8421.0 -k set to (Scaffolding with contigs larger than): 7243.0 Number of links required to create an edge: None Maximum identical contig-end overlap-length to merge of contigs that are adjacent in a scaffold: 200 Read length set to: 48.78

Time elapsed for getting libmetrics, iteration 1: 2.95900011063

Parsing BAM file... L50: 0 N50: 0 Initial contig assembly length: 1505017490 Nr of contigs/scaffolds that was singeled out due to length constraints 368 Time cleaning BESST objects for next library: 0.00483298301697 Total time elapsed for initializing Graph: 0.0218350887299 Reading bam file and creating scaffold graph... ELAPSED reading file: 325.734697104 NR OF FISHY READ LINKS: 0 Number of USEFUL READS (reads mapping to different contigs uniquly): 0 Number of non unique reads (at least one read non-unique in read pair) that maps to different contigs (filtered out from scaffolding): 0 Reads with too large insert size from "USEFUL READS" (filtered out): 0 Initial number of edges in G (the graph with large contigs): 0 Initial number of edges in G_prime (the full graph of all contigs before removal of repats): 5008 Number of duplicated reads indicated and removed: 0 Mean coverage before filtering out extreme observations = 0.00857007310694 Std dev of coverage before filtering out extreme observations= 0.0171089338111 Mean coverage after filtering = 9.06876577268e-05 Std coverage after filtering = 0.000461890485611 Length of longest contig in calc of coverage: 89136 Length of shortest contig in calc of coverage: 7243 Number of edges in G (after repeat removal): 0 Number of edges in G_prime (after repeat removal): 5008 Number of BWA buggy edges removed: 0 Number of edges in G (after filtering for buggy flag stats reporting): 0 Number of edges in G_prime (after filtering for buggy flag stats reporting): 5008 Letting filtering threshold in high complexity regions be 5 for this library. Letting -e be 5 for this library. Removed 0 edges from graph G of border contigs. Remove edges in high complexity areas. Removed total of 0 edges in high density areas. Removed an additional of 0 edges with low support from full graph G_prime of all contigs. Number of edges in G_prime (after removing edges under -e threshold (if not specified, default is -e 3): 5008


Nr of contigs/scaffolds included in this pass: 5008 Out of which 1135 acts as border contigs. Total time for CreateGraph-module, iteration 1: 325.839869976

0 link edges created. Perform inference on scaffold graph... Remove isolated nodes. 0 isolated contigs removed from graph.

Searching for paths BETWEEN scaffolds

Entering ELS.BetweenScaffolds single core iterating until maximum of 0 extensions. Number of nodes:10016, Number of edges: 5008 Elapsed time single core pathfinder: 0.0115258693695 0 paths detected are with score greater or equal to 1.5 Nr of contigs left: 5008.0 Nr of linking edges left: 0.0 Number of gaps estimated by GapEst-LP module order_contigs in this step is: 0 Time elapsed for making scaffolds, iteration 1: 0.149516105652

(super)Contigs after scaffolding: 5008

param value detect_haplotype False hit_path_threshold False lognormal False orientation rf gap_estimations [] hapl_threshold 3 gff_file None lower_cov_cutoff 0 path_gaps_estimated 0 expected_links_over_mean_plus_stddev 5 read_len 48.78 pass_number 2 path_threshold 100000 std_dev_coverage 0.000461890485611 mean_coverage 9.06876577268e-05 detect_duplicate True FASTER_ILP False development False std_dev_ins_size 589.0 NO_ILP False current_N50 0 print_scores False mean_ins_size 4887.0 multiprocess False scaffold_indexer 48716 hapl_ratio 1.3 no_score True first_lib False current_L50 0 plots False contigfile None cov_cutoff None contamination_ratio 0.220716849845 ins_size_threshold 8421.0 edgesupport 5 extend_paths True tot_assembly_length 1505017490 max_extensions None score_cutoff 1.5 min_mapq 20 information_file <open file 'scaffold//BESST_output/Statistics.txt', mode 'w' at 0x7f6549be1300> contamination_mean 323.448900031 max_contig_overlap 200 contig_threshold 6958 contamination_stddev 136.579587956 dfs_traversal True

PASS 3

-T 13492.0 -t 11160.0 Contamine mean before filtering : 24633.7478754 Contamine stddev before filtering: 89658.0332439 Contamine mean converged: 6422.97819315 Contamine std_est converged: 4321.78894197

LIBRARY STATISTICS Mean of library set to: 6496.0 Standard deviation of library set to: 1166.0 MP library PE contamination: Contamine rate (rev comp oriented) estimated to: False lib contamine mean (avg fragmentation size): 0 lib contamine stddev: 0 Number of contamined reads used for this calculation: 321.0 -T (library insert size threshold) set to: 13492.0 -k set to (Scaffolding with contigs larger than): 11160.0 Number of links required to create an edge: None Maximum identical contig-end overlap-length to merge of contigs that are adjacent in a scaffold: 200 Read length set to: 168.11

Time elapsed for getting libmetrics, iteration 2: 3.24759888649

Parsing BAM file... L50: 0 N50: 0 Initial contig assembly length: 1505017490 Nr of contigs/scaffolds that was singeled out due to length constraints 486 Time cleaning BESST objects for next library: 0.00424909591675 Total time elapsed for initializing Graph: 0.0218479633331 Reading bam file and creating scaffold graph... ELAPSED reading file: 25.9454369545 NR OF FISHY READ LINKS: 0 Number of USEFUL READS (reads mapping to different contigs uniquly): 0 Number of non unique reads (at least one read non-unique in read pair) that maps to different contigs (filtered out from scaffolding): 0 Reads with too large insert size from "USEFUL READS" (filtered out): 0 Initial number of edges in G (the graph with large contigs): 0 Initial number of edges in G_prime (the full graph of all contigs before removal of repats): 5008 Number of duplicated reads indicated and removed: 0 Mean coverage before filtering out extreme observations = 0.00124294772942 Std dev of coverage before filtering out extreme observations= 0.00507329288087 Mean coverage after filtering = 0.00124294772942 Std coverage after filtering = 0.00507329288087 Length of longest contig in calc of coverage: 89136 Length of shortest contig in calc of coverage: 11160 Number of edges in G (after repeat removal): 0 Number of edges in G_prime (after repeat removal): 5008 Number of BWA buggy edges removed: 0 Number of edges in G (after filtering for buggy flag stats reporting): 0 Number of edges in G_prime (after filtering for buggy flag stats reporting): 5008 Letting filtering threshold in high complexity regions be 5 for this library. Letting -e be 5 for this library. Removed 0 edges from graph G of border contigs. Remove edges in high complexity areas. Removed total of 0 edges in high density areas. Removed an additional of 0 edges with low support from full graph G_prime of all contigs. Number of edges in G_prime (after removing edges under -e threshold (if not specified, default is -e 3): 5008


Nr of contigs/scaffolds included in this pass: 5008 Out of which 649 acts as border contigs. Total time for CreateGraph-module, iteration 2: 26.0430119038

0 link edges created. Perform inference on scaffold graph... Remove isolated nodes. 0 isolated contigs removed from graph.

Searching for paths BETWEEN scaffolds

Entering ELS.BetweenScaffolds single core iterating until maximum of 0 extensions. Number of nodes:10016, Number of edges: 5008 Elapsed time single core pathfinder: 0.0116968154907 0 paths detected are with score greater or equal to 1.5 Nr of contigs left: 5008.0 Nr of linking edges left: 0.0 Number of gaps estimated by GapEst-LP module order_contigs in this step is: 0 Time elapsed for making scaffolds, iteration 2: 0.221040964127

(super)Contigs after scaffolding: 5008

param value detect_haplotype False hit_path_threshold False lognormal False orientation rf gap_estimations [] hapl_threshold 3 gff_file None lower_cov_cutoff 0 path_gaps_estimated 0 expected_links_over_mean_plus_stddev 5 read_len 168.11 pass_number 3 path_threshold 100000 std_dev_coverage 0.00507329288087 mean_coverage 0.00124294772942 detect_duplicate True FASTER_ILP False development False std_dev_ins_size 1166.0 NO_ILP False current_N50 0 print_scores False mean_ins_size 6496.0 multiprocess False scaffold_indexer 48716 hapl_ratio 1.3 no_score True first_lib False current_L50 0 plots False contigfile None cov_cutoff None contamination_ratio False ins_size_threshold 13492.0 edgesupport 5 extend_paths True tot_assembly_length 1505017490 max_extensions None score_cutoff 1.5 min_mapq 20 information_file <open file 'scaffold//BESST_output/Statistics.txt', mode 'w' at 0x7f6549be1300> contamination_mean 0 max_contig_overlap 200 contig_threshold 6958 contamination_stddev 0 dfs_traversal True

PASS 4

-T 42537.0 -t 32239.0 Contamine mean before filtering : 29519.8535565 Contamine stddev before filtering: 68616.3455717 Contamine mean converged: 16048.5931953 Contamine std_est converged: 7824.95943874

LIBRARY STATISTICS Mean of library set to: 11643.0 Standard deviation of library set to: 5149.0 MP library PE contamination: Contamine rate (rev comp oriented) estimated to: False lib contamine mean (avg fragmentation size): 0 lib contamine stddev: 0 Number of contamined reads used for this calculation: 676.0 -T (library insert size threshold) set to: 42537.0 -k set to (Scaffolding with contigs larger than): 32239.0 Number of links required to create an edge: None Maximum identical contig-end overlap-length to merge of contigs that are adjacent in a scaffold: 200 Read length set to: 152.81

Time elapsed for getting libmetrics, iteration 3: 3.09710383415

Parsing BAM file... L50: 0 N50: 0 Initial contig assembly length: 1505017490 Nr of contigs/scaffolds that was singeled out due to length constraints 589 Time cleaning BESST objects for next library: 0.00451493263245 Total time elapsed for initializing Graph: 0.0214760303497 Reading bam file and creating scaffold graph... ELAPSED reading file: 16.8508169651 NR OF FISHY READ LINKS: 0 Number of USEFUL READS (reads mapping to different contigs uniquly): 0 Number of non unique reads (at least one read non-unique in read pair) that maps to different contigs (filtered out from scaffolding): 0 Reads with too large insert size from "USEFUL READS" (filtered out): 0 Initial number of edges in G (the graph with large contigs): 0 Initial number of edges in G_prime (the full graph of all contigs before removal of repats): 5008 Number of duplicated reads indicated and removed: 0 Mean coverage before filtering out extreme observations = 0.00143064790959 Std dev of coverage before filtering out extreme observations= 0.00381765871116 Mean coverage after filtering = 0.00143064790959 Std coverage after filtering = 0.00381765871116 Length of longest contig in calc of coverage: 89136 Length of shortest contig in calc of coverage: 32418 Number of edges in G (after repeat removal): 0 Number of edges in G_prime (after repeat removal): 5008 Number of BWA buggy edges removed: 0 Number of edges in G (after filtering for buggy flag stats reporting): 0 Number of edges in G_prime (after filtering for buggy flag stats reporting): 5008 Letting filtering threshold in high complexity regions be 5 for this library. Letting -e be 5 for this library. Removed 0 edges from graph G of border contigs. Remove edges in high complexity areas. Removed total of 0 edges in high density areas. Removed an additional of 0 edges with low support from full graph G_prime of all contigs. Number of edges in G_prime (after removing edges under -e threshold (if not specified, default is -e 3): 5008


Nr of contigs/scaffolds included in this pass: 5008 Out of which 60 acts as border contigs. Total time for CreateGraph-module, iteration 3: 16.9412498474

0 link edges created. Perform inference on scaffold graph... Remove isolated nodes. 0 isolated contigs removed from graph.

Searching for paths BETWEEN scaffolds

Entering ELS.BetweenScaffolds single core iterating until maximum of 0 extensions. Number of nodes:10016, Number of edges: 5008 Elapsed time single core pathfinder: 0.0114350318909 0 paths detected are with score greater or equal to 1.5 Nr of contigs left: 5008.0 Nr of linking edges left: 0.0 Number of gaps estimated by GapEst-LP module order_contigs in this step is: 0 Time elapsed for making scaffolds, iteration 3: 0.63897395134

(super)Contigs after scaffolding: 5008

param value detect_haplotype False hit_path_threshold False lognormal False orientation rf gap_estimations [] hapl_threshold 3 gff_file None lower_cov_cutoff 0 path_gaps_estimated 0 expected_links_over_mean_plus_stddev 5 read_len 152.81 pass_number 4 path_threshold 100000 std_dev_coverage 0.00381765871116 mean_coverage 0.00143064790959 detect_duplicate True FASTER_ILP False development False std_dev_ins_size 5149.0 NO_ILP False current_N50 0 print_scores False mean_ins_size 11643.0 multiprocess False scaffold_indexer 48716 hapl_ratio 1.3 no_score True first_lib False current_L50 0 plots False contigfile None cov_cutoff None contamination_ratio False ins_size_threshold 42537.0 edgesupport 5 extend_paths True tot_assembly_length 1505017490 max_extensions None score_cutoff 1.5 min_mapq 20 information_file <open file 'scaffold//BESST_output/Statistics.txt', mode 'w' at 0x7f6549be1300> contamination_mean 0 max_contig_overlap 200 contig_threshold 6958 contamination_stddev 0 dfs_traversal True

PASS 5

-T 333115.0 -t 259485.0

LIBRARY STATISTICS Mean of library set to: 112225.0 Standard deviation of library set to: 36815.0 MP library PE contamination: Contamine rate (rev comp oriented) estimated to: False lib contamine mean (avg fragmentation size): 0 lib contamine stddev: 0 Number of contamined reads used for this calculation: 0.0 -T (library insert size threshold) set to: 333115.0 -k set to (Scaffolding with contigs larger than): 259485.0 Number of links required to create an edge: None Maximum identical contig-end overlap-length to merge of contigs that are adjacent in a scaffold: 200 Read length set to: 719.6

Time elapsed for getting libmetrics, iteration 4: 0.472044944763

Parsing BAM file... L50: 0 N50: 0 Initial contig assembly length: 1505017490 Nr of contigs/scaffolds that was singeled out due to length constraints 60 Time cleaning BESST objects for next library: 0.00434803962708 Total time for CreateGraph-module, iteration 4: 0.00948882102966

0 link edges created. Perform inference on scaffold graph... Remove isolated nodes. 0 isolated contigs removed from graph.

Searching for paths BETWEEN scaffolds

Entering ELS.BetweenScaffolds single core iterating until maximum of 0 extensions. Number of nodes:0, Number of edges: 0 Elapsed time single core pathfinder: 4.2200088501e-05 0 paths detected are with score greater or equal to 1.5 Nr of contigs left: 0.0 Nr of linking edges left: 0.0 Number of gaps estimated by GapEst-LP module order_contigs in this step is: 0 Time elapsed for making scaffolds, iteration 4: 5.06741690636

(super)Contigs after scaffolding: 5008

param value detect_haplotype False hit_path_threshold False lognormal False orientation fr gap_estimations [] hapl_threshold 3 gff_file None lower_cov_cutoff 0 path_gaps_estimated 0 expected_links_over_mean_plus_stddev 5 read_len 719.6 pass_number 5 path_threshold 100000 std_dev_coverage 0.00381765871116 mean_coverage 0.00143064790959 detect_duplicate True FASTER_ILP False development False std_dev_ins_size 36815.0 NO_ILP False current_N50 0 print_scores False mean_ins_size 112225.0 multiprocess False scaffold_indexer 48716 hapl_ratio 1.3 no_score True first_lib False current_L50 0 plots False contigfile None cov_cutoff None contamination_ratio False ins_size_threshold 333115.0 edgesupport None extend_paths True tot_assembly_length 1505017490 max_extensions None score_cutoff 1.5 min_mapq 20 information_file <open file 'scaffold//BESST_output/Statistics.txt', mode 'w' at 0x7f6549be1300> contamination_mean 0 max_contig_overlap 200 contig_threshold 6958 contamination_stddev 0 dfs_traversal True

L50: 0 N50: 0 Initial contig assembly length: 1505017490 Total time for scaffolding: 7012.52787113
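For anyone reading along, the key PASS 1 numbers can be cross-checked with a few lines of Python. This is only a sketch: the figures are copied from the Statistics.txt output above, the variable names are my own, and the subtraction is my reading of how the counts relate, not something BESST reports directly.

```python
# Figures copied from PASS 1 of the Statistics.txt output above.
initial_contigs = 48716
discarded_by_filter_contigs = 1   # "-filter_contigs set to 10"
removed_as_repeats = 43707        # "Removed a total of: 43707 repeats."
useful_reads = 338778484          # "Number of USEFUL READS"
too_large_insert = 304923304      # filtered out of the "USEFUL READS"
edges_before = 858809             # "Initial number of edges in G"
edges_after = 1503                # "Number of edges in G (after repeat removal)"

# Contigs that survive the length filter and the repeat filter.
contigs_left = initial_contigs - discarded_by_filter_contigs - removed_as_repeats

# Fractions lost at each filtering step.
repeat_fraction = removed_as_repeats / initial_contigs
insert_fraction = too_large_insert / useful_reads
edge_fraction_kept = edges_after / edges_before

print(contigs_left)  # 5008
print(f"contigs removed as repeats: {repeat_fraction:.1%}")
print(f"useful reads dropped for insert size: {insert_fraction:.1%}")
print(f"edges kept in G after repeat removal: {edge_fraction_kept:.3%}")
```

The result of 5008 matches "Nr of contigs/scaffolds included in this pass: 5008", which suggests that roughly 90% of the contigs were classified as repeats in pass 1. About 90% of the "USEFUL READS" were also dropped for having too large an insert size.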

melop, Jul 05 '20 13:07

Hi Ray,

It looks like all contigs are filtered out due to highly variable coverage (or due to an unstable algorithm in BESST for inferring the mode of such a distribution). To fix this, simply set -z 10000 10000 10000 10000 10000. This effectively skips the filtering of contigs with very high coverage for all 5 libraries (you can set these values as desired; 10000 is just an example).
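As a rough sketch of what the full invocation might look like, the snippet below assembles the command line in Python and prints it without running anything. The file names and output directory are hypothetical. The -c, -f, -o and -orientation flags are assumed from typical runBESST usage; check them against runBESST -h for your installed version. The orientations follow the Statistics.txt above (rf for passes 1 to 4, fr for pass 5), and the sketch assumes the libraries are passed in the same order as the passes.

```python
import shlex

# Hypothetical file names; substitute your own contig FASTA and per-library BAM files.
contigs = "contigs.fa"
bams = [f"lib{i}.bam" for i in range(1, 6)]

# Orientations as reported per pass in the Statistics.txt above.
orientations = ["rf", "rf", "rf", "rf", "fr"]

# One -z value per library, as suggested; 10000 is just an example value.
cov_cutoffs = ["10000"] * len(bams)

cmd = [
    "runBESST",
    "-c", contigs,
    "-f", *bams,
    "-o", "scaffold",            # hypothetical output directory
    "-orientation", *orientations,
    "-z", *cov_cutoffs,
]

# Print the command so it can be inspected before running it by hand.
print(shlex.join(cmd))
```

The point of building the list programmatically is simply to keep the number of -z values in step with the number of libraries.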

Let me know if this works.

Best,
K

ksahlin, Jul 06 '20 19:07

Thank you for the quick reply! Do you think it has something to do with my setting min-mapq to 20? Should I set it to 0 instead, so that reads mapped to repetitive regions are also considered?

Ray
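To make the question concrete, here is a toy illustration of how a minimum-MAPQ cutoff thins out read pairs. It assumes a pair is kept only when both mates reach the cutoff, which is my assumption about the filter and not something confirmed in this thread. The MAPQ values are invented for illustration and do not come from the BAM files above.

```python
def count_passing_pairs(pair_mapqs, min_mapq):
    """Count read pairs in which both mates have MAPQ >= min_mapq."""
    return sum(1 for m1, m2 in pair_mapqs if m1 >= min_mapq and m2 >= min_mapq)

# Toy (mate1, mate2) MAPQ values, made up purely for illustration.
pairs = [(60, 60), (60, 0), (25, 19), (0, 0), (37, 42)]

for cutoff in (20, 0):
    print(cutoff, count_passing_pairs(pairs, cutoff))
# prints:
# 20 2
# 0 5
```

With a cutoff of 0 every pair passes, including pairs where a mate maps ambiguously, so any gain in links comes with less reliable placements.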

melop, Jul 07 '20 08:07

Sure, that might be a good idea to try! Let me know how it goes.

ksahlin, Jul 08 '20 17:07