Flye
Flye does not generate any output ("No disjointigs were assembled" message)
I have been trying to assemble a 10Mb genome with uncorrected nanopore data (3-4 chromosomes expected). We have a lot of data, is that the reason Flye fails at the end?
[2019-06-22 11:00:05] INFO: >>>STAGE: configure
[2019-06-22 11:00:05] INFO: Configuring run
[2019-06-22 11:00:27] INFO: Total read length: 10964270213
[2019-06-22 11:00:27] INFO: Input genome size: 10000000
[2019-06-22 11:00:27] INFO: Estimated coverage: 1096
[2019-06-22 11:00:27] WARNING: Expected read coverage is 1096, the assembly is not guaranteed to be optimal in this setting. Are you sure that the genome size was entered correctly?
[2019-06-22 11:00:27] INFO: Reads N50/N90: 29675 / 9753
[2019-06-22 11:00:27] INFO: Minimum overlap set to 5000
[2019-06-22 11:00:27] INFO: Selected k-mer size: 15
[2019-06-22 11:00:27] INFO: >>>STAGE: assembly
[2019-06-22 11:00:27] INFO: Assembling disjointigs
[2019-06-22 11:00:27] INFO: Reading sequences
[2019-06-22 11:01:01] INFO: Generating solid k-mer index
[2019-06-22 11:01:17] INFO: Counting k-mers (1/2): 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
[2019-06-22 11:02:49] INFO: Counting k-mers (2/2): 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
[2019-06-22 11:08:39] INFO: Filling index table 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
[2019-06-22 11:13:50] INFO: Extending reads
[2019-06-22 12:54:29] INFO: Overlap-based coverage: 1177
[2019-06-22 12:54:29] INFO: Median overlap divergence: 0.119637
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
[2019-06-23 17:20:11] INFO: Assembled 0 disjointigs
[2019-06-23 17:20:23] INFO: Generating sequence
[2019-06-23 17:22:11] ERROR: No disjointigs were assembled - please check if the read type and genome size parameters are correct
flye --nano-raw one.fastq --out-dir flye --genome-size 10m --threads 20
Interesting: it looks like a lot of overlaps were indeed found, but no disjointigs were assembled. Could you send me the full flye.log? I also suggest trying --meta mode - it is more robust in solid k-mer selection when there is contamination or artificial instrument-derived sequence in the data.
[2019-06-22 11:00:05] root: INFO: Starting Flye 2.4.2-release
[2019-06-22 11:00:05] root: DEBUG: Cmd: /home/stelo/miniconda2/bin/flye --nano-raw Bduncani_06182019_pass.fastq --out-dir babesia_flye --genome-size 10m --threads 20
[2019-06-22 11:00:05] root: INFO: >>>STAGE: configure
[2019-06-22 11:00:05] root: INFO: Configuring run
[2019-06-22 11:00:27] root: INFO: Total read length: 10964270213
[2019-06-22 11:00:27] root: INFO: Input genome size: 10000000
[2019-06-22 11:00:27] root: INFO: Estimated coverage: 1096
[2019-06-22 11:00:27] root: WARNING: Expected read coverage is 1096, the assembly is not guaranteed to be optimal in this setting. Are you sure that the genome size was entered correctly?
[2019-06-22 11:00:27] root: INFO: Reads N50/N90: 29675 / 9753
[2019-06-22 11:00:27] root: INFO: Minimum overlap set to 5000
[2019-06-22 11:00:27] root: INFO: Selected k-mer size: 15
[2019-06-22 11:00:27] root: INFO: >>>STAGE: assembly
[2019-06-22 11:00:27] root: INFO: Assembling disjointigs
[2019-06-22 11:00:27] root: DEBUG: -----Begin assembly log------
[2019-06-22 11:00:27] root: DEBUG: Running: flye-assemble -l /24-2/home/stelo/babesia/babesia_flye/flye.log -t 20 -v 5000 -k 15 Bduncani_06182019_pass.fastq /24-2/home/stelo/babesia/babesia_flye/00-assembly/draft_assembly.fasta 10000000 /home/stelo/miniconda2/lib/python2.7/site-packages/flye/config/bin_cfg/asm_raw_reads.cfg
[2019-06-22 11:00:27] DEBUG: Build date: Apr 7 2019 02:34:37
[2019-06-22 11:00:27] DEBUG: Total RAM: 251 Gb
[2019-06-22 11:00:27] DEBUG: Available RAM: 245 Gb
[2019-06-22 11:00:27] DEBUG: Total CPUs: 40
[2019-06-22 11:00:27] DEBUG: Parameters:
[2019-06-22 11:00:27] DEBUG: big_genome_threshold=29000000
[2019-06-22 11:00:27] DEBUG: low_cutoff_warning=1
[2019-06-22 11:00:27] DEBUG: hard_min_coverage_rate=10
[2019-06-22 11:00:27] DEBUG: assemble_kmer_sample=1
[2019-06-22 11:00:27] DEBUG: repeat_graph_kmer_sample=1
[2019-06-22 11:00:27] DEBUG: read_align_kmer_sample=1
[2019-06-22 11:00:27] DEBUG: maximum_jump=1500
[2019-06-22 11:00:27] DEBUG: maximum_overhang=1500
[2019-06-22 11:00:27] DEBUG: repeat_kmer_rate=100
[2019-06-22 11:00:27] DEBUG: assemble_ovlp_divergence=0.30
[2019-06-22 11:00:27] DEBUG: repeat_graph_ovlp_divergence=0.15
[2019-06-22 11:00:27] DEBUG: repeat_graph_ovlp_end_adjust=0.00
[2019-06-22 11:00:27] DEBUG: read_align_ovlp_divergence=0.25
[2019-06-22 11:00:27] DEBUG: max_coverage_drop_rate=5
[2019-06-22 11:00:27] DEBUG: chimera_window=100
[2019-06-22 11:00:27] DEBUG: min_reads_in_disjointig=4
[2019-06-22 11:00:27] DEBUG: max_inner_reads=10
[2019-06-22 11:00:27] DEBUG: max_inner_fraction=0.25
[2019-06-22 11:00:27] DEBUG: add_unassembled_reads=0
[2019-06-22 11:00:27] DEBUG: max_separation=500
[2019-06-22 11:00:27] DEBUG: tip_length_threshold=100000
[2019-06-22 11:00:27] DEBUG: unique_edge_length=50000
[2019-06-22 11:00:27] DEBUG: min_repeat_res_support=0.51
[2019-06-22 11:00:27] DEBUG: out_paths_ratio=5
[2019-06-22 11:00:27] DEBUG: graph_cov_drop_rate=10
[2019-06-22 11:00:27] DEBUG: coverage_estimate_window=100
[2019-06-22 11:00:27] DEBUG: extend_contigs_with_repeats=1
[2019-06-22 11:00:27] DEBUG: Running with k-mer size: 15
[2019-06-22 11:00:27] DEBUG: Running with minimum overlap 5000
[2019-06-22 11:00:27] DEBUG: Metagenome mode: N
[2019-06-22 11:00:27] INFO: Reading sequences
[2019-06-22 11:01:01] DEBUG: Building positional index
[2019-06-22 11:01:01] DEBUG: Total sequence: 10964270213 bp
[2019-06-22 11:01:01] DEBUG: Expected read coverage: 1096
[2019-06-22 11:01:01] INFO: Generating solid k-mer index
[2019-06-22 11:01:01] DEBUG: Hard threshold set to 5
[2019-06-22 11:01:01] DEBUG: Started k-mer counting
[2019-06-22 11:01:17] INFO: Counting k-mers (1/2):
[2019-06-22 11:02:49] INFO: Counting k-mers (2/2):
[2019-06-22 11:08:39] DEBUG: Estimated minimum kmer coverage: 155
[2019-06-22 11:08:39] DEBUG: Filtered 301351751 erroneous k-mers
[2019-06-22 11:08:39] DEBUG: Repetitive k-mer frequency: 55681
[2019-06-22 11:08:39] DEBUG: Filtered 897 repetitive k-mers (8.98678e-05)
[2019-06-22 11:08:39] INFO: Filling index table
[2019-06-22 11:08:44] DEBUG: Sampling rate: 1
[2019-06-22 11:08:44] DEBUG: Solid k-mers: 9980428
[2019-06-22 11:08:44] DEBUG: K-mer index size: 5380562281
[2019-06-22 11:08:44] DEBUG: Mean k-mer frequency: 539.111
[2019-06-22 11:12:31] DEBUG: Sorting k-mer index
[2019-06-22 11:13:50] DEBUG: Peak RAM usage: 28 Gb
[2019-06-22 11:13:50] INFO: Extending reads
[2019-06-22 11:13:50] DEBUG: Estimating overlap coverage
[2019-06-22 12:54:29] INFO: Overlap-based coverage: 1177
[2019-06-22 12:54:29] INFO: Median overlap divergence: 0.119637
[2019-06-22 12:54:29] DEBUG: Sequence divergence distribution:
| *
| *
| * *
| ** **
| *****
| ******
| ********
| ********
| *********
| *********
| ***********
| ************
| ************* *
| ************* *
| ************* *
| ***************** *
| *********************
| **********************
| *************************
| **************************************** * * ** *
----------------------------------------------------------------------------------------------------
0% 5% 10% 15% 20% 25% 30% 35% 40% 45%
Q25 = 0.1, Q50 = 0.12, Q75 = 0.14
[2019-06-23 17:20:11] INFO: Assembled 0 disjointigs
[2019-06-23 17:20:23] INFO: Generating sequence
[2019-06-23 17:20:23] DEBUG: Writing FASTA
[2019-06-23 17:20:23] DEBUG: Peak RAM usage: 78 Gb
-----------End assembly log------------
[2019-06-23 17:22:11] root: ERROR: No disjointigs were assembled - please check if the read type and genome size parameters are correct
Thank you, indeed looks strange. Maybe high coverage confuses Flye, but I also suspect there might be some non-target reads in the sample.
I suggest trying two more runs: (i) metagenome mode (--meta), and (ii) normal mode with --asm-coverage 50, which uses only the longest reads, up to 50x coverage, for disjointig assembly. Please post the corresponding logs as well.
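To make the idea behind "the longest 50x reads" concrete, here is a minimal sketch of that kind of selection. It only illustrates the concept and is not Flye's actual implementation; the function name `select_longest_reads` and the toy numbers are made up for this example.

```python
def select_longest_reads(read_lengths, genome_size, target_coverage):
    """Pick the longest reads until their total length reaches
    target_coverage x genome_size (illustrative only, not Flye's code)."""
    target_bases = genome_size * target_coverage
    selected = []
    total = 0
    for length in sorted(read_lengths, reverse=True):
        if total >= target_bases:
            break
        selected.append(length)
        total += length
    return selected


# Toy example: a 100 bp "genome" and a 3x target, i.e. 300 bases.
lengths = [50, 120, 80, 200, 30, 10]
subset = select_longest_reads(lengths, genome_size=100, target_coverage=3)
print(subset, sum(subset))
```

With a 10 Mb genome and about 11 Gb of reads, as in the log above, a 50x target would keep roughly 500 Mb of the longest reads for this stage.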
I just finished running Flye with the two settings you suggested. Both runs completed, but the assembly with --asm-coverage 50 seems better (in terms of N50, total size, etc.). Thank you!
Glad that it helped!
The solution of normal mode with --asm-coverage 50 has helped in a similar case where lots of overlap is found but no disjointigs are assembled for a plasmid!
@fenderglass Could you please take a quick look at the log output for the sample where flye fails to assemble disjointigs: gist.github.com/ptrebert/3964d66cd60af3e7a19d95d166707ed2
Since I am running flye with --asm-coverage 50 by default, I am a bit unsure how to proceed with this sample.
@ptrebert Seems strange. My only guess would be that PacBio reads might not be properly split into subreads (we had a couple cases like that before). Try to process the reads with https://github.com/fenderglass/pbclip - it should tell you if there is a significant amount of "chimeric" subreads.
Alternatively, you can also try to run with the --meta option if the reads turn out ok.
@fenderglass Ok, thanks for pointing out your tool, I'll check that and get back to you.
ping: testing Flye 2.7b-b1562 on sample with no disjointigs assembled - still running...
@fenderglass For my problematic sample, flye 2.7b did not solve the issue (same "no disjointigs assembled"). I followed your suggestion and used your pbclip tool, which finished and reported the following:
Good: 15725667 chopped: 409754 bad: 662955
Could you help with interpreting these numbers (I may want to get in touch with the seq lab about this sample)? I'll try to assemble the output FASTA now with flye v2.7b, let's see what happens.
@ptrebert
pbclip finds PacBio reads that were not properly split into subreads. Depending on the DNA library, the polymerase might make multiple passes over the fragment (which is used to produce high-quality CCS reads). However, fragments in CLR libraries (at least from the assembly perspective) are not expected to be read multiple times to produce longer reads. When multiple passes do happen, such reads should be split into subreads (each subread is a single polymerase pass). Typically this is handled by the PacBio software at the FASTQ generation stage.
The numbers suggest that ~40% of your reads have multiple polymerase passes. This is a lot (a typical value could be 1-2%) and suggests that there is indeed an issue with subread splitting. The chopped reads are those that pbclip was able to split into parts successfully. The bad reads are the reads with the same pattern that pbclip was not able to recover.
Feel free to run the latest Flye version on the output produced by pbclip - I think it should work now. You can also double-check with the lab whether they performed subread splitting or have the raw PacBio files to regenerate valid FASTQs.
@fenderglass Thanks a lot for your detailed explanation. I am not sure, however, I can follow your argument about the 40% "bad" reads:
Total: 16798376
Bad = chopped + bad = 409754 + 662955 = 1072709
% bad = 1072709 / 16798376 ~ 6.4%
Am I missing something, or did you just misread the "bad" number as 6 million instead of 600k? In either case, thanks again for all your input, that is very valuable. I'll update this issue as soon as I have the 2.7b results for the corrected reads.
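The arithmetic above can be reproduced with a few lines of Python. This is only a sketch: it assumes the pbclip summary is a single line of `name: count` pairs exactly as quoted in this thread, and `pbclip_summary` is a hypothetical helper, not part of pbclip.

```python
import re


def pbclip_summary(line):
    """Parse a 'Good: N chopped: N bad: N' line (format as quoted in this
    thread) and return (total reads, affected reads, affected fraction)."""
    counts = {name.lower(): int(n) for name, n in re.findall(r"(\w+):\s*(\d+)", line)}
    total = counts["good"] + counts["chopped"] + counts["bad"]
    affected = counts["chopped"] + counts["bad"]
    return total, affected, affected / total


total, affected, fraction = pbclip_summary("Good: 15725667 chopped: 409754 bad: 662955")
print(f"total={total} affected={affected} fraction={fraction:.1%}")
```

Counting both chopped and bad reads as affected gives about 6.4% of all reads, which is still above the 1-2% mentioned as typical.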
Probably the last comment regarding this: even with the corrected reads (FASTA input now), flye 2.7b fails to assemble disjointigs. Seems like there is something else off about this data...
@ptrebert I see - this could be tricky sometimes. Did you have any luck with other assemblers? Wtdbg2 might be a fast way to check.
@fenderglass If I find the time, I'll try another assembler. For now, I asked the sequencing centre to double-check everything about this particular sample, let's see if they find something...
@fenderglass A postdoc in the sequencing center that produced the problematic data in the first place ran a couple of tests with different input combinations, and also with wtdbg2 as a comparison. Since none of those test runs produced an assembly, it seems fairly clear that the problem is the data. Just out of curiosity, since we have all the flye logs for the different runs, is there any statistic in those log files that could tell us anything about the problem(s) in the data? To me, they all look pretty similar (well, they all failed), so just being thorough here...
@ptrebert good to know, thanks for the update! At this early stage of assembly, not much can be inferred from the logs, I think. I guess if the log shows that "Overlap-based coverage" is reasonable (let's say, >10), but no disjointigs are produced, then there is a problem somewhere.
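A small script can apply that rule of thumb to a saved log. The sketch below assumes the log lines look like the ones posted earlier in this thread ("Overlap-based coverage: N" and "Assembled N disjointigs"); the function name and the default threshold of 10 are only taken from the comment above, not from Flye itself.

```python
import re

OVERLAP_RE = re.compile(r"Overlap-based coverage:\s*(\d+)")
DISJOINTIG_RE = re.compile(r"Assembled (\d+) disjointigs")


def summarize_flye_log(text, min_overlap_coverage=10):
    """Return (overlap coverage, disjointig count, suspicious flag).

    'suspicious' is True when overlap coverage looks reasonable but no
    disjointigs were assembled. Missing values are returned as None.
    """
    ovlp = OVERLAP_RE.search(text)
    disj = DISJOINTIG_RE.search(text)
    overlap_cov = int(ovlp.group(1)) if ovlp else None
    disjointigs = int(disj.group(1)) if disj else None
    suspicious = (
        overlap_cov is not None
        and overlap_cov > min_overlap_coverage
        and disjointigs == 0
    )
    return overlap_cov, disjointigs, suspicious


sample = """[2019-06-22 12:54:29] INFO: Overlap-based coverage: 1177
[2019-06-23 17:20:11] INFO: Assembled 0 disjointigs"""
print(summarize_flye_log(sample))
```

To check a real run, read the contents of flye.log from the output directory and pass the text to `summarize_flye_log`.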
No, they all show a zero for the "overlap-based coverage". Whatever the problem is, it's in the data then... thanks for all your support!
Hello all, I am working on a Mycobacterium ulcerans genome which was sequenced with Oxford Nanopore technology. I am trying to do de novo assembly with Flye, but I run into a warning and the pipeline stops. The command I used is
flye --nano-raw filename.fa -o outdir -g 0.05m -t 34 -i 2
and I get the message below:
WARNING: Expected read coverage is 4744, the assembly is not guaranteed to be optimal in this setting. Are you sure that the genome size was entered correctly? Pipeline aborted
@jotes35 your expected genome size is 50kb (0.05 Mb). It needs to be "5m", not "0.05m" (assuming you are aiming for 5 Mb genome).
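The expected coverage in the warning is essentially the total read length divided by the genome size, so a genome size that is 100 times too small inflates it by the same factor (4744x for 0.05m corresponds to roughly 47x for 5m). Below is a rough sketch of that calculation. The helper names are hypothetical, accepting `k`/`m`/`g` suffixes is an assumption of this sketch, and integer division is used only because it matches the truncated value in the first log of this thread.

```python
def parse_genome_size(value):
    """Convert a size string such as '5m' or '0.05m' to a number of bases.
    Suffix handling (k/m/g) is an assumption of this sketch."""
    suffixes = {"k": 1_000, "m": 1_000_000, "g": 1_000_000_000}
    value = value.strip().lower()
    if value and value[-1] in suffixes:
        return int(round(float(value[:-1]) * suffixes[value[-1]]))
    return int(value)


def expected_coverage(total_read_length, genome_size):
    """Total sequenced bases divided by genome size (truncated)."""
    return total_read_length // genome_size


# Values from the first log in this thread: 10964270213 bases, genome size 10m.
print(expected_coverage(10964270213, parse_genome_size("10m")))
# The same string with a misplaced decimal point is 100 times smaller.
print(parse_genome_size("0.05m"), parse_genome_size("5m"))
```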
Please, is there a way to know the expected genome size beforehand?
@fenderglass is there a way to know the expected genome size before starting the assembly?
@jotes35 Please check the FAQ - it provides some answers to your question. Let me know if anything is unclear.
Hello, I have the same problem: "No disjointigs were assembled". The expected genome size is 110M and my expected coverage is about 49. I tried --meta and different --asm-coverage values (since my overall coverage is smaller than 50x), but it didn't solve the issue. My N50 is quite high; could that be the reason I am getting the error? P40.pdf
@eyad. This is what worked for me: I looked up the genome size of my organism (in my case 6.5 Mb), but with that value Flye still raised the warning. I reduced it to 5M and the message did not come up again.
@eyayd could you post the log of the run with the --meta option?
Thank you very much for your prompt reply!
I am afraid I don't have it anymore. I am re-running it now, will post it asap. I am also currently trying the 2.7.1 version.
I have another nanopore run of the same genome which has less coverage and a slightly smaller N50. Flye finds fewer overlaps and runs with no error. I am posting the log of that sample, in case it is helpful. N6_G344.pdf
@eyayd Something might not be right with your sample. Your expected genome size is 100m, and the coverage should be roughly 50x. But based on overlaps, the coverage is 600x - so this does not add up. No disjointigs were assembled, which means that even though there was sufficient coverage, there were no reads that could be joined into contiguous fragments. For example, this is what you might see from amplicon sequencing or PCR-based selection.
If you could share more details about your sample, I might have more insights.