Flye does not generate any output ("No disjointigs were assembled" message)

Open StefanoLonardi opened this issue 5 years ago • 92 comments

I have been trying to assemble a 10Mb genome with uncorrected nanopore data (3-4 chromosomes expected). We have a lot of data, is that the reason Flye fails at the end?

[2019-06-22 11:00:05] INFO: >>>STAGE: configure
[2019-06-22 11:00:05] INFO: Configuring run
[2019-06-22 11:00:27] INFO: Total read length: 10964270213
[2019-06-22 11:00:27] INFO: Input genome size: 10000000
[2019-06-22 11:00:27] INFO: Estimated coverage: 1096
[2019-06-22 11:00:27] WARNING: Expected read coverage is 1096, the assembly is not guaranteed to be optimal in this setting. Are you sure that the genome size was entered correctly?
[2019-06-22 11:00:27] INFO: Reads N50/N90: 29675 / 9753
[2019-06-22 11:00:27] INFO: Minimum overlap set to 5000
[2019-06-22 11:00:27] INFO: Selected k-mer size: 15
[2019-06-22 11:00:27] INFO: >>>STAGE: assembly
[2019-06-22 11:00:27] INFO: Assembling disjointigs
[2019-06-22 11:00:27] INFO: Reading sequences
[2019-06-22 11:01:01] INFO: Generating solid k-mer index
[2019-06-22 11:01:17] INFO: Counting k-mers (1/2): 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
[2019-06-22 11:02:49] INFO: Counting k-mers (2/2): 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
[2019-06-22 11:08:39] INFO: Filling index table 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
[2019-06-22 11:13:50] INFO: Extending reads
[2019-06-22 12:54:29] INFO: Overlap-based coverage: 1177
[2019-06-22 12:54:29] INFO: Median overlap divergence: 0.119637
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
[2019-06-23 17:20:11] INFO: Assembled 0 disjointigs
[2019-06-23 17:20:23] INFO: Generating sequence
[2019-06-23 17:22:11] ERROR: No disjointigs were assembled - please check if the read type and genome size parameters are correct

flye --nano-raw one.fastq --out-dir flye --genome-size 10m --threads 20

StefanoLonardi avatar Jun 24 '19 01:06 StefanoLonardi

Interesting, it looks like a lot of overlaps were indeed found, but no disjointigs were assembled. Is it possible to send me the full flye.log? I also suggest trying --meta mode - it is more robust to solid k-mer selection in case there is any contamination / instrumental artificial sequence.

mikolmogorov avatar Jun 25 '19 06:06 mikolmogorov

[2019-06-22 11:00:05] root: INFO: Starting Flye 2.4.2-release
[2019-06-22 11:00:05] root: DEBUG: Cmd: /home/stelo/miniconda2/bin/flye --nano-raw Bduncani_06182019_pass.fastq --out-dir babesia_flye --genome-size 10m --threads 20
[2019-06-22 11:00:05] root: INFO: >>>STAGE: configure
[2019-06-22 11:00:05] root: INFO: Configuring run
[2019-06-22 11:00:27] root: INFO: Total read length: 10964270213
[2019-06-22 11:00:27] root: INFO: Input genome size: 10000000
[2019-06-22 11:00:27] root: INFO: Estimated coverage: 1096
[2019-06-22 11:00:27] root: WARNING: Expected read coverage is 1096, the assembly is not guaranteed to be optimal in this setting. Are you sure that the genome size was entered correctly?
[2019-06-22 11:00:27] root: INFO: Reads N50/N90: 29675 / 9753
[2019-06-22 11:00:27] root: INFO: Minimum overlap set to 5000
[2019-06-22 11:00:27] root: INFO: Selected k-mer size: 15
[2019-06-22 11:00:27] root: INFO: >>>STAGE: assembly
[2019-06-22 11:00:27] root: INFO: Assembling disjointigs
[2019-06-22 11:00:27] root: DEBUG: -----Begin assembly log------
[2019-06-22 11:00:27] root: DEBUG: Running: flye-assemble -l /24-2/home/stelo/babesia/babesia_flye/flye.log -t 20 -v 5000 -k 15 Bduncani_06182019_pass.fastq /24-2/home/stelo/babesia/babesia_flye/00-assembly/draft_assembly.fasta 10000000 /home/stelo/miniconda2/lib/python2.7/site-packages/flye/config/bin_cfg/asm_raw_reads.cfg
[2019-06-22 11:00:27] DEBUG: Build date: Apr 7 2019 02:34:37
[2019-06-22 11:00:27] DEBUG: Total RAM: 251 Gb
[2019-06-22 11:00:27] DEBUG: Available RAM: 245 Gb
[2019-06-22 11:00:27] DEBUG: Total CPUs: 40
[2019-06-22 11:00:27] DEBUG: Parameters:
[2019-06-22 11:00:27] DEBUG: big_genome_threshold=29000000
[2019-06-22 11:00:27] DEBUG: low_cutoff_warning=1
[2019-06-22 11:00:27] DEBUG: hard_min_coverage_rate=10
[2019-06-22 11:00:27] DEBUG: assemble_kmer_sample=1
[2019-06-22 11:00:27] DEBUG: repeat_graph_kmer_sample=1
[2019-06-22 11:00:27] DEBUG: read_align_kmer_sample=1
[2019-06-22 11:00:27] DEBUG: maximum_jump=1500
[2019-06-22 11:00:27] DEBUG: maximum_overhang=1500
[2019-06-22 11:00:27] DEBUG: repeat_kmer_rate=100
[2019-06-22 11:00:27] DEBUG: assemble_ovlp_divergence=0.30
[2019-06-22 11:00:27] DEBUG: repeat_graph_ovlp_divergence=0.15
[2019-06-22 11:00:27] DEBUG: repeat_graph_ovlp_end_adjust=0.00
[2019-06-22 11:00:27] DEBUG: read_align_ovlp_divergence=0.25
[2019-06-22 11:00:27] DEBUG: max_coverage_drop_rate=5
[2019-06-22 11:00:27] DEBUG: chimera_window=100
[2019-06-22 11:00:27] DEBUG: min_reads_in_disjointig=4
[2019-06-22 11:00:27] DEBUG: max_inner_reads=10
[2019-06-22 11:00:27] DEBUG: max_inner_fraction=0.25
[2019-06-22 11:00:27] DEBUG: add_unassembled_reads=0
[2019-06-22 11:00:27] DEBUG: max_separation=500
[2019-06-22 11:00:27] DEBUG: tip_length_threshold=100000
[2019-06-22 11:00:27] DEBUG: unique_edge_length=50000
[2019-06-22 11:00:27] DEBUG: min_repeat_res_support=0.51
[2019-06-22 11:00:27] DEBUG: out_paths_ratio=5
[2019-06-22 11:00:27] DEBUG: graph_cov_drop_rate=10
[2019-06-22 11:00:27] DEBUG: coverage_estimate_window=100
[2019-06-22 11:00:27] DEBUG: extend_contigs_with_repeats=1
[2019-06-22 11:00:27] DEBUG: Running with k-mer size: 15
[2019-06-22 11:00:27] DEBUG: Running with minimum overlap 5000
[2019-06-22 11:00:27] DEBUG: Metagenome mode: N
[2019-06-22 11:00:27] INFO: Reading sequences
[2019-06-22 11:01:01] DEBUG: Building positional index
[2019-06-22 11:01:01] DEBUG: Total sequence: 10964270213 bp
[2019-06-22 11:01:01] DEBUG: Expected read coverage: 1096
[2019-06-22 11:01:01] INFO: Generating solid k-mer index
[2019-06-22 11:01:01] DEBUG: Hard threshold set to 5
[2019-06-22 11:01:01] DEBUG: Started k-mer counting
[2019-06-22 11:01:17] INFO: Counting k-mers (1/2):
[2019-06-22 11:02:49] INFO: Counting k-mers (2/2):
[2019-06-22 11:08:39] DEBUG: Estimated minimum kmer coverage: 155
[2019-06-22 11:08:39] DEBUG: Filtered 301351751 erroneous k-mers
[2019-06-22 11:08:39] DEBUG: Repetitive k-mer frequency: 55681
[2019-06-22 11:08:39] DEBUG: Filtered 897 repetitive k-mers (8.98678e-05)
[2019-06-22 11:08:39] INFO: Filling index table
[2019-06-22 11:08:44] DEBUG: Sampling rate: 1
[2019-06-22 11:08:44] DEBUG: Solid k-mers: 9980428
[2019-06-22 11:08:44] DEBUG: K-mer index size: 5380562281
[2019-06-22 11:08:44] DEBUG: Mean k-mer frequency: 539.111
[2019-06-22 11:12:31] DEBUG: Sorting k-mer index
[2019-06-22 11:13:50] DEBUG: Peak RAM usage: 28 Gb
[2019-06-22 11:13:50] INFO: Extending reads
[2019-06-22 11:13:50] DEBUG: Estimating overlap coverage
[2019-06-22 12:54:29] INFO: Overlap-based coverage: 1177
[2019-06-22 12:54:29] INFO: Median overlap divergence: 0.119637
[2019-06-22 12:54:29] DEBUG: Sequence divergence distribution:

|                      *
|                      *
|                    * *
|                   ** **
|                   *****
|                   ******
|                   ********
|                   ********
|                  *********
|                  *********
|                  ***********
|                 ************
|                 ************* *
|                 ************* *
|                 ************* *
|                *****************  *
|                *********************
|                **********************
|               *************************
|             **************************************** * *     ** *
----------------------------------------------------------------------------------------------------
0%        5%        10%       15%       20%       25%       30%       35%       40%       45%

Q25 = 0.1, Q50 = 0.12, Q75 = 0.14

[2019-06-23 17:20:11] INFO: Assembled 0 disjointigs
[2019-06-23 17:20:23] INFO: Generating sequence
[2019-06-23 17:20:23] DEBUG: Writing FASTA
[2019-06-23 17:20:23] DEBUG: Peak RAM usage: 78 Gb
-----------End assembly log------------
[2019-06-23 17:22:11] root: ERROR: No disjointigs were assembled - please check if the read type and genome size parameters are correct

StefanoLonardi avatar Jun 25 '19 15:06 StefanoLonardi

Thank you, indeed looks strange. Maybe high coverage confuses Flye, but I also suspect there might be some non-target reads in the sample.

I suggest trying two more runs: (i) metagenome mode (--meta), and (ii) normal mode with --asm-coverage 50, which uses only the longest 50x of reads for disjointig assembly. Please post the corresponding logs as well.

mikolmogorov avatar Jun 27 '19 21:06 mikolmogorov
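
To make the idea behind "use the longest 50x reads" concrete, here is a minimal Python sketch of that kind of selection: sort reads by length and keep the longest ones until their total length reaches the target coverage times the genome size. This is only an illustration of the concept described above, not Flye's actual implementation; the function name `select_longest_reads` and the toy numbers are made up for the example.

```python
def select_longest_reads(read_lengths, genome_size, target_coverage):
    """Keep the longest reads until their summed length reaches
    target_coverage * genome_size (hypothetical helper, not Flye code)."""
    target_bases = genome_size * target_coverage
    selected = []
    total = 0
    # Walk reads from longest to shortest and stop once the target is met.
    for length in sorted(read_lengths, reverse=True):
        if total >= target_bases:
            break
        selected.append(length)
        total += length
    return selected


# Toy example: a 1,000 bp "genome" and a 3x target (3,000 bases).
reads = [900, 200, 800, 100, 700, 600, 50]
subset = select_longest_reads(reads, genome_size=1000, target_coverage=3)
print(subset, sum(subset))
```

With roughly 1100x of input, as in the log above, a selection like this would discard most of the shorter reads before disjointig assembly.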

I just finished running Flye with the two runs that you suggested. Both of them completed, but the assembly with --asm-coverage 50 seems better (in terms of N50, total size, etc.). Thank you!

StefanoLonardi avatar Jul 18 '19 19:07 StefanoLonardi

Glad that it helped!

mikolmogorov avatar Jul 21 '19 17:07 mikolmogorov

The solution of normal mode with --asm-coverage 50 has helped in a similar case where lots of overlap is found but no disjointigs are assembled for a plasmid!

dgiguer avatar Nov 14 '19 19:11 dgiguer

@fenderglass Could you please take a quick look at the log output for the sample where flye fails to assemble disjointigs: gist.github.com/ptrebert/3964d66cd60af3e7a19d95d166707ed2

Since I am running flye with --asm-coverage 50 by default, I am a bit unsure how to proceed with this sample.

ptrebert avatar Feb 13 '20 10:02 ptrebert

@ptrebert Seems strange. My only guess would be that PacBio reads might not be properly split into subreads (we had a couple cases like that before). Try to process the reads with https://github.com/fenderglass/pbclip - it should tell you if there is a significant amount of "chimeric" subreads.

Alternatively, you can also try to run with --meta option if the reads turn out ok.

mikolmogorov avatar Feb 13 '20 20:02 mikolmogorov

@fenderglass Ok, thanks for pointing out your tool, I'll check that and get back to you.

ptrebert avatar Feb 14 '20 11:02 ptrebert

ping: testing Flye 2.7b-b1562 on sample with no disjointigs assembled - still running...

ptrebert avatar Feb 19 '20 08:02 ptrebert

@fenderglass For my problematic sample, flye 2.7b did not solve the issue (same "no disjointigs assembled"). I followed your suggestion and used your pbclip tool, which finished and reported the following:

Good: 15725667 chopped: 409754 bad: 662955

Could you help with interpreting these numbers (I may want to get in touch with the seq lab about this sample)? I'll try to assemble the output FASTA now with flye v2.7b, let's see what happens.

ptrebert avatar Feb 26 '20 07:02 ptrebert

@ptrebert

pbclip finds PacBio reads that were not properly split into subreads. Depending on the DNA library, the polymerase might make multiple passes over the fragment (which is used to produce high quality CCS reads). However, fragments in CLR libraries (at least from the assembly perspective) are not expected to be read multiple times, so that longer reads are produced. When multiple passes do happen, such reads should be split into subreads (each subread is a single polymerase pass). Typically this is handled by the PacBio software at the FASTQ generation stage.

The numbers suggest that ~40% of your reads have multiple polymerase passes. This is a lot (a typical value could be 1-2%) and suggests that there is indeed an issue with subread splitting. The chopped reads are those that pbclip was able to split into parts successfully. The bad reads are the reads with the same pattern that pbclip was not able to recover.

Feel free to run the latest Flye version on the output produced by pbclip - I think it should work now. You can also double check with the lab whether they performed subread splitting or have raw PacBio files to regenerate valid FASTQs.

mikolmogorov avatar Feb 27 '20 01:02 mikolmogorov

@fenderglass Thanks a lot for your detailed explanation. I am not sure, however, I can follow your argument about the 40% "bad" reads:

Total: 16798376
Bad = chopped + bad = 409754 + 662955 = 1072709
% bad = 1072709 / 16798376 ~6.4%

Am I missing something, or did you just misread the "bad" number as 6 million instead of 600k? In either case, thanks again for all your input, that is very valuable. I'll update this issue as soon as I have the 2.7b results for the corrected reads.

ptrebert avatar Feb 27 '20 08:02 ptrebert
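
The arithmetic in the comment above can be checked with a few lines of Python. This is just a sketch that assumes, as the comment does, that the three pbclip counts (good, chopped, bad) are mutually exclusive and add up to the total number of reads; the function name `pbclip_fractions` is hypothetical and not part of pbclip.

```python
def pbclip_fractions(good, chopped, bad):
    """Summarize pbclip counts, assuming the three categories are
    disjoint and together cover all reads (assumption from the thread)."""
    total = good + chopped + bad
    affected = chopped + bad  # reads showing the multi-pass pattern
    return {
        "total": total,
        "affected": affected,
        "affected_pct": 100.0 * affected / total,
        "chopped_pct": 100.0 * chopped / total,
        "bad_pct": 100.0 * bad / total,
    }


# Counts reported by pbclip earlier in this thread.
stats = pbclip_fractions(good=15725667, chopped=409754, bad=662955)
print(stats["total"], stats["affected"], round(stats["affected_pct"], 1))
```

Under that assumption the affected fraction comes out at about 6.4% of reads rather than 40%, which matches the figure given in the comment.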

probably last comment regarding this: even with the corrected reads (FASTA input now), flye 2.7b fails to assemble disjointigs. Seems like there is something else off about this data...

ptrebert avatar Mar 02 '20 13:03 ptrebert

@ptrebert I see - this could be tricky sometimes. Did you have any luck with other assemblers? Wtdbg2 might be a fast way to check.

mikolmogorov avatar Mar 02 '20 17:03 mikolmogorov

@fenderglass If I find the time, I'll try another assembler. For now, I asked the sequencing centre to double-check everything about this particular sample, let's see if they find something...

ptrebert avatar Mar 03 '20 09:03 ptrebert

@fenderglass A postdoc in the sequencing center that produced the problematic data in the first place ran a couple of tests with different input combinations, and also with wtdbg2 as a comparison. Since none of those test runs produced an assembly, it seems fairly clear that the problem is the data. Just out of curiosity, since we have all the flye logs for the different runs, is there any statistic in those log files that could tell us anything about the problem(s) in the data? To me, they all look pretty similar (well, they all failed), so just being thorough here...

ptrebert avatar Mar 17 '20 10:03 ptrebert

@ptrebert good to know, thanks for the update! At this early stage of assembly, not much could be inferred from the logs, I think. I guess if the log shows that "Overlap-based coverage" is reasonable (let's say, >10), but no disjointigs are produced, then there is a problem somewhere.

mikolmogorov avatar Mar 17 '20 20:03 mikolmogorov

No, they all show a zero for the "overlap-based coverage". Whatever the problem is, it's in the data then... thanks for all your support!

ptrebert avatar Mar 19 '20 13:03 ptrebert

Hello All, I am working on a Mycobacterium ulcerans genome which was sequenced with Oxford Nanopore technology. I am trying to do de novo assembly with Flye, but I run into a warning and the pipeline stops. The command I used is
flye --nano-raw filename.fa -o outdir -g 0.05m -t 34 -i 2

I get this message below

WARNING: Expected read coverage is 4744, the assembly is not guaranteed to be optimal in this setting. Are you sure that the genome size was entered correctly? Pipeline aborted

vappiah avatar May 21 '20 16:05 vappiah

@jotes35 your expected genome size is 50kb (0.05 Mb). It needs to be "5m", not "0.05m" (assuming you are aiming for 5 Mb genome).

mikolmogorov avatar May 23 '20 00:05 mikolmogorov
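
To see why the suffix matters, here is a small Python sketch that converts a size string such as "0.05m" or "5m" into base pairs and recomputes the expected coverage. The helper `parse_genome_size` is hypothetical and only mimics the notation used in this thread; it is not Flye's own parser. The total read length is not given in the post, so it is back-calculated from the reported 4744x at 50 kb, which makes it approximate.

```python
def parse_genome_size(text):
    """Convert '0.05m', '5m', '2.6g' or a plain integer string to base pairs.
    Hypothetical helper for illustration, not Flye's parser."""
    multipliers = {"m": 1_000_000, "g": 1_000_000_000}
    text = text.strip().lower()
    if text and text[-1] in multipliers:
        return int(round(float(text[:-1]) * multipliers[text[-1]]))
    return int(text)


entered = parse_genome_size("0.05m")   # what was passed with -g
intended = parse_genome_size("5m")     # what was suggested instead

# Approximate total bases, inferred from the reported coverage of 4744x.
approx_total_bases = 4744 * entered
coverage_at_intended = approx_total_bases / intended
print(entered, intended, round(coverage_at_intended, 2))
```

With the same reads, a 5 Mb genome size gives an expected coverage of roughly 47x instead of 4744x.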

Please, is there a way to know the expected genome size beforehand?

vappiah avatar May 23 '20 01:05 vappiah

@fenderglass is there a way to know the expected genome size before starting the assembly?

vappiah avatar May 26 '20 02:05 vappiah

@jotes35 Please check the FAQ - it provides some answers to your question. Let me know if anything is unclear.

mikolmogorov avatar May 27 '20 21:05 mikolmogorov

Hello, I have the same problem "No disjointigs were assembled". The expected genome size is 110M and my expected coverage is about 49. I tried --meta and different --asm-coverage values (since my overall coverage is smaller than 50x), but it didn't solve the issue. My N50 is quite high; could that be the reason I am getting the error? P40.pdf

eyayd avatar Jun 03 '20 09:06 eyayd

@eyad. This is what worked for me: I looked up the genome size of my organism (in my case 6.5 Mb). With that value, Flye still raised the flag. I reduced it to 5M and the message did not come up again.

vappiah avatar Jun 03 '20 09:06 vappiah

@eyayd could you post the log of the run with --meta option?

mikolmogorov avatar Jun 03 '20 20:06 mikolmogorov

Thank you very much for your prompt reply!

I am afraid I don't have it anymore. I am re-running it now, will post it asap. I am also currently trying the 2.7.1 version.

I have another nanopore run of the same genome which has less coverage and a slightly smaller N50. Flye finds fewer overlaps and runs with no error. I am posting the log of that sample, in case it is helpful. N6_G344.pdf

eyayd avatar Jun 03 '20 21:06 eyayd

@eyayd could you post the log of the run with --meta option?

P40.pdf

eyayd avatar Jun 05 '20 15:06 eyayd

@eyayd Something might not be right with your sample. Your expected genome size is 100m, and the coverage should be roughly 50x. But based on overlaps, the coverage is 600x - so this does not add up. No disjointigs were assembled, which means that even though there was sufficient coverage, there were no reads that could be joined into contiguous fragments. For example, this is what you might see from amplicon sequencing or PCR-based selection.

If you could share more details about your sample, I might have more insights.

mikolmogorov avatar Jun 07 '20 19:06 mikolmogorov
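
The checks discussed in this thread (compare the estimated coverage with the overlap-based coverage, and see whether any disjointigs were assembled) can be pulled out of a flye.log with a short script. The sketch below is an assumption-laden illustration: it relies on the log line wording quoted earlier in this issue (Flye 2.4.2), which may differ in other versions, and the function names and the ">10" cutoff (taken from the "let's say, >10" remark above) are not part of Flye.

```python
import re

# Line formats as they appear in the logs quoted in this thread.
LOG_PATTERNS = {
    "estimated_coverage": r"Estimated coverage: (\d+)",
    "overlap_coverage": r"Overlap-based coverage: (\d+)",
    "disjointigs": r"Assembled (\d+) disjointigs",
}


def summarize_flye_log(log_text):
    """Extract the three numbers; a value is None if its line is missing."""
    summary = {}
    for key, pattern in LOG_PATTERNS.items():
        match = re.search(pattern, log_text)
        summary[key] = int(match.group(1)) if match else None
    return summary


def overlaps_but_no_disjointigs(summary, min_overlap_coverage=10):
    """True when overlap coverage looks reasonable yet nothing was assembled."""
    cov = summary["overlap_coverage"]
    return cov is not None and cov > min_overlap_coverage and summary["disjointigs"] == 0


sample_log = """\
[2019-06-22 11:00:27] INFO: Estimated coverage: 1096
[2019-06-22 12:54:29] INFO: Overlap-based coverage: 1177
[2019-06-23 17:20:11] INFO: Assembled 0 disjointigs
"""
summary = summarize_flye_log(sample_log)
print(summary, overlaps_but_no_disjointigs(summary))
```

A large gap between the two coverage values, such as the 50x versus 600x case above, can be spotted by comparing `estimated_coverage` and `overlap_coverage` in the returned dictionary.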