ERROR: Can't identify input file type

Open DaRinker opened this issue 1 year ago • 7 comments

I have a "raw" ONT bam file (demultiplexed, with adaptors trimmed, using ONT's new Dorado basecaller).

I wanted to do a quick assembly with flye but am running into an issue that I can't find documented anywhere.

My workflow is as follows:

  1. Convert the ubam to fastq. I tried (a) the BEDTools 'bamtofastq' tool, (b) samtools bam2fq, and (c) the "--emit-fastq" option in dorado.
  2. Assemble with flye using the following command: flye --nano-raw barcode02.bam_trimmed.fastq --out-dir <myoutputpath> --threads 16

All three fastq inputs result in the same error from flye: ERROR: Can't identify input file type

The log file shows this:

[2024-08-19 13:30:55] root: INFO: Starting Flye 2.9.4-b1799
[2024-08-19 13:30:55] root: DEBUG: Cmd: /bin/Flye/bin/flye --nano-raw barcode02.bam_trimmed.fastq --out-dir --threads 16
[2024-08-19 13:30:55] root: DEBUG: Python version: 3.9.12 (main, Apr 5 2022, 06:56:58) [GCC 7.5.0]
[2024-08-19 13:30:55] root: INFO: >>>STAGE: configure
[2024-08-19 13:30:55] root: INFO: Configuring run
[2024-08-19 13:31:15] root: INFO: Total read length: 3757884059
[2024-08-19 13:31:15] root: INFO: Reads N50/N90: 6352 / 1677
[2024-08-19 13:31:15] root: INFO: Minimum overlap set to 2000
[2024-08-19 13:31:15] root: INFO: >>>STAGE: assembly
[2024-08-19 13:31:15] root: INFO: Assembling disjointigs
[2024-08-19 13:31:15] root: DEBUG: -----Begin assembly log------
[2024-08-19 13:31:15] root: DEBUG: Running: flye-modules assemble --reads barcode02.bam_trimmed.fastq --out-asm /00-assembly/draft_assembly.fasta --config /bin/Flye/flye/config/bin_cfg/asm_raw_reads.cfg --log /flye.log --threads 16 --min-ovlp 2000
[2024-08-19 13:31:15] DEBUG: Build date: Aug 19 2024 13:08:56
[2024-08-19 13:31:15] DEBUG: Total RAM: 503 Gb
[2024-08-19 13:31:15] DEBUG: Available RAM: 489 Gb
[2024-08-19 13:31:15] DEBUG: Total CPUs: 64
[2024-08-19 13:31:15] DEBUG: Loading /bin/Flye/flye/config/bin_cfg/asm_raw_reads.cfg
[2024-08-19 13:31:15] DEBUG: Loading /bin/Flye/flye/config/bin_cfg/asm_defaults.cfg
[2024-08-19 13:31:15] DEBUG: big_genome_threshold=29000000
[2024-08-19 13:31:15] DEBUG: meta_read_filter_kmer_freq=100
[2024-08-19 13:31:15] DEBUG: chain_large_gap_penalty=2
[2024-08-19 13:31:15] DEBUG: chain_small_gap_penalty=0.5
[2024-08-19 13:31:15] DEBUG: chain_gap_jump_threshold=100
[2024-08-19 13:31:15] DEBUG: max_jump_gap=500
[2024-08-19 13:31:15] DEBUG: max_coverage_drop_rate=5
[2024-08-19 13:31:15] DEBUG: max_extensions_drop_rate=5
[2024-08-19 13:31:15] DEBUG: chimera_window=100
[2024-08-19 13:31:15] DEBUG: chimera_overhang=1000
[2024-08-19 13:31:15] DEBUG: min_reads_in_disjointig=4
[2024-08-19 13:31:15] DEBUG: max_inner_reads=10
[2024-08-19 13:31:15] DEBUG: max_inner_fraction=0.25
[2024-08-19 13:31:15] DEBUG: aggressive_dup_filter=1
[2024-08-19 13:31:15] DEBUG: max_separation=500
[2024-08-19 13:31:15] DEBUG: unique_edge_length=50000
[2024-08-19 13:31:15] DEBUG: min_repeat_res_support=0.51
[2024-08-19 13:31:15] DEBUG: out_paths_ratio=5
[2024-08-19 13:31:15] DEBUG: graph_cov_drop_rate=5
[2024-08-19 13:31:15] DEBUG: coverage_estimate_window=100
[2024-08-19 13:31:15] DEBUG: max_bubble_length=50000
[2024-08-19 13:31:15] DEBUG: loop_coverage_rate=1.5
[2024-08-19 13:31:15] DEBUG: repeat_edge_cov_mult=1.75
[2024-08-19 13:31:15] DEBUG: weak_detach_rate=5
[2024-08-19 13:31:15] DEBUG: tip_coverage_rate=2
[2024-08-19 13:31:15] DEBUG: tip_length_rate=2
[2024-08-19 13:31:15] DEBUG: output_gfa_before_rr=1
[2024-08-19 13:31:15] DEBUG: remove_alt_edges=0
[2024-08-19 13:31:15] DEBUG: low_cutoff_warning=1
[2024-08-19 13:31:15] DEBUG: kmer_size=17
[2024-08-19 13:31:15] DEBUG: use_minimizers=0
[2024-08-19 13:31:15] DEBUG: reads_base_alignment=0
[2024-08-19 13:31:15] DEBUG: meta_read_top_kmer_rate=0.40
[2024-08-19 13:31:15] DEBUG: maximum_jump=1500
[2024-08-19 13:31:15] DEBUG: maximum_overhang=1500
[2024-08-19 13:31:15] DEBUG: repeat_kmer_rate=100
[2024-08-19 13:31:15] DEBUG: assemble_ovlp_divergence=0.10
[2024-08-19 13:31:15] DEBUG: assemble_divergence_relative=1
[2024-08-19 13:31:15] DEBUG: repeat_graph_ovlp_divergence=0.08
[2024-08-19 13:31:15] DEBUG: read_align_ovlp_divergence=0.25
[2024-08-19 13:31:15] DEBUG: hpc_scoring_on=0
[2024-08-19 13:31:15] DEBUG: add_unassembled_reads=0
[2024-08-19 13:31:15] DEBUG: extend_contigs_with_repeats=0
[2024-08-19 13:31:15] DEBUG: min_read_cov_cutoff=3
[2024-08-19 13:31:15] DEBUG: short_tip_length=20000
[2024-08-19 13:31:15] DEBUG: long_tip_length=100000
[2024-08-19 13:31:15] DEBUG: Running with k-mer size: 17
[2024-08-19 13:31:15] DEBUG: Running with minimum overlap 2000
[2024-08-19 13:31:15] DEBUG: Metagenome mode: N
[2024-08-19 13:31:15] DEBUG: Short mode: N
[2024-08-19 13:31:15] INFO: Reading sequences
[2024-08-19 13:31:15] ERROR: Can't identify input file type

DaRinker avatar Aug 19 '24 18:08 DaRinker

I have now duplicated this error with another ONT bam-to-fastq file (again using dorado's --emit-fastq option)

Visually, I don't see anything obviously strange about the fastq:

$ head barcode02.bam_trimmed.fastq
@fcd84ea3-85e5-437a-ba3e-900c894e41c0 st:Z:2024-05-14T17:52:23.517+00:00 RG:Z:7f7854be46aac37b85e749c5c1b729e69ac43456_dna_r10.4.1_e8.2_400bps_hac@v4.3.0_SQK-NBD114-24_barcode02
AAGGTTAAACAGACGACTACAAACGGAATCGACAGCACCTGCACCAACCATACCTAATAATAATATTATTGAACTTATTATTAATCATATTGAATAACTAGTATACATTATGTTTCCTATTCCTGTTATATGAGTAAATTCTACTAATGAACTATCTCAACCTTTACTACTTGCATAAGCTATTTCTTGTTTTAAATCTACTATCTGAAAAAATTCATTATCTAATTGAACATTATATACTTCAGAAAATGAATTTGATAAATAAGAAACAATTGTATTATCTGTTAAATTAGAAGGTAATACTTGTCCTACAATATAGTAGAAAAGTAAAACTGTTAAAACAGCTAAAGGTATATCATTATTAGTTTTCACTTAAAAGTTCAGATATTCTAATATTTATTAACATTAATATGAATAAGAATAAAATTGATACAGCTCCTACATACACTAAAATATAAGATAATCCTATGTAATTATATCCTACTAATATTAATAAACCAGCAATAAGTCATGATTATGGTCTATTAATTACAATTATTCTCCTGGTGGAGTCATGTCAAGCAGCCATTCTGCTCTTACTGCGAATCCGTGCTCCGTAAATATAAAGCAGCAGCGGGATAGGGATCATCAATACTCCCACACATCCAAGCAGTGTGCAGGCCCATTCAACACCTAATCCATCAAACTGTATCATTTCAGTCAGTTTCATCATTTGAAAGCCCAAGGTTTGGTGACTGACCATATACGAGGTGCTGTCGATTCCGTTTGTAGTCGTCTGTTTTAACCTTAGCATACGTATGG
+
BFGJKHECFAHPLHDEBBHSFSKEK98D9<898@BFHFGEEFJIGKFNOLSLSFSGSIKKJKSJJSH??@>;;;EFCFCDILSSSSFSSSJJSLOSGFD@ACBFJFJEJQSSSS?>SSSLMLMGLOHGHSHSSIIISKLLHFGELISSSFJSFMLKHHISGJGJMCISJFSKHISKLLHNNSJSKFGE@?ADNEPSJFGGHJDE.,'''SJGDFIFILKSGGDDDLSGPRMSSOSNIICLG2222@KKS?>??>HGSJA778SGIEPHCHKEFSEIMHQSAABADBF@9IA??@BILPKBA;::9<<=E631158=8899;=@IKC><>CGEG>;,(+DJAIECDFDLIJIIIJGGFNMSMJCABAAC=;<>>77SLSNLKOJINISKMHLNSJSOPLSHGSSSGNC@AAB/-+)+&'+,59JSJFGFIJJGCJIMGSOMSHRMSJSDGDJF>@<>ACB9B5FLQLFCBCDKMISRJGD@KGDFCE;3,'=?BBHHGEIHJSGLEFOIFHFEFRQ8673222=;?>5555D?CC@HFEDKGHSNHSJSFOKOJSGIFIDDFSGCDDCFE<HSFJIKSSKKEILJISIISFIJGDDMGEHRSG;:::LJPJJPHGFJGFIGEHJHFKMMCABDHSNHRGFBBDHPKSHGIE@@?COGLGKSIE666/./+,+)10036662,--/033//--)
@41c3fb3a-1820-407f-ba8d-4bc9a28b883e st:Z:2024-05-14T17:53:59.847+00:00 RG:Z:7f7854be46aac37b85e749c5c1b729e69ac43456_dna_r10.4.1_e8.2_400bps_hac@v4.3.0_SQK-NBD114-24_barcode02
AGTGTTATGTACACTGATTCAGTTACATTGTGCTTTGCTAAGGTTAAACAGACGACTACAAACGGAATCGACAGCACCTATATGTATACAGCCCAGATGGCCCATGCCTAGAGCGCTATCCAGGGGCGCGCGCTGCCGACTGGATATCTAGAGAACCATGAGCAGAATGAGGTGCTGTCGATTCCGTTTGTAGTCGTCTGTTTAACCTTAACAATGGTA
+
.4)'(&$##%%%%%&&&&'&%##$##$(&%&$)-2465?=AD?><;<BDEFGE@@CIAABBEPDFFFOSILFEDHHFAA565656<?BC<GMICSLSJFJEFOHDHHSIFHFJSPABDJFFDECBDEFNHIFEDDKSHJIOFFB>?CCCB?10++,0.+-+,7689AHHCJ=-,,.;8/7987B@?A===AAA;<;8:99;4%#$''(+-
@d81f95f0-66fa-49e7-9c39-403b99c7d5f2 st:Z:2024-05-14T17:52:04.836+00:00 RG:Z:7f7854be46aac37b85e749c5c1b729e69ac43456_dna_r10.4.1_e8.2_400bps_hac@v4.3.0_SQK-NBD114-24_barcode02
AAGGTTAAACAGACGACTACAAACGGAATCGACAGCACCTCAATCAGTCCAGCTGCTGGCCCCAATTGATCTGAATGAATGTAATAAGGAAATGCAATGTATTATTCAGAGAAAGATCAAGAGCAAATACTCTTGCAAGCTAAGATGAGATTGATACAGAGAATCAAGCGTCATCAATCAGCAGGGTCTTCTGCAACTATATACCGTCTTTGCCCCGAGGTGCTGTCGATTCCGTTTGTAGTCATCTGTTTAACCTTAGCGATA

DaRinker avatar Aug 19 '24 19:08 DaRinker

UPDATE: Still stuck.

Everything about the fastq input seems okay (I'm able to trim it with chopper, quality check it with NanoPlot, and assemble it with raven). But flye continues to throw the same error, even if I use the chopper output.
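
For anyone who wants to rule out the file contents independently of other tools, here is a minimal sketch of a structural FASTQ check. It is not part of Flye; the function names `open_text` and `check_fastq` are made up for illustration, and it assumes the common four-lines-per-record layout (no wrapped sequence lines).

```python
import gzip
from itertools import islice


def open_text(path):
    """Open a plain or gzip-compressed file as text, decided by the .gz suffix."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def check_fastq(path, max_records=1000):
    """Return a list of problems found in the first max_records records.

    Assumes four lines per record: header, sequence, separator, quality.
    """
    problems = []
    with open_text(path) as handle:
        for index in range(1, max_records + 1):
            record = [line.rstrip("\r\n") for line in islice(handle, 4)]
            if not record:
                break  # clean end of file
            if len(record) < 4:
                problems.append(
                    f"record {index}: truncated, only {len(record)} of 4 lines"
                )
                break
            header, sequence, separator, quality = record
            if not header.startswith("@"):
                problems.append(f"record {index}: header does not start with '@'")
            if not separator.startswith("+"):
                problems.append(f"record {index}: separator does not start with '+'")
            if len(sequence) != len(quality):
                problems.append(
                    f"record {index}: sequence length {len(sequence)} "
                    f"!= quality length {len(quality)}"
                )
    return problems
```

An empty list from `check_fastq("barcode02.bam_trimmed.fastq")` would only say that the first records are well formed; it would not explain the Flye error by itself.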

Maybe there's a default flye option that needs disabling/changing? But this is starting to look like a bug to me.

DaRinker avatar Aug 23 '24 15:08 DaRinker

@DaRinker that seems very strange. It has something to do with how the file is named, but from the info that you sent it seems ok to me. This error is thrown by the simple function that identifies the fasta / fastq suffix. There is a similar function in Python that runs earlier, and it didn't throw the error.

Can you send the full log file? Did you compile Flye from source or use a bioconda installation?
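
To illustrate the kind of suffix check described above, here is a rough sketch. It is not Flye's actual code: the function name `identify_read_format` and the accepted suffix sets are assumptions made for illustration, and only the error message is taken from this thread.

```python
import os

# Assumed suffix sets for illustration; Flye's real lists may differ.
FASTA_SUFFIXES = {"fasta", "fa"}
FASTQ_SUFFIXES = {"fastq", "fq"}


def identify_read_format(file_path):
    """Guess 'fasta' or 'fastq' from the file name, ignoring a trailing .gz."""
    name = os.path.basename(file_path).lower()
    if name.endswith(".gz"):
        name = name[: -len(".gz")]
    # Take the text after the last dot, or nothing if there is no dot.
    suffix = name.rsplit(".", 1)[-1] if "." in name else ""
    if suffix in FASTA_SUFFIXES:
        return "fasta"
    if suffix in FASTQ_SUFFIXES:
        return "fastq"
    raise ValueError("Can't identify input file type")
```

Under these assumptions, both `barcode02.bam_trimmed.fastq` and `SQK-NBD114-24_barcode02_filtered.fastq.gz` are classified as fastq, so a check of this shape applied to the file name alone would not reject either of them.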

mikolmogorov avatar Aug 26 '24 13:08 mikolmogorov

Thanks for the reply.

I compiled flye from source and it seems to run correctly at least for all of our "older" ONT fastq files (even just published a paper where I used flye extensively, so it had been working flawlessly).

And yes, I noticed that the code parses the filename to ID the file type, so I tried renaming the file to that of an old ONT fastq file that flye likes. Still got the same error though (!?)

Here's the complete log file for a run that I just attempted:

$ cat flye_assemblies/barcode02/flye.log
[2024-08-26 09:24:24] root: INFO: Starting Flye 2.9.4-b1799
[2024-08-26 09:24:24] root: DEBUG: Cmd: /bin/Flye/bin/flye --nano-raw SQK-NBD114-24_barcode02_filtered.fastq.gz --out-dir /myworkingdirectory/flye_assemblies/barcode02/ --threads 16
[2024-08-26 09:24:24] root: DEBUG: Python version: 3.9.12 (main, Apr 5 2022, 06:56:58) [GCC 7.5.0]
[2024-08-26 09:24:24] root: INFO: >>>STAGE: configure
[2024-08-26 09:24:24] root: INFO: Configuring run
[2024-08-26 09:25:02] root: INFO: Total read length: 1774370109
[2024-08-26 09:25:02] root: INFO: Reads N50/N90: 6363 / 2439
[2024-08-26 09:25:02] root: INFO: Minimum overlap set to 2000
[2024-08-26 09:25:02] root: INFO: >>>STAGE: assembly
[2024-08-26 09:25:02] root: INFO: Assembling disjointigs
[2024-08-26 09:25:02] root: DEBUG: -----Begin assembly log------
[2024-08-26 09:25:02] root: DEBUG: Running: flye-modules assemble --reads /panfs/accrepfs.vampire/myworkingdirectory/SQK-NBD114-24_barcode02_filtered.fastq.gz --out-asm /myworkingdirectory/flye_assemblies/barcode02/00-assembly/draft_assembly.fasta --config /bin/Flye/flye/config/bin_cfg/asm_raw_reads.cfg --log /myworkingdirectory/flye_assemblies/barcode02/flye.log --threads 16 --min-ovlp 2000
[2024-08-26 09:25:02] DEBUG: Build date: Aug 19 2024 13:08:56
[2024-08-26 09:25:02] DEBUG: Total RAM: 503 Gb
[2024-08-26 09:25:02] DEBUG: Available RAM: 489 Gb
[2024-08-26 09:25:02] DEBUG: Total CPUs: 64
[2024-08-26 09:25:02] DEBUG: Loading /bin/Flye/flye/config/bin_cfg/asm_raw_reads.cfg
[2024-08-26 09:25:02] DEBUG: Loading /bin/Flye/flye/config/bin_cfg/asm_defaults.cfg
[2024-08-26 09:25:02] DEBUG: big_genome_threshold=29000000
[2024-08-26 09:25:02] DEBUG: meta_read_filter_kmer_freq=100
[2024-08-26 09:25:02] DEBUG: chain_large_gap_penalty=2
[2024-08-26 09:25:02] DEBUG: chain_small_gap_penalty=0.5
[2024-08-26 09:25:02] DEBUG: chain_gap_jump_threshold=100
[2024-08-26 09:25:02] DEBUG: max_jump_gap=500
[2024-08-26 09:25:02] DEBUG: max_coverage_drop_rate=5
[2024-08-26 09:25:02] DEBUG: max_extensions_drop_rate=5
[2024-08-26 09:25:02] DEBUG: chimera_window=100
[2024-08-26 09:25:02] DEBUG: chimera_overhang=1000
[2024-08-26 09:25:02] DEBUG: min_reads_in_disjointig=4
[2024-08-26 09:25:02] DEBUG: max_inner_reads=10
[2024-08-26 09:25:02] DEBUG: max_inner_fraction=0.25
[2024-08-26 09:25:02] DEBUG: aggressive_dup_filter=1
[2024-08-26 09:25:02] DEBUG: max_separation=500
[2024-08-26 09:25:02] DEBUG: unique_edge_length=50000
[2024-08-26 09:25:02] DEBUG: min_repeat_res_support=0.51
[2024-08-26 09:25:02] DEBUG: out_paths_ratio=5
[2024-08-26 09:25:02] DEBUG: graph_cov_drop_rate=5
[2024-08-26 09:25:02] DEBUG: coverage_estimate_window=100
[2024-08-26 09:25:02] DEBUG: max_bubble_length=50000
[2024-08-26 09:25:02] DEBUG: loop_coverage_rate=1.5
[2024-08-26 09:25:02] DEBUG: repeat_edge_cov_mult=1.75
[2024-08-26 09:25:02] DEBUG: weak_detach_rate=5
[2024-08-26 09:25:02] DEBUG: tip_coverage_rate=2
[2024-08-26 09:25:02] DEBUG: tip_length_rate=2
[2024-08-26 09:25:02] DEBUG: output_gfa_before_rr=1
[2024-08-26 09:25:02] DEBUG: remove_alt_edges=0
[2024-08-26 09:25:02] DEBUG: low_cutoff_warning=1
[2024-08-26 09:25:02] DEBUG: kmer_size=17
[2024-08-26 09:25:02] DEBUG: use_minimizers=0
[2024-08-26 09:25:02] DEBUG: reads_base_alignment=0
[2024-08-26 09:25:02] DEBUG: meta_read_top_kmer_rate=0.40
[2024-08-26 09:25:02] DEBUG: maximum_jump=1500
[2024-08-26 09:25:02] DEBUG: maximum_overhang=1500
[2024-08-26 09:25:02] DEBUG: repeat_kmer_rate=100
[2024-08-26 09:25:02] DEBUG: assemble_ovlp_divergence=0.10
[2024-08-26 09:25:02] DEBUG: assemble_divergence_relative=1
[2024-08-26 09:25:02] DEBUG: repeat_graph_ovlp_divergence=0.08
[2024-08-26 09:25:02] DEBUG: read_align_ovlp_divergence=0.25
[2024-08-26 09:25:02] DEBUG: hpc_scoring_on=0
[2024-08-26 09:25:02] DEBUG: add_unassembled_reads=0
[2024-08-26 09:25:02] DEBUG: extend_contigs_with_repeats=0
[2024-08-26 09:25:02] DEBUG: min_read_cov_cutoff=3
[2024-08-26 09:25:02] DEBUG: short_tip_length=20000
[2024-08-26 09:25:02] DEBUG: long_tip_length=100000
[2024-08-26 09:25:02] DEBUG: Running with k-mer size: 17
[2024-08-26 09:25:02] DEBUG: Running with minimum overlap 2000
[2024-08-26 09:25:02] DEBUG: Metagenome mode: N
[2024-08-26 09:25:02] DEBUG: Short mode: N
[2024-08-26 09:25:02] INFO: Reading sequences
[2024-08-26 09:25:02] ERROR: Can't identify input file type
-----------End assembly log------------
[2024-08-26 09:25:02] root: ERROR: Command '['flye-modules', 'assemble', '--reads', '/panfs/accrepfs.vampire/myworkingdirectory/SQK-NBD114-24_barcode02_filtered.fastq.gz', '--out-asm', '/myworkingdirectory/flye_assemblies/barcode02/00-assembly/draft_assembly.fasta', '--config', '/bin/Flye/flye/config/bin_cfg/asm_raw_reads.cfg', '--log', '/myworkingdirectory/flye_assemblies/barcode02/flye.log', '--threads', '16', '--min-ovlp', '2000']' returned non-zero exit status 1.
[2024-08-26 09:25:02] root: ERROR: Pipeline aborted

DaRinker avatar Aug 26 '24 14:08 DaRinker

Ah! I might have (partially) identified the problem @mikolmogorov
There's something about my working directory path that is contributing to the issue.

If I take a "good" fastq, one that flye has historically liked, and put it in the same directory that contains the problematic fastq file, suddenly flye can't identify it either. I've tried both relative and absolute paths and get the error both ways.

So, I then tried moving my problematic fastq to a different directory and suddenly flye is happy with it. It's very weird but I can live with it.

And, while it's probably still technically a bug of some sort, I can't say that it's not due to some idiosyncrasy in my cluster's architecture that is introducing an edge-case failure here.

DaRinker avatar Aug 26 '24 14:08 DaRinker

Really puzzling - the function that throws the error is very simple and hasn't been changed in years. It's just parsing the file name. My only guess is maybe your system is somehow using a different character encoding for file paths.
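
One way to test the encoding guess is to look at the raw bytes of the path as the operating system sees them. The sketch below is only an illustration (the helper name `suspicious_path_bytes` is made up); it reports every byte outside printable ASCII, which would reveal non-ASCII characters or stray control characters hiding in a directory name.

```python
import os


def suspicious_path_bytes(path):
    """Return (offset, byte value) pairs for bytes outside printable ASCII.

    os.fsencode converts a str path to bytes using the filesystem encoding
    and passes bytes paths through unchanged.
    """
    raw = os.fsencode(path)
    return [(offset, value) for offset, value in enumerate(raw)
            if value < 0x20 or value > 0x7E]


if __name__ == "__main__":
    # Inspect the absolute form of the current working directory.
    cwd = os.path.abspath(os.getcwd())
    print(cwd, suspicious_path_bytes(cwd))
```

An empty list means the path is plain printable ASCII, which would make an encoding problem in the path itself less likely.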

mikolmogorov avatar Aug 27 '24 12:08 mikolmogorov

Thanks. Agree that it could be system specific but cannot confirm. For now, not having the output directory be a subdirectory of the fastq file's location is working.

Flye is now running as expected!
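
For scripted runs, the workaround above can be checked before launching an assembly. This is a small sketch under the assumption of POSIX-style paths; the helper name `out_dir_is_under_reads_dir` is hypothetical, and the check only encodes the observation from this thread, not a documented Flye requirement.

```python
import os


def out_dir_is_under_reads_dir(reads_path, out_dir):
    """Return True if out_dir is the reads file's directory or lies below it.

    Both paths are resolved with os.path.realpath so that symlinked mount
    points compare consistently.
    """
    reads_dir = os.path.dirname(os.path.realpath(reads_path))
    out_abs = os.path.realpath(out_dir)
    return os.path.commonpath([reads_dir, out_abs]) == reads_dir
```

For example, with the reads at `/data/run1/reads.fastq`, an output directory of `/data/run1/flye_assemblies/barcode02` is flagged, while `/scratch/flye_out` is not.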

DaRinker avatar Aug 27 '24 22:08 DaRinker