tombo
Annotating raw file issues
Hello,
I ran resquiggle and got errors, so I tried to annotate the raw files with FASTQs, but I keep getting strange results. I tried twice: Tombo reads all the FAST5 read identifiers, but nothing is annotated. I'm using a published dataset, so I'm not sure whether the files themselves have problems. I downloaded the data from NCBI SRA; the FAST5 files are single-read FAST5 files, but I only have one FASTQ file, which I extracted with fastq-dump from the SRA Toolkit.
If the FASTQ file is the problem, what can I do next? Do I need to basecall the data myself? Any feedback would be appreciated.
GenomeFASTA="/workspace/nanopore_mRNA/AnnotationData/CDS/hg19/GENCODE_V38_hg19_Transcripts.fa"
Main="/workspace/nanopore_mRNA/Sumin/HEK293T_ERR4706161/WT-rep1-00"
tombo resquiggle --processes 128 ${Main}/fast5 ${GenomeFASTA}
[04:55:20] Final unsuccessful reads summary (100.0% reads unsuccessfully processed; 1040661 total reads):
100.0% (1040661 reads) : Fastq slot not present in --basecall-group
[04:55:20] Saving Tombo reads index to file.
GenomeFASTA="/workspace/nanopore_mRNA/AnnotationData/CDS/hg19/GENCODE_V38_hg19_Transcripts.fa"
Main="/workspace/nanopore_mRNA/Sumin/HEK293T_ERR4706161/WT-rep1-00"
tombo preprocess annotate_raw_with_fastqs --fast5-basedir ${Main}/fast5 \
--fastq-filenames ${Main}/fastq/WT-rep1.fastq \
--processes 128
[09:34:03] Preparing reads and extracting read identifiers.
****** WARNING ****** Basecalls exsit in specified slot for some reads. Set --overwrite option to overwrite these basecalls.
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1040661/1040661 [02:56<00:00, 5903.13it/s]
[09:37:03] Annotating FAST5s with sequence from FASTQs.
****** WARNING ****** Some FASTQ records contain read identifiers not found in any FAST5 files or sequencing summary files.
0it [00:28, ?it/s]
[09:37:32] Added sequences to a total of 0 reads.
GenomeFASTA="/workspace/nanopore_mRNA/AnnotationData/CDS/hg19/GENCODE_V38_hg19_Transcripts.fa"
Main="/workspace/nanopore_mRNA/Sumin/HEK293T_ERR4706161/WT-rep1-00"
tombo preprocess annotate_raw_with_fastqs --fast5-basedir ${Main}/fast5 \
--fastq-filenames ${Main}/fastq/WT-rep1.fastq \
--processes 128 --overwrite
[09:42:25] Preparing reads and extracting read identifiers.
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1040661/1040661 [03:06<00:00, 5570.82it/s]
[09:45:58] Annotating FAST5s with sequence from FASTQs.
****** WARNING ****** Some FASTQ records contain read identifiers not found in any FAST5 files or sequencing summary files.
0%| | 0/1040661 [00:25<?, ?it/s]
[09:46:26] Added sequences to a total of 0 reads.
******************** WARNING ********************
Not all read ids from FAST5s or sequencing summary files were found in FASTQs.
This can result from reads that failed basecalling or if full sets of FAST5s/sequence summaries are not processed with full sets of FASTQs.
Regards, Sumin
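Both warnings in the logs above mention read identifiers that could not be matched, so one thing worth checking is whether the IDs in the FASTQ headers are the same strings as the read IDs stored in the FAST5 files. FASTQ files exported from SRA often carry the run accession plus a counter as the identifier instead of the original read ID; whether that is the case here is an assumption to verify, not something the logs prove. The sketch below is not part of Tombo. `fastq_ids` and `count_shared_ids` are hypothetical helper names, and the plain-text list of FAST5 read IDs (one per line) is assumed to have been produced separately.

```shell
# Print the read identifier from each FASTQ header line: the first
# whitespace-delimited token, without the leading "@".
# Assumes a plain 4-lines-per-record FASTQ (no wrapped sequences).
fastq_ids() {
    awk 'NR % 4 == 1 { sub(/^@/, "", $1); print $1 }' "$1"
}

# Count how many distinct FASTQ identifiers also appear in a text file
# of FAST5 read IDs (one ID per line). The ID list is a hypothetical
# input; this function does not read FAST5 files itself.
count_shared_ids() {
    # grep -c exits non-zero when the count is 0, hence "|| true".
    fastq_ids "$1" | sort -u | grep -Fxc -f "$2" || true
}
```

For example, `fastq_ids ${Main}/fastq/WT-rep1.fastq | head` shows the identifiers Tombo would try to match. A count of 0 from `count_shared_ids` would be consistent with the "Added sequences to a total of 0 reads" message.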
What I usually do is basecall the data again with the --fast5_out option. That way the FAST5 files already contain the sequence information. I then convert them to single-read FAST5 files and proceed with the resquiggle command.
Not possible with guppy.
It is possible with Guppy, but you need an earlier version (I would recommend any version before 6.3), because the --fast5_out flag was deprecated in recent versions.
I would recommend switching to Remora for raw signal alignment. It uses standard POD5 and BAM files as input, which greatly simplifies these issues.