Basecalling issues of recovered reads
Issue Report
Please describe the issue:
I recovered raw reads from a crashed MinKNOW run using the procedure described here. The recovered reads total approximately 650 GB of pod5 files. However, when basecalling these reads with Dorado (v0.9.5.0), I obtained an unexpectedly small BAM file (~18 GB). Typically, a dataset of similar size yields a BAM file over 60 GB. Additionally, Dorado initially returned an error indicating it couldn't locate the chemistry kit. I used pod5-api to copy run information from intact reads written before the crash, which resolved the chemistry error, but the BAM file size issue persists.
I expect a BAM file size consistent with previous runs, around or greater than 60 GB.
Steps to reproduce the issue:
- Recover raw reads from a crashed MinKNOW run as per Nanopore’s support page.
- Edit recovered pod5 files with pod5-api to copy run metadata from intact reads (to resolve chemistry identification errors).
- Basecall the edited pod5 files using Dorado (version 0.9.5.0) with modified base calling enabled (5mCG_5hmCG).
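Before copying run metadata between reads, it can help to see how many reads actually lack kit information. The following is a minimal sketch, assuming the `pod5` Python package is installed and that its `Reader.reads()` records expose `run_info.sequencing_kit`; the function names `tally_kits` and `tally_kits_in_dir` and the `<missing>` label are hypothetical and used only for illustration.

```python
from collections import Counter
from pathlib import Path


def tally_kits(reads):
    """Count reads per sequencing kit; empty or absent values are grouped as '<missing>'."""
    counts = Counter()
    for read in reads:
        kit = getattr(read.run_info, "sequencing_kit", "") or "<missing>"
        counts[kit] += 1
    return counts


def tally_kits_in_dir(pod5_dir):
    """Tally kits across every .pod5 file under pod5_dir (assumes `pip install pod5`)."""
    import pod5  # imported lazily so tally_kits works without the package

    total = Counter()
    for path in sorted(Path(pod5_dir).rglob("*.pod5")):
        with pod5.Reader(path) as reader:
            total.update(tally_kits(reader.reads()))
    return total
```

Running `tally_kits_in_dir` on the unedited and the edited folders would show whether the missing kit value affects a handful of reads or most of the dataset.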
Run environment:
-Dorado version: 0.9.5.0
-Dorado command: dorado basecaller ./[email protected] /path/to/recovered_pod5_edited/ --modified-bases 5mCG_5hmCG --reference /path/to/mouse_c57_bl6.sorted.fa --device cuda:all > output.bam
-Operating system: Linux (Ubuntu)
-Hardware (CPUs, Memory, GPUs): NVIDIA L40 GPUs, 1.5 TB RAM, 150 CPU cores
Source data type: pod5
Source data location: Local SSD storage
Details about data:
Flow cell: R10.4.1
Kit: SQK-LSK114
Total dataset size: Approximately 650 GB
Hi @uribertocchi,
It's unusual to have to edit the pod5 files to restore missing metadata after recovery. This suggests that more metadata may be missing and that the pod5 files are corrupted, producing poor-quality basecalled reads which are then discarded.
- How many reads are in the input versus the recovered pod5s?
- Do you have any logs from the recovery process?
- Were the original reads moved before being recovered?
- Do you have a snippet of the pod5 view before and after adding the missing metadata?
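One way to answer the questions above is to save the `pod5 view` output for the unedited and edited files and compare the two tables. This is a rough sketch, assuming each output was saved as tab-separated text with a header row and a `read_id` column; `load_table` and `diff_tables` are hypothetical helper names, and the columns compared are whatever the saved tables contain.

```python
import csv
import io


def load_table(text, key="read_id"):
    """Parse tab-separated text with a header row into {key value: row dict}."""
    rows = csv.DictReader(io.StringIO(text), delimiter="\t")
    return {row[key]: row for row in rows}


def diff_tables(before_text, after_text, key="read_id"):
    """Report reads present in only one table and the columns that changed per read."""
    before = load_table(before_text, key)
    after = load_table(after_text, key)
    changed = {}
    for read_id in before.keys() & after.keys():
        cols = sorted(
            c for c in before[read_id]
            if before[read_id][c] != after[read_id].get(c)
        )
        if cols:
            changed[read_id] = cols
    return {
        "only_before": sorted(before.keys() - after.keys()),
        "only_after": sorted(after.keys() - before.keys()),
        "changed": changed,
    }
```

The lengths of `only_before` and `only_after` give the read-count difference, and `changed` shows which fields the metadata edit actually touched.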
Kind regards, Rich
Could you also try to recover each file separately? There may be an issue with recovering multiple files at once.
Hi @HalfPhoton,
Thank you for your quick response.
The recovered files were moved from the original folder, which was generated approximately a year ago and set aside due to errors. Unfortunately, I no longer have logs from the original recovery process. I still retain the original unedited pod5 files. Re-recovering the reads isn't possible at this stage; the current pod5 files are the only available data. Would it still be helpful to provide snippets of the pod5 file views before and after metadata editing?
Kind regards, Uri
@uribertocchi,
It's possible that some of the temporary hidden files were missed when the original files were moved. The temporary files, which contain the partial read metadata, have names starting with a dot (.), so they may have been hidden from view and left behind.
This is possibly why some of the reads are missing the sequencing kit data.
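If the original run folder still exists, a quick scan for dot-prefixed files can show whether anything was left behind. This is a minimal sketch using only the Python standard library; `find_hidden_files` is a hypothetical helper, and it makes no assumption about how the temporary files are named beyond the leading dot.

```python
import os


def find_hidden_files(root):
    """Return sorted paths of files under root whose name starts with a dot."""
    hidden = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.startswith("."):
                hidden.append(os.path.join(dirpath, name))
    return sorted(hidden)
```

Calling `find_hidden_files("/path/to/original_run_folder")` lists candidates that a plain `ls` or a file manager would not show by default.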
We've since made improvements to the recovery script over the past year which would have alerted the user to this issue.
Unfortunately, it appears that this data is unrecoverable.
Best regards, Rich