[Windows] Dorado --resume-from fails to open interrupted bam
Issue Report
Please describe the issue:
Running via PowerShell: a Dorado run was interrupted a couple of days into basecalling, leaving an unfinished BAM file. I was hoping to resume from this file, but my attempts to use the --resume-from option have failed.
Steps to reproduce the issue:
Essentially, just run the same command with --resume-from added. I've wondered whether I'm simply giving the wrong path to the file, but attempts at passing an absolute path to the file haven't worked either. I tried both of these:
C:\Users\ONT\A1815.local.bam
.\A1815.local.bam
It is perhaps worth noting that the generated BAM file is rather large (238 GB), which I don't think should be the case.
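One way to check whether the file is still a valid BAM at the byte level is to look at its first few bytes. A BAM file is BGZF-compressed, so it should start with the bytes 1f 8b 08 04. Windows PowerShell 5.1 writes `>` redirections as UTF-16LE text by default, which would put a ff fe byte-order mark at the start and roughly double the file size. Whether that is what happened here is an assumption, not something confirmed in this report. The function name `classify_bam_header` is made up for this sketch.

```python
import sys

BGZF_MAGIC = b"\x1f\x8b\x08\x04"   # gzip magic + deflate + FEXTRA flag, as used by BGZF/BAM
UTF16_LE_BOM = b"\xff\xfe"         # byte-order mark written by UTF-16LE text output


def classify_bam_header(path):
    """Return 'bgzf', 'utf16-text', or 'unknown' based on the first bytes of the file."""
    with open(path, "rb") as handle:
        head = handle.read(4)
    if head.startswith(BGZF_MAGIC):
        return "bgzf"
    if head.startswith(UTF16_LE_BOM):
        return "utf16-text"
    return "unknown"


if __name__ == "__main__" and len(sys.argv) > 1:
    # Example: python check_bam_header.py A1815.local.bam
    print(classify_bam_header(sys.argv[1]))
```

A result of "utf16-text" would suggest the shell re-encoded the output. A result of "bgzf" only says the header is intact; it does not prove the rest of the file is usable.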
Run environment:
- Dorado version: 0.7.3+6e6c45cd
- Dorado command: dorado basecaller hac,5mCG_5hmCG F:\Data\081524_P2_A1815\081524_P2_A1815\20240815_1132_P2S-00718-A_PAY91898_867b13e9/pod5 --verbose --reference C:\Users\ONT\Documents\GCA_000001405.15_GRCh38_no_alt_analysis_set.fna > A1815.local.bam
then to resume:
dorado basecaller hac,5mCG_5hmCG F:\Data\081524_P2_A1815\081524_P2_A1815\20240815_1132_P2S-00718-A_PAY91898_867b13e9/pod5 --reference C:\Users\ONT\Documents\GCA_000001405.15_GRCh38_no_alt_analysis_set.fna --resume-from A1815.local.bam > F:\Data\081524_P2_A1815\bam\A1815.bam
- Operating system: Windows 10 Pro
- Hardware (CPUs, Memory, GPUs): Nvidia 3080, i7-11700K, 128 GB RAM
- Source data type (e.g., pod5 or fast5 - please note we always recommend converting to pod5 for optimal basecalling performance): pod5
- Source data location (on device or networked drive - NFS, etc.): on device
- Details about data (flow cell, kit, read lengths, number of reads, total dataset size in MB/GB/TB): total pod5 size ~1.5 TB, R10
- Dataset to reproduce, if applicable (small subset of data to share as a pod5 to reproduce the issue):
Logs
[2024-09-19 12:10:46.337] [info] Running: "basecaller" "hac,5mCG_5hmCG" "F:\Data\081524_P2_A1815\081524_P2_A1815\20240815_1132_P2S-00718-A_PAY91898_867b13e9/pod5" "--reference" "C:\Users\ONT\Documents\GCA_000001405.15_GRCh38_no_alt_analysis_set.fna" "--resume-from" "A1815.local.bam"
[2024-09-19 12:10:46.405] [info] - downloading [email protected] with httplib
[2024-09-19 12:10:46.870] [info] - downloading [email protected]_5mCG_5hmCG@v1 with httplib
[2024-09-19 12:10:47.275] [info] Normalised: chunksize 10000 -> 9996
[2024-09-19 12:10:47.276] [info] Normalised: overlap 500 -> 498
[2024-09-19 12:10:47.277] [info] > Creating basecall pipeline
[2024-09-19 12:10:54.926] [info] cuda:0 using chunk size 9996, batch size 1152
[2024-09-19 12:10:55.517] [info] cuda:0 using chunk size 4998, batch size 1408
[2024-09-19 12:11:38.079] [info] > Inspecting resume file...
[2024-09-19 12:11:43.053] [error] finalise() not called on a HtsFile.
[2024-09-19 12:11:43.054] [error] Could not open file: A1815.local.bam
Also worth noting: I have looked at other similar issues, #604 and #427.
This appears to be an issue with running Dorado via PowerShell. I ran a small test via CMD in which I interrupted a run, and I was able to resume successfully from the incomplete BAM file.
I think it would be very helpful for future users to have this stated in the README.md section on running Dorado on Windows machines. It could save many days of troubleshooting.
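For anyone who wants to avoid shell redirection entirely, here is a minimal sketch that launches a command and writes its stdout to a file as raw bytes, so neither PowerShell nor CMD handles the data. It assumes the PowerShell problem comes from `>` re-encoding binary output as text, which this report does not confirm. The helper name `run_to_binary_file` is made up for illustration; the dorado arguments in the comment are the ones from the command above.

```python
import subprocess


def run_to_binary_file(cmd, out_path):
    """Run cmd (a list of arguments) and write its raw stdout bytes to out_path.

    No shell is involved, so the output is not re-encoded as text.
    Returns the process exit code.
    """
    with open(out_path, "wb") as out:
        completed = subprocess.run(cmd, stdout=out, check=False)
    return completed.returncode


# Hypothetical usage with the command from this report:
# run_to_binary_file(
#     [
#         "dorado", "basecaller", "hac,5mCG_5hmCG",
#         r"F:\Data\081524_P2_A1815\081524_P2_A1815\20240815_1132_P2S-00718-A_PAY91898_867b13e9/pod5",
#         "--reference",
#         r"C:\Users\ONT\Documents\GCA_000001405.15_GRCh38_no_alt_analysis_set.fna",
#     ],
#     "A1815.local.bam",
# )
```

Passing the arguments as a list rather than a single string keeps the Windows paths from being reinterpreted by a shell. Running the original command from CMD, as in the test above, remains the simpler workaround.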