Various errors from dorado basecaller with RNA hac 5.1 m6A models on a 4090
Issue Report
Please describe the issue:
When I run `dorado basecaller` I get strange errors. The samples are direct RNA sequenced with the RNA004 kit, and I want to map m6A. The issues appear once reads are already loaded; sometimes a couple can be successfully processed before it crashes. Running normally, I get:
Invalid: Remaining data at end of signal buffer
terminate called after throwing an instance of 'std::logic_error'
  what():  ModBaseRunner received signal and sequence chunks with different lengths.
I also run a script that fixes the broken header of an interrupted BAM file and automatically resumes basecalling. This worked on an HPC, but not on my machine: the error is different from what I get when running the command outside the script, and the errors vary between attempts.
Steps to reproduce the issue:
Running the dorado command below on the DRS pod5 file should reproduce it. The script I use is here:
Run environment:
- Dorado version: 0.9.1
- Dorado command: `dorado basecaller hac /home/acee-lab/Didac/uploads/OHMX20250008_001/pod5 --min-qscore 7 --emit-moves --mm2-opts "-x map-ont --secondary=no" --verbose -r --reference ./Refs/Mus_musculus_c57bl6nj_v1.112.pre_transcriptome.fa -x cuda:0 --modified-bases-models ./Models/[email protected]_inosine_m6A@v1 > ./Results/25-03-24_BMN1+2Dorado/T4/T4.bam`
- Operating system: Ubuntu 24.04.1 LTS
- Nvidia versions:
  - Driver version: 560.35.05
  - CUDA version: 12.6
- Hardware (CPUs, Memory, GPUs):
  - CPU: Intel® Core™ i9-14900KF
  - GPU: RTX 4090
  - RAM: 64 GB
- Source data type: pod5
- Source data location: on device
- Details about data:
  - Flow cell: MinION
  - Kit: SQK-RNA004
  - Read lengths: N50 is about 1000
  - Number of reads: between 2M and 9M
  - Total dataset size: 2.2 TB across 13 samples
- Dataset to reproduce: one pod5 file that failed, plus the fasta file used: https://sendanywhe.re/8EXMLFFA
Logs
Attached
Collecting the error messages:
Command:
$DORADO_DIR basecaller hac $POD5_DIR --min-qscore 7 --emit-moves --mm2-opts "-x map-ont --secondary=no" --verbose -r --reference $FASTA -x cuda:0 --resume-from ${BAM_FILE}_temp.bam --modified-bases-models $MODEL > "$BAM_FILE"
uint8_t overflow (3x)
terminate called after throwing an instance of 'std::runtime_error'
what(): value cannot be converted to type uint8_t without overflow
/home/acee-lab/Didac/Code/dorado.sh: line 54: 7517 Aborted (core dumped)
CUDAGuardImpl
terminate called after throwing an instance of 'c10::Error'
what(): d.is_cuda() INTERNAL ASSERT FAILED at "/pytorch/pyold/c10/cuda/impl/CUDAGuardImpl.h":31, please report a bug to PyTorch.
Exception raised from exchangeDevice at /pytorch/pyold/c10/cuda/impl/CUDAGuardImpl.h:31 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x70bb0c2389b7 in /home/acee-lab/Didac/.local/bin/dorado-0.9.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) + 0x68 (0x70bb057bd1de in /home/acee-lab/Didac/.local/bin/dorado-0.9.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
...
caffe2.proto
terminate called after throwing an instance of 'c10::Error'
what(): Unknown device: -1. If you have recently updated the caffe2.proto file to add a new device type, did you forget to update the DeviceTypeName() function to reflect such recent changes?
Exception raised from DeviceTypeName at /pytorch/pyold/c10/core/DeviceType.cpp:55 (most recent call first):
Remaining data at end of signal buffer / Modbase Runner
[2025-03-24 16:43:55.480] [error] Failed to get read 2544 signal: Invalid: Remaining data at end of signal buffer
[2025-03-24 16:43:55.481] [error] Failed to get read 2550 signal: Invalid: Remaining data at end of signal buffer
[2025-03-24 16:43:55.686] [error] Failed to get read 6773 signal: Invalid: Remaining data at end of signal buffer
[2025-03-24 16:43:55.688] [error] Failed to get read 6798 signal: Invalid: Remaining data at end of signal buffer
[2025-03-24 16:43:55.689] [error] Failed to get read 6804 signal: Invalid: Remaining data at end of signal buffer
[2025-03-24 16:43:55.689] [error] Failed to get read 6807 signal: Invalid: Remaining data at end of signal buffer
terminate called after throwing an instance of 'std::logic_error'
what(): ModBaseRunner received signal and sequence chunks with different lengths.
Aborted (core dumped)
Hi @didacjs,
This looks like a data issue but what's in the header of ${BAM_FILE}_temp.bam?
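If it helps, here is a minimal sketch for dumping that header with pysam (the filename is a placeholder; `check_sq=False` lets pysam open a BAM whose header lacks @SQ lines):

```python
import pysam

# Open the interrupted BAM; check_sq=False avoids failing when the
# header is missing its @SQ (reference sequence) records.
with pysam.AlignmentFile("T4_temp.bam", "rb", check_sq=False) as bam:
    print(bam.header)  # Prints the header as SAM text
```

`samtools view -H T4_temp.bam` would show the same thing.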
I've had a look at your pod5 file and found this bad read, which fails to decompress and would cause issues in pod5:
8c63bb2e-556b-45ca-bc77-9a4fe1492744
This snippet will print the read ids which are corrupt:
```python
import pod5 as p5

with p5.Reader("my.pod5") as reader:
    for read in reader.reads():
        try:
            _ = len(read.signal)  # Attempt to decompress the signal
        except Exception:
            print(read.read_id)  # Decompression failed: corrupt read
```
You could use this idea to write out read ids and pass the list to the `pod5 filter` tool to strip the corrupt reads from your pod5 files, as sketched below.
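A minimal sketch of that workflow (filenames are placeholders; since `pod5 filter --ids` selects the reads to keep, the list written here holds the ids of the reads that decompress cleanly):

```python
import pod5 as p5

# Collect the ids of reads whose signal decompresses cleanly.
valid_ids = []
with p5.Reader("my.pod5") as reader:
    for read in reader.reads():
        try:
            _ = len(read.signal)  # Attempt to decompress the signal
            valid_ids.append(str(read.read_id))
        except Exception:
            pass  # Corrupt read: leave it out of the keep-list

# One read id per line, as expected by pod5 filter --ids.
with open("valid_ids.txt", "w") as out:
    for read_id in valid_ids:
        out.write(f"{read_id}\n")
```

Then `pod5 filter my.pod5 --ids valid_ids.txt --output filtered.pod5` should write a pod5 containing only the intact reads.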
Best regards, Rich
Hello,
The pod5 file I sent, and all the others I have, pass this decompression test. Doing an md5sum check I see it got corrupted when uploading it to https://send-anywhere.com. I apologise. Here is another link, this time I tested it: https://fromsmash.com/pod5-4-reproduction. I have also uploaded the header of ${BAM_FILE}_temp.bam. Additionally, I get a different error when specifying the model directory directly; I attach that error as well. Let me know if it should be a separate issue.
Thank you for your time,
Dídac