
Various errors on dorado basecaller RNA hac 5.1 for m6A with 4090

Open didacjs opened this issue 9 months ago • 4 comments

Issue Report

Please describe the issue:

When I run dorado basecaller I get strange errors. The samples are RNA sequenced with the RNA004 kit, and I want to call m6A modifications. The issues appear once the reads are already loaded; sometimes a couple of reads are successfully processed before it crashes. Running the command normally I get:

Invalid: Remaining data at end of signal buffer
terminate called after throwing an instance of 'std::logic_error'
  what():  ModBaseRunner received signal and sequence chunks with different lengths.

I also run a script that fixes the broken header of the interrupted BAM file and automatically resumes the run. This worked on an HPC, but not on my machine. The error is different from the one I get when running the command outside the script, and I get different errors on some of the tries.

Steps to reproduce the issue:

Running the dorado command below on the DRS pod5 file should reproduce it. The script I use is here:

dorado.txt

Run environment:

  • Dorado version: 0.9.1
  • Dorado command: dorado basecaller hac /home/acee-lab/Didac/uploads/OHMX20250008_001/pod5 --min-qscore 7 --emit-moves --mm2-opts "-x map-ont --secondary=no" --verbose -r --reference ./Refs/Mus_musculus_c57bl6nj_v1.112.pre_transcriptome.fa -x cuda:0 --modified-bases-models ./Models/rna004_130bps_hac@v5.1.0_inosine_m6A@v1 > ./Results/25-03-24_BMN1+2Dorado/T4/T4.bam
  • Operating system: Ubuntu 24.04.1 LTS
  • Nvidia versions:
    • Driver Version : 560.35.05
    • CUDA Version: 12.6
  • Hardware (CPUs, Memory, GPUs):
    • CPU: Intel® Core™ i9-14900KF
    • GPU: RTX4090
    • RAM: 64G
  • Source data type : pod5
  • Source data location : on device
  • Details about data :
    • Flow cell: MinION
    • Kit: SQK-RNA004
    • Read lengths: N50 is about 1000
    • Number of reads: between 2M and 9M
    • Total dataset size: 2.2 TB across 13 samples
  • Dataset to reproduce: One pod5 file that failed + the fasta file used https://sendanywhe.re/8EXMLFFA

Logs

Attached

logs_from_CLI.txt logs_from_script.txt

didacjs commented Mar 24 '25 16:03

Collecting the error messages:

Command:

$DORADO_DIR basecaller hac $POD5_DIR --min-qscore 7 --emit-moves --mm2-opts "-x map-ont --secondary=no" --verbose -r --reference $FASTA -x cuda:0 --resume-from ${BAM_FILE}_temp.bam --modified-bases-models $MODEL > "$BAM_FILE"

uint8_t overflow (3x)

terminate called after throwing an instance of 'std::runtime_error'
  what():  value cannot be converted to type uint8_t without overflow
/home/acee-lab/Didac/Code/dorado.sh: line 54:  7517 Aborted                 (core dumped) 

CUDAGuardImpl

terminate called after throwing an instance of 'c10::Error'
  what():  d.is_cuda() INTERNAL ASSERT FAILED at "/pytorch/pyold/c10/cuda/impl/CUDAGuardImpl.h":31, please report a bug to PyTorch. 
Exception raised from exchangeDevice at /pytorch/pyold/c10/cuda/impl/CUDAGuardImpl.h:31 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x70bb0c2389b7 in /home/acee-lab/Didac/.local/bin/dorado-0.9.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) + 0x68 (0x70bb057bd1de in /home/acee-lab/Didac/.local/bin/dorado-0.9.1-linux-x64/bin/../lib/libdorado_torch_lib.so)
...

caffe2.proto

terminate called after throwing an instance of 'c10::Error'
  what():  Unknown device: -1. If you have recently updated the caffe2.proto file to add a new device type, did you forget to update the DeviceTypeName() function to reflect such recent changes?
Exception raised from DeviceTypeName at /pytorch/pyold/c10/core/DeviceType.cpp:55 (most recent call first):

Remaining data at end of signal buffer / ModBaseRunner

[2025-03-24 16:43:55.480] [error] Failed to get read 2544 signal: Invalid: Remaining data at end of signal buffer                                                                            
[2025-03-24 16:43:55.481] [error] Failed to get read 2550 signal: Invalid: Remaining data at end of signal buffer
[2025-03-24 16:43:55.686] [error] Failed to get read 6773 signal: Invalid: Remaining data at end of signal buffer                                                                            
[2025-03-24 16:43:55.688] [error] Failed to get read 6798 signal: Invalid: Remaining data at end of signal buffer
[2025-03-24 16:43:55.689] [error] Failed to get read 6804 signal: Invalid: Remaining data at end of signal buffer
[2025-03-24 16:43:55.689] [error] Failed to get read 6807 signal: Invalid: Remaining data at end of signal buffer
terminate called after throwing an instance of 'std::logic_error'ling                                                                                                                        
  what():  ModBaseRunner received signal and sequence chunks with different lengths.
Aborted (core dumped)

HalfPhoton commented Mar 24 '25 17:03

Hi @didacjs,

This looks like a data issue, but what's in the header of ${BAM_FILE}_temp.bam?
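
One way to dump that header for inspection (a rough sketch assuming pysam is installed and using a placeholder file name; samtools view -H gives the same output):

import pysam

# check_sq=False lets pysam open the file even if the damaged header is
# missing @SQ lines; replace the placeholder path with ${BAM_FILE}_temp.bam.
with pysam.AlignmentFile("T4_temp.bam", "rb", check_sq=False) as bam:
    print(bam.header)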

HalfPhoton commented Mar 24 '25 17:03

I've had a look at your pod5 file and found this bad read, which fails to decompress and would cause issues in pod5.

8c63bb2e-556b-45ca-bc77-9a4fe1492744

This snippet will print the read ids that are corrupt.

import pod5 as p5

with p5.Reader("my.pod5") as reader:
    for read in reader.reads():
        try:
            _ = len(read.signal)  # Attempt to decompress the signal
        except Exception:
            print(read.read_id)

You could use this idea to write out all corrupt read ids and pass the list to the pod5 filter tool to remove the corrupt reads from your pod5 files.
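
Building on that idea, here is a rough sketch (the directory glob and output file names are placeholders, and the exact pod5 filter invocation should be checked with pod5 filter --help, since the tool may expect the ids to keep rather than the ids to drop):

import glob

import pod5 as p5

corrupt_ids, valid_ids = [], []
for path in sorted(glob.glob("pod5/*.pod5")):  # placeholder input directory
    with p5.Reader(path) as reader:
        for read in reader.reads():
            try:
                _ = len(read.signal)  # forces decompression of the signal
                valid_ids.append(str(read.read_id))
            except Exception:
                corrupt_ids.append(str(read.read_id))

# Write both lists so either one can be handed to pod5 filter, depending on
# whether it expects the ids to keep or the ids to discard.
with open("corrupt_read_ids.txt", "w") as fh:
    fh.writelines(rid + "\n" for rid in corrupt_ids)
with open("valid_read_ids.txt", "w") as fh:
    fh.writelines(rid + "\n" for rid in valid_ids)

print(f"{len(corrupt_ids)} corrupt / {len(valid_ids)} intact reads")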

Best regards, Rich

HalfPhoton commented Mar 24 '25 17:03

Hello,

The pod5 file I sent, and all the others I have, pass this decompression test. Doing an md5sum check I see that the file got corrupted when uploading it to https://send-anywhere.com. I apologise. Here is another link, and this time I tested it: https://fromsmash.com/pod5-4-reproduction. I have also uploaded the header of ${BAM_FILE}_temp.bam. Additionally, I get a different error when specifying the model directory directly; I attach that error as well, let me know if it should be a separate issue.
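
A minimal way to run that kind of checksum comparison from Python (standard library only, file name is a placeholder; equivalent to md5sum on the command line, comparing the hash before upload with the hash of the downloaded copy):

import hashlib

def md5sum(path: str, chunk_size: int = 1 << 20) -> str:
    # Hash the file in chunks so large pod5 files are not read into memory at once.
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

print(md5sum("reads.pod5"))  # placeholder file name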

Thank you for your time,

Dídac

header.txt error_when_specifying_model_directory.txt

didacjs commented Mar 25 '25 16:03