
The --resume-from option writes billions of dots to stderr

zwets opened this issue 9 months ago • 8 comments

When resuming an aborted basecaller run with --resume-from, Dorado rapidly writes . characters to stderr and appears to be doing nothing else.

The issue is replicable. My aborted command line (on 0.9.5) was:

dorado basecaller -r --no-trim /path/to/model /path/to/pod5 >basecalled.bam

When this was accidentally aborted after ~4 hours (35% done), it had produced ~1.6GB in basecalled.bam. I then ran the same command with --resume-from:

mv basecalled.bam incomplete.bam
dorado basecaller -r --no-trim --resume-from incomplete.bam /path/to/model /path/to/pod5 >basecalled.bam

This process writes . characters to stderr at a very high rate and keeps a single CPU occupied at 100%. There is no apparent GPU work going on, though it did claim some GPU memory.

Other reports (#966 and #1251) suggest that output might eventually be written, but after 7 hours the basecalled.bam file was still at 0 bytes and I stopped the program.

I'm starting to suspect the process is simply busy writing . characters in a tight loop; tellingly, screen was consuming 100% of a second CPU just rendering them.

zwets avatar Apr 03 '25 17:04 zwets

Hi @zwets,

Can you share the SAM header line which contains the previous dorado command used by --resume-from?

❯ samtools view test.bam -H | head | grep "dorado basecaller"
@PG     ID:basecaller   PN:dorado       VN:0.8.2+6b413c9        CL:dorado basecaller models/[email protected]/ data/ --emit-moves -x cuda:2,3
...

HalfPhoton avatar Apr 04 '25 08:04 HalfPhoton

Hi Richard, thanks for your quick response. There was no output at all from the --resume-from command, so I can't grep for the header.

FYI, here is the header of the "from scratch" BAM (the warning about the EOF marker is presumably there because the run was aborted - another long power cut ...)

$ samtools view basecalled.bam -H | head
[W::bam_hdr_read] EOF marker is absent. The input is probably truncated
@HD     VN:1.6  SO:unknown
@PG     ID:basecaller   PN:dorado       VN:0.9.5+bd7fb217       CL:dorado "basecaller" "-r" "--no-trim" "/project/paddy/models/[email protected]" "/media/zwets/TOSHIBA EXT/st2"       DS:gpu:NVIDIA RTX 2000 Ada Generation Laptop GPU
@PG     ID:samtools     PN:samtools     PP:basecaller   VN:1.20 CL:samtools view -H basecalled.bam
@RG     ID:8dff0cab-bc21-4f5a-bed8-d935fd692fec_dna_r10.4.1_e8.2_400bps_sup@v5.0.0      PU:FBA72692     PM:DESKTOP-UGEE4NN      DT:2024-11-09T10:34:37.588+00:00        PL:ONT  DS:[email protected] runid=8dff0cab-bc21-4f5a-bed8-d935fd692fec        LB:17_32        SM:17_32

Hope this helps?

zwets avatar Apr 04 '25 10:04 zwets

@zwets, thanks. The issue might be that the bam file is truncated and is not getting read correctly by htslib in dorado.

[W::bam_hdr_read] EOF marker is absent. The input is probably truncated

A workaround for this might be to convert the truncated bam into a sam file with samtools view and convert it back into a bam. Hopefully this will restore the EOF marker in the binary bam format and fix the issue. You might want to check the last few reads yourself to make sure they're not junk and delete them from the sam if necessary.
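For example (an untested sketch; the -h keeps the header, which --resume-from also needs for its CL: line):

samtools view -h broken.bam > broken.sam   # decode; expect the EOF-marker warning here
samtools view -b broken.sam > fixed.bam    # re-encode as BGZF, which writes a proper EOF block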

We'll take a look in dorado if there's anything we can do when we read a truncated bam.

Kind regards, Rich

HalfPhoton avatar Apr 04 '25 13:04 HalfPhoton

Hi Richard,

I used

samtools view broken.bam | samtools view -b >fixed.bam

which got rid of the EOF marker warning. The last few reads in the fixed.bam look fine. Also, no more . tsunami when running the resume, but unfortunately it still errors out:

[2025-04-04 21:46:08.839] [info] Running: "basecaller" "-r" "--no-trim" "/project/paddy/models/[email protected]" "/media/zwets/TOSHIBA EXT/st2" "--resume-from" "st2/fixed.bam"
[2025-04-04 21:46:10.961] [info] > Creating basecall pipeline
[2025-04-04 21:46:13.272] [info] Calculating optimized batch size for GPU "NVIDIA RTX 2000 Ada Generation Laptop GPU" and model [email protected]. Full benchmarking will run for this device, which may take some time.
[2025-04-04 21:46:14.118] [info] cuda:0 using chunk size 12288, batch size 32
[2025-04-04 21:46:14.279] [info] cuda:0 using chunk size 6144, batch size 32
[2025-04-04 21:46:14.432] [info] > Inspecting resume file...
[2025-04-04 21:46:14.434] [error] finalise() not called on a HtsFile.
[2025-04-04 21:46:14.449] [error] Cmdline requires at least 4 tokens including binary name, found: 

In issue #1086 it is suggested that the finalise error is a spurious consequence of an earlier failure. A trace doesn't show anything suspicious happening though:

[2025-04-04 22:20:10.124] [info] Running: "basecaller" "-vvv" "-r" "--no-trim" "/project/paddy/models/[email protected]" "/media/zwets/TOSHIBA EXT/st2" "--resume-from" "st2/fixed.bam"
[2025-04-04 22:20:10.326] [trace] Model variant: '/project/paddy/models/[email protected]' unknown - assuming path
[2025-04-04 22:20:12.161] [trace] DeviceGetCount [ENTRY] 
[2025-04-04 22:20:12.161] [trace] DeviceGetCount [EXIT] 
[2025-04-04 22:20:12.175] [trace] Parsed: chunksize 10000 -> 12288
[2025-04-04 22:20:12.175] [trace] Parsed: overlap 500 -> 600
[2025-04-04 22:20:12.175] [info] > Creating basecall pipeline
[2025-04-04 22:20:12.175] [trace] BasecallModelConfig { qscale:1.05 qbias:1.3 stride:6 bias:1 clamp:0 out_features:4096 state_len:5 outsize:4096 blank_score:4.43595e-41 scale:1 num_features:1 sample_rate:5000 sample_type:DNA mean_qscore_start_pos:60 SignalNormalisationParams { strategy:pa StandardisationScalingParams { standardise:1 mean:93.6924 stdev:23.5067} } BatchParams { chunk_size:12288 overlap:600 batch_size:0} convs: { 0: ConvParams { insize:1 size:64 winlen:5 stride:1 activation:swish } 1: ConvParams { insize:64 size:64 winlen:5 stride:1 activation:swish } 2: ConvParams { insize:64 size:128 winlen:9 stride:3 activation:swish } 3: ConvParams { insize:128 size:128 winlen:9 stride:2 activation:swish } 4: ConvParams { insize:128 size:512 winlen:5 stride:2 activation:swish } } model_type: tx { crf_encoder: CRFEncoderParams { insize:512 n_base:4 state_len:5 scale:5 blank_score:2 expand_blanks:1 permute:true } transformer: TxEncoderParams { d_model:512 nhead:8 depth:18 dim_feedforward:2048 theta:10000 max_seq_len:2048 attn_window: [127, 128] deepnorm_alpha:2.44949 } upsample: EncoderUpsampleParams { d_model:512 scale_factor:2 }}}
[2025-04-04 22:20:12.175] [trace] ModBaseBatchParams { batchsize:1024 runners_per_caller:2 threads:4 threshold:0.05 }
[2025-04-04 22:20:12.328] [debug] TxEncoderStack: use_koi_tiled true.
[2025-04-04 22:20:12.328] [debug] TxEncoderStack: use_koi_volta_tiled false.
[2025-04-04 22:20:14.096] [debug] cuda:0 memory available: 7.86GB
[2025-04-04 22:20:14.097] [debug] cuda:0 memory limit 6.86GB
[2025-04-04 22:20:14.097] [debug] cuda:0 maximum safe estimated batch size at chunk size 12288 is 64
[2025-04-04 22:20:14.097] [debug] cuda:0 maximum safe estimated batch size at chunk size 6144 is 160
[2025-04-04 22:20:14.097] [debug] Auto batchsize cuda:0: testing up to 160 in steps of 32
[2025-04-04 22:20:14.097] [info] Calculating optimized batch size for GPU "NVIDIA RTX 2000 Ada Generation Laptop GPU" and model [email protected]. Full benchmarking will run for this device, which may take some time.
[2025-04-04 22:20:14.288] [trace] Auto batchsize cuda:0: iteration:0, ms/chunk 5.948119 ms
[2025-04-04 22:20:14.305] [trace] Auto batchsize cuda:0: iteration:1, ms/chunk 0.530720 ms
[2025-04-04 22:20:14.307] [debug] Auto batchsize cuda:0: 32, time per chunk 0.530720 ms
[2025-04-04 22:20:14.347] [trace] Auto batchsize cuda:0: iteration:0, ms/chunk 0.624254 ms
[2025-04-04 22:20:14.383] [trace] Auto batchsize cuda:0: iteration:1, ms/chunk 0.561571 ms
[2025-04-04 22:20:14.387] [debug] Auto batchsize cuda:0: 64, time per chunk 0.561571 ms
[2025-04-04 22:20:14.450] [trace] Auto batchsize cuda:0: iteration:0, ms/chunk 0.643371 ms
[2025-04-04 22:20:14.513] [trace] Auto batchsize cuda:0: iteration:1, ms/chunk 0.662653 ms
[2025-04-04 22:20:14.518] [debug] Auto batchsize cuda:0: 96, time per chunk 0.643371 ms
[2025-04-04 22:20:14.606] [trace] Auto batchsize cuda:0: iteration:0, ms/chunk 0.689880 ms
[2025-04-04 22:20:14.693] [trace] Auto batchsize cuda:0: iteration:1, ms/chunk 0.675856 ms
[2025-04-04 22:20:14.702] [debug] Auto batchsize cuda:0: 128, time per chunk 0.675856 ms
[2025-04-04 22:20:14.824] [trace] Auto batchsize cuda:0: iteration:0, ms/chunk 0.762853 ms
[2025-04-04 22:20:14.934] [trace] Auto batchsize cuda:0: iteration:1, ms/chunk 0.687859 ms
[2025-04-04 22:20:14.945] [debug] Auto batchsize cuda:0: 160, time per chunk 0.687859 ms
[2025-04-04 22:20:14.945] [debug] Adding chunk timings to internal cache for GPU NVIDIA RTX 2000 Ada Generation Laptop GPU, model [email protected] (1 entries)
[2025-04-04 22:20:14.945] [debug] Largest batch size for cuda:0: 32, time per chunk 0.530720 ms
[2025-04-04 22:20:14.945] [debug] Final batch size for cuda:0[0]: 32
[2025-04-04 22:20:14.945] [debug] Final batch size for cuda:0[1]: 32
[2025-04-04 22:20:14.946] [info] cuda:0 using chunk size 12288, batch size 32
[2025-04-04 22:20:14.946] [debug] cuda:0 Model memory 2.28GB
[2025-04-04 22:20:14.946] [debug] cuda:0 Decode memory 0.28GB
[2025-04-04 22:20:15.108] [info] cuda:0 using chunk size 6144, batch size 32
[2025-04-04 22:20:15.108] [debug] cuda:0 Model memory 1.14GB
[2025-04-04 22:20:15.108] [debug] cuda:0 Decode memory 0.14GB
[2025-04-04 22:20:15.252] [debug] BasecallerNode chunk size 12288
[2025-04-04 22:20:15.252] [debug] BasecallerNode chunk size 6144
[2025-04-04 22:20:15.253] [trace] Setting initial value for both chunk reserve times for worker 0.
[2025-04-04 22:20:15.253] [trace] Setting initial value for both chunk reserve times for worker 1.
[2025-04-04 22:20:15.253] [trace] Setting initial value for both chunk reserve times for worker 3.
[2025-04-04 22:20:15.253] [trace] Setting initial value for both chunk reserve times for worker 2.
[2025-04-04 22:20:15.263] [info] > Inspecting resume file...
[2025-04-04 22:20:15.265] [error] finalise() not called on a HtsFile.
[2025-04-04 22:20:15.273] [error] Cmdline requires at least 4 tokens including binary name, found: 

The chunk/batch lines it outputs are the same as in the original run, so the issue seems to occur in the "Inspecting resume file..." step.

I also tried with just the first 10k reads from the BAM, same result.

Hope the resume functionality can be made to work. This last BAM was stranded in sight of the finish: at 90% after 12 hours of basecalling, ouch 😬

Cheers Marco

zwets avatar Apr 04 '25 20:04 zwets

On a hunch, this:

[2025-04-04 22:20:15.273] [error] Cmdline requires at least 4 tokens including binary name, found:

looks like it may be related to the CL: header, which samtools will have unhelpfully dropped due to its default --no-header behaviour.

So I reran the selection with -h:

samtools view --no-PG -h broken.bam | head -1000 | samtools view --no-PG -b >incomplete.bam

The --no-PG flags were there to stop samtools from meddling with the file at all, but on a second test its added @PG headers turn out not to affect Dorado.
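To confirm that the CL: line dorado parses survived the round trip, a quick check (file name as above):

samtools view -H incomplete.bam | grep 'CL:dorado'   # should show the @PG line carrying the original command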

Unfortunately, a resume run with the resulting BAM again gives the infinite stream of . at 2 x 100% CPU. 😢

zwets avatar Apr 04 '25 21:04 zwets

Yay! I got it working but it gets very weird.

In order to capture dorado's log output (which gets instantly obliterated by the flood of dots), I redirected the process's stderr:

dorado basecaller -r --no-trim --resume-from incomplete.bam /path/to/model /path/to/pod5 >basecalled.bam 2>error.log

To my great surprise, error.log didn't grow beyond 850 bytes. It should've blown up rapidly with dots (the console was now silent).

Better yet: within seconds I heard the GPU fire up, and basecalled.bam started to grow!

The error.log, all hunky dory:

[2025-04-05 00:47:12.999] [info] Running: "basecaller" "-r" "--no-trim" "/project/paddy/models/[email protected]" "/media/zwets/TOSHIBA EXT/st2" "--resume-from" "incomplete.bam"
[2025-04-05 00:47:15.100] [info] > Creating basecall pipeline
[2025-04-05 00:47:16.614] [info] Calculating optimized batch size for GPU "NVIDIA RTX 2000 Ada Generation Laptop GPU" and model [email protected]. Full benchmarking will run for this device, which may take some time.
[2025-04-05 00:47:17.466] [info] cuda:0 using chunk size 12288, batch size 64
[2025-04-05 00:47:17.795] [info] cuda:0 using chunk size 6144, batch size 64
[2025-04-05 00:47:18.052] [info] > Inspecting resume file...
[2025-04-05 00:47:18.052] [info] Resuming from file incomplete.bam...
[2025-04-05 00:47:18.067] [info] > 978 original read ids found in resume file.

No sign of the dots that we were supposedly capturing?!

To double check I reran the exact same command without 2>error.log. Sure enough the dot flood was back, with nothing happening on the GPU or in basecalled.bam even after 10 minutes.

Conclusion: unless we redirect stderr, dorado --resume-from spends all its time filling stderr with dots. Capture it and the dots magically vanish, while Dorado gets down to business. Weird, innit? :-)
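My guess, not verified against the dorado sources, is that the progress reporter checks whether stderr is a terminal and only takes the dot-printing path when it is - the same distinction this shell test makes:

[ -t 2 ] && echo "stderr is a tty (dot path)" || echo "stderr redirected (quiet path)"

That would explain why merely redirecting stderr sidesteps the busy loop.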

zwets avatar Apr 04 '25 23:04 zwets

Hi @zwets,

Thanks for the detailed debugging information it's very much appreciated! This is going to be very helpful to us when we get around to fixing this.

Best regards, Rich

HalfPhoton avatar Apr 07 '25 08:04 HalfPhoton

Thanks Rich,

I'm a bit short on time, else I'd try and track down the bug in the sources.

For now you may want to add a note to the --resume-from documentation that mentions the workaround, i.e. to add the 2>error.log redirect to the basecaller command-line.
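For anyone hitting this, the combination that worked for me, with paths as earlier in this thread (and dropping the head -1000 I had used for testing):

samtools view --no-PG -h broken.bam | samtools view --no-PG -b >incomplete.bam
dorado basecaller -r --no-trim --resume-from incomplete.bam /path/to/model /path/to/pod5 >basecalled.bam 2>error.log
tail -f error.log   # follow dorado's log from another shell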

Cheers Marco

zwets avatar Apr 07 '25 10:04 zwets

Hi @zwets,

I haven't been able to reproduce the bug by intentionally truncating a BAM file and using it in --resume-from so I've got no fix for the dot tsunami.

I did take this opportunity to apply your suggestion to improve the error message if the --resume-from file is missing the CL header. This will certainly help users in future. Thank you for raising this and going through your debugging steps.

Best regards, Rich

HalfPhoton avatar May 16 '25 11:05 HalfPhoton

Closing as I haven't been able to reproduce this issue and we've improved error messaging when using --resume-from without headers.

HalfPhoton avatar May 29 '25 08:05 HalfPhoton

Hi @HalfPhoton

I am experiencing the same issue as @zwets reported. When I try to resume a basecalling process using --resume-from, I get the infinite dot stream without getting any output.

Like in his case, samtools view reports my BAM as truncated:

[W::bam_hdr_read] EOF marker is absent. The input is probably truncated

Apparently my issue is solved by using his solution:

  1. Fixing the BAM using samtools:
     samtools view --no-PG -h broken.bam -b > incomplete.bam
  2. Redirecting the error stream (2>) to a file when running dorado.

With that, the new BAM file does grow. However, I have doubts about whether the file is correct, because of this strange stream of dots.
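In case it helps, a quick way to sanity-check the resumed output with standard samtools (file name as above):

samtools quickcheck basecalled.bam && echo "intact: EOF block present"
samtools flagstat basecalled.bam   # read counts should look plausible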

In case it is helpful: the original dorado process was killed by a forced reboot of the system.

Best regards! Ferran.

FerranLlobetC avatar Sep 16 '25 13:09 FerranLlobetC

@FerranLlobetC, please create a new ticket and reference this one, as this issue has been closed and the information in the template (e.g. Dorado version) is important.

Kind regards, Rich

HalfPhoton avatar Sep 22 '25 08:09 HalfPhoton