dorado
[email protected] crashes my server
Issue Report
When I perform basecalling using the model [email protected], my server crashes.
Please describe the issue:
Please provide a clear and concise description of the issue you are seeing and the result you expect.
After launching basecalling with the model [email protected], the server gets stuck and I have to reboot it. This does not happen with the previous model [email protected].
Steps to reproduce the issue:
Please list any steps to reproduce the issue.
Run environment:
- Dorado version: 0.7.2
- Dorado command: dorado basecaller --min-qscore 10 /[email protected] >
- Operating system: Ubuntu 20.04
- Hardware (CPUs, Memory, GPUs):
- 4 x A100 80GB
- Source data type (e.g., pod5 or fast5 - please note we always recommend converting to pod5 for optimal basecalling performance):
- pod5
- Source data location (on device or networked drive - NFS, etc.):
- local
- Details about data (flow cell, kit, read lengths, number of reads, total dataset size in MB/GB/TB):
- 15 pod5 files
- Dataset to reproduce, if applicable (small subset of data to share as a pod5 to reproduce the issue):
Logs
- Please provide output trace of dorado (run dorado with -v, or -vv on a small subset)
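For reference, a minimal invocation for capturing such a trace on a small subset might look like the sketch below; the model path, input directory, and output filenames are placeholders and are not taken from the report above (only --min-qscore 10 and -vv come from this thread).

# Hypothetical sketch: run dorado with maximum verbosity on a small pod5 subset.
# Paths and output names are placeholders.
dorado basecaller -vv --min-qscore 10 \
    [email protected] \
    pod5_subset/ > calls.bam 2> dorado_trace.log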
Hey @paoloinglese, can you post the output from nvidia-smi?
Hi @iiSeymour, here it is:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14 Driver Version: 550.54.14 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A100-SXM4-80GB On | 00000000:06:00.0 Off | 0 |
| N/A 31C P0 61W / 500W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA A100-SXM4-80GB On | 00000000:07:00.0 Off | 0 |
| N/A 28C P0 60W / 500W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA A100-SXM4-80GB On | 00000000:08:00.0 Off | 0 |
| N/A 31C P0 60W / 500W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA A100-SXM4-80GB On | 00000000:09:00.0 Off | 0 |
| N/A 27C P0 58W / 500W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
These are the last lines of the output before it crashes:
[2024-08-05 11:39:56.431] [info] cuda:0 using chunk size 12288, batch size 512
[2024-08-05 11:39:56.431] [debug] cuda:0 Model memory 36.54GB
[2024-08-05 11:39:56.431] [debug] cuda:0 Decode memory 4.44GB
[2024-08-05 11:39:56.431] [info] cuda:3 using chunk size 12288, batch size 512
[2024-08-05 11:39:56.431] [debug] cuda:3 Model memory 36.54GB
[2024-08-05 11:39:56.431] [debug] cuda:3 Decode memory 4.44GB
[2024-08-05 11:39:56.431] [debug] Largest batch size for cuda:1: 512, time per chunk 0.345104 ms
[2024-08-05 11:39:56.431] [debug] Final batch size for cuda:1[0]: 512
[2024-08-05 11:39:56.431] [debug] Final batch size for cuda:1[1]: 512
[2024-08-05 11:39:56.431] [info] cuda:2 using chunk size 12288, batch size 512
[2024-08-05 11:39:56.431] [debug] cuda:2 Model memory 36.54GB
[2024-08-05 11:39:56.431] [debug] cuda:2 Decode memory 4.44GB
[2024-08-05 11:39:56.435] [info] cuda:1 using chunk size 12288, batch size 512
[2024-08-05 11:39:56.435] [debug] cuda:1 Model memory 36.54GB
[2024-08-05 11:39:56.435] [debug] cuda:1 Decode memory 4.44GB
And these are the corresponding lines from the v4.3.0 model:
[2024-08-05 12:56:30.338] [debug] Final batch size for cuda:1[0]: 1728
[2024-08-05 12:56:30.338] [debug] Final batch size for cuda:1[1]: 1728
[2024-08-05 12:56:30.349] [info] cuda:1 using chunk size 9996, batch size 1728
[2024-08-05 12:56:30.350] [debug] cuda:1 Model memory 29.48GB
[2024-08-05 12:56:30.350] [debug] cuda:1 Decode memory 12.19GB
[2024-08-05 12:56:31.135] [info] cuda:1 using chunk size 4998, batch size 1728
[2024-08-05 12:56:31.135] [debug] cuda:1 Model memory 14.74GB
[2024-08-05 12:56:31.135] [debug] cuda:1 Decode memory 6.09GB
[2024-08-05 12:56:31.500] [debug] Auto batchsize cuda:3: 6848, time per chunk 0.083813 ms
[2024-08-05 12:56:31.794] [debug] Largest batch size for cuda:3: 1728, time per chunk 0.079983 ms
[2024-08-05 12:56:31.794] [debug] Final batch size for cuda:3[0]: 1728
[2024-08-05 12:56:31.794] [debug] Final batch size for cuda:3[1]: 1728
[2024-08-05 12:56:31.796] [info] cuda:3 using chunk size 9996, batch size 1728
[2024-08-05 12:56:31.796] [debug] cuda:3 Model memory 29.48GB
[2024-08-05 12:56:31.796] [debug] cuda:3 Decode memory 12.19GB
[2024-08-05 12:56:32.574] [info] cuda:3 using chunk size 4998, batch size 1728
[2024-08-05 12:56:32.575] [debug] cuda:3 Model memory 14.74GB
[2024-08-05 12:56:32.575] [debug] cuda:3 Decode memory 6.09GB
[2024-08-05 12:56:33.242] [debug] BasecallerNode chunk size 9996
[2024-08-05 12:56:33.242] [debug] BasecallerNode chunk size 4998
[2024-08-05 12:56:33.244] [debug] Load reads from file
[2024-08-05 12:56:33.525] [debug] Load reads from file
I hit the same issue on a board with SXM4 packaging too; just switching to a PCIe GPU works, even an older GPU like an RTX 2070.
Just found out that one GPU is faulty: the model works on the other three when they are used individually.
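For anyone isolating a suspect card the same way, one approach is to point dorado at a single device at a time. The sketch below assumes dorado's --device option and uses placeholder paths; restricting visibility with CUDA_VISIBLE_DEVICES works as well.

# Hypothetical sketch: basecall the same small subset on each GPU in isolation.
# Paths and output names are placeholders.
for gpu in 0 1 2 3; do
    dorado basecaller --min-qscore 10 --device cuda:$gpu \
        [email protected] pod5_subset/ \
        > calls_gpu${gpu}.bam 2> gpu${gpu}.log
done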
Thanks for the update, @paoloinglese.
@Mon3trK that shouldn't be the case 🤔 Are you having success running other applications on your SXM4 devices?
@iiSeymour Yeah, with Guppy my devices run great.
Found the issue in one of my GPUs.
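For others who land here with a similar hang: if a single device is suspected, checking ECC error counters and running NVIDIA's DCGM diagnostics can help confirm a faulty GPU. The commands below are general NVIDIA tooling, not steps taken in this thread, and dcgmi requires the DCGM package to be installed.

# Dump ECC error counters for GPU 0 (repeat for each index).
nvidia-smi -q -d ECC -i 0
# Run a medium-length DCGM diagnostic across all visible GPUs.
dcgmi diag -r 2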