
[email protected] crashes my server

Open piplus2 opened this issue 1 year ago • 3 comments

Issue Report

When I perform basecalling using the model [email protected], my server crashes.

Please describe the issue:

Please provide a clear and concise description of the issue you are seeing and the result you expect.

After launching basecalling with the model [email protected], the server gets stuck and I have to reboot it. This does not happen with the previous model, [email protected].

Steps to reproduce the issue:

Please list any steps to reproduce the issue.

Run environment:

  • Dorado version: 0.7.2
  • Dorado command: dorado basecaller --min-qscore 10 /[email protected] >
  • Operating system: Ubuntu 20.04
  • Hardware (CPUs, Memory, GPUs):
    • 4 x A100 80GB
  • Source data type (e.g., pod5 or fast5 - please note we always recommend converting to pod5 for optimal basecalling performance):
    • pod5
  • Source data location (on device or networked drive - NFS, etc.):
    • local
  • Details about data (flow cell, kit, read lengths, number of reads, total dataset size in MB/GB/TB):
    • 15 pod5 files
  • Dataset to reproduce, if applicable (small subset of data to share as a pod5 to reproduce the issue):

Logs

  • Please provide output trace of dorado (run dorado with -v, or -vv on a small subset)
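A minimal sketch of how such a verbose log could be captured on a small subset (the directory and file names below are placeholders, not the reporter's actual data):

# copy a single pod5 into a scratch directory (hypothetical paths)
mkdir -p /tmp/pod5_subset
cp /data/pod5/one_file.pod5 /tmp/pod5_subset/

# rerun the same command with -vv, keeping stderr for the report
dorado basecaller -vv --min-qscore 10 \
    [email protected] /tmp/pod5_subset \
    > subset_calls.bam 2> dorado_debug.log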

piplus2 commented Aug 02 '24 15:08

Hey @paoloinglese, can you post the output from nvidia-smi?

iiSeymour commented Aug 02 '24 15:08

Hi @iiSeymour,

Here it is:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14              Driver Version: 550.54.14      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-SXM4-80GB          On  |   00000000:06:00.0 Off |                    0 |
| N/A   31C    P0             61W /  500W |       0MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100-SXM4-80GB          On  |   00000000:07:00.0 Off |                    0 |
| N/A   28C    P0             60W /  500W |       0MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA A100-SXM4-80GB          On  |   00000000:08:00.0 Off |                    0 |
| N/A   31C    P0             60W /  500W |       0MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA A100-SXM4-80GB          On  |   00000000:09:00.0 Off |                    0 |
| N/A   27C    P0             58W /  500W |       0MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

piplus2 commented Aug 05 '24 12:08
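Since the whole machine locks up rather than just the dorado process, it can also be worth checking the kernel log for NVIDIA Xid errors after a reboot, and optionally running the DCGM diagnostics. This is a generic GPU-health check, not something specific to dorado:

# look for Xid error codes reported by the NVIDIA driver
sudo dmesg -T | grep -i xid

# if DCGM is installed, run a medium-length diagnostic across the devices
dcgmi diag -r 2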

These are the last lines of the output before it crashes:

[2024-08-05 11:39:56.431] [info] cuda:0 using chunk size 12288, batch size 512
[2024-08-05 11:39:56.431] [debug] cuda:0 Model memory 36.54GB
[2024-08-05 11:39:56.431] [debug] cuda:0 Decode memory 4.44GB
[2024-08-05 11:39:56.431] [info] cuda:3 using chunk size 12288, batch size 512
[2024-08-05 11:39:56.431] [debug] cuda:3 Model memory 36.54GB
[2024-08-05 11:39:56.431] [debug] cuda:3 Decode memory 4.44GB
[2024-08-05 11:39:56.431] [debug] Largest batch size for cuda:1: 512, time per chunk 0.345104 ms
[2024-08-05 11:39:56.431] [debug] Final batch size for cuda:1[0]: 512
[2024-08-05 11:39:56.431] [debug] Final batch size for cuda:1[1]: 512
[2024-08-05 11:39:56.431] [info] cuda:2 using chunk size 12288, batch size 512
[2024-08-05 11:39:56.431] [debug] cuda:2 Model memory 36.54GB
[2024-08-05 11:39:56.431] [debug] cuda:2 Decode memory 4.44GB
[2024-08-05 11:39:56.435] [info] cuda:1 using chunk size 12288, batch size 512
[2024-08-05 11:39:56.435] [debug] cuda:1 Model memory 36.54GB
[2024-08-05 11:39:56.435] [debug] cuda:1 Decode memory 4.44GB

while these are from v4.3.0:

[2024-08-05 12:56:30.338] [debug] Final batch size for cuda:1[0]: 1728
[2024-08-05 12:56:30.338] [debug] Final batch size for cuda:1[1]: 1728
[2024-08-05 12:56:30.349] [info] cuda:1 using chunk size 9996, batch size 1728
[2024-08-05 12:56:30.350] [debug] cuda:1 Model memory 29.48GB
[2024-08-05 12:56:30.350] [debug] cuda:1 Decode memory 12.19GB
[2024-08-05 12:56:31.135] [info] cuda:1 using chunk size 4998, batch size 1728
[2024-08-05 12:56:31.135] [debug] cuda:1 Model memory 14.74GB
[2024-08-05 12:56:31.135] [debug] cuda:1 Decode memory 6.09GB
[2024-08-05 12:56:31.500] [debug] Auto batchsize cuda:3: 6848, time per chunk 0.083813 ms
[2024-08-05 12:56:31.794] [debug] Largest batch size for cuda:3: 1728, time per chunk 0.079983 ms
[2024-08-05 12:56:31.794] [debug] Final batch size for cuda:3[0]: 1728
[2024-08-05 12:56:31.794] [debug] Final batch size for cuda:3[1]: 1728
[2024-08-05 12:56:31.796] [info] cuda:3 using chunk size 9996, batch size 1728
[2024-08-05 12:56:31.796] [debug] cuda:3 Model memory 29.48GB
[2024-08-05 12:56:31.796] [debug] cuda:3 Decode memory 12.19GB
[2024-08-05 12:56:32.574] [info] cuda:3 using chunk size 4998, batch size 1728
[2024-08-05 12:56:32.575] [debug] cuda:3 Model memory 14.74GB
[2024-08-05 12:56:32.575] [debug] cuda:3 Decode memory 6.09GB
[2024-08-05 12:56:33.242] [debug] BasecallerNode chunk size 9996
[2024-08-05 12:56:33.242] [debug] BasecallerNode chunk size 4998
[2024-08-05 12:56:33.244] [debug] Load reads from file 
[2024-08-05 12:56:33.525] [debug] Load reads from file

piplus2 commented Aug 05 '24 13:08
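One thing the two logs show is that the v5.0.0 model is much heavier per batch element: it already uses about 41 GB per GPU at batch size 512 (36.54 GB model + 4.44 GB decode), a total that the v4.3.0 model only reaches at batch size 1728. If memory pressure were suspected, capping the batch size manually is one way to test it; a hedged sketch, with an illustrative value and placeholder paths:

# override the auto-selected batch size (256 is only an example value)
dorado basecaller -vv --min-qscore 10 --batchsize 256 \
    [email protected] /data/pod5 \
    > calls.bam 2> dorado_debug.log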

I got the same issue on GPUs with SXM4 packaging too; just switching to a PCIe GPU works, even an older card like an RTX 2070.

Mon3trK commented Aug 09 '24 12:08

Just found out that one of the GPUs is faulty: the model works fine on the other three when they are used individually.

piplus2 commented Aug 09 '24 12:08
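For anyone else narrowing down a suspect card, running the same small job against one device at a time with dorado's --device option is a quick way to isolate it; a sketch with placeholder paths:

# test each GPU in isolation (device indices and paths are illustrative)
for i in 0 1 2 3; do
    dorado basecaller --device cuda:$i --min-qscore 10 \
        [email protected] /tmp/pod5_subset \
        > calls_gpu$i.bam 2> gpu$i.log
done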

Thanks for the update, @paoloinglese.

@Mon3trK that shouldn't be the case 🤔 Are you having success running other applications on your SXM4 devices?

iiSeymour commented Aug 09 '24 12:08

@iiSeymour Yeah, my devices run great with Guppy.

Mon3trK commented Aug 10 '24 09:08

Found the issue in one of my GPUs.

piplus2 commented Aug 12 '24 10:08