CUDA OUT OF MEMORY (Dependent on # of GPUs)
Issue Report
Please describe the issue: CUDA OUT OF MEMORY PyTorch ERROR (Dependent on # of GPUs)
PyTorch fails to allocate memory (even though memory is available in excess, per nvidia-smi output) when initializing Dorado across more than one A4000 GPU. I would expect Dorado to be able to use the two GPUs in parallel without having to limit memory usage manually via --batchsize.
Steps to reproduce the issue:
When Dorado is initialized with two GPUs, whether by letting it detect the GPUs automatically, by passing --device cuda:0,1, or by setting CUDA_VISIBLE_DEVICES=0,1 for PyTorch, PyTorch raises a "CUDA out of memory" error. If only one GPU is used (--device cuda:0 or --device cuda:1), the error does not occur. Limiting the batch size (--batchsize 250) also allows Dorado to run to completion. However, when we switch to the SUP model, we have not found any batch size that allows Dorado to run across the two GPUs. Note: each A4000 GPU has 16 GB of RAM.
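A minimal sketch of the commands involved (model name and paths as in the full command listed under "Run environment" below; other options omitted for brevity):

# Fails with "CUDA out of memory" when both A4000s are visible:
dorado basecaller [email protected] pod5s/ -v --device cuda:0,1 >calls.bam
CUDA_VISIBLE_DEVICES=0,1 dorado basecaller [email protected] pod5s/ -v >calls.bam

# Runs without error on a single GPU:
dorado basecaller [email protected] pod5s/ -v --device cuda:0 >calls.bam

# Runs to completion on both GPUs (with this model) only when the batch size is capped manually:
dorado basecaller [email protected] pod5s/ -v --device cuda:0,1 --batchsize 250 >calls.bam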
Run environment:
- Dorado version: 0.5.2
- CUDA version: 12.3. The paths to all libraries included with 0.5.2 are explicitly defined.
- Dorado command: dorado basecaller [email protected] pod5s/ -v --modified-bases 6mA 5mC_5hmC --reference AssemblyScaffolds.fasta --kit-name SQK-NBD114-96 >calls.bam
- Operating system: RHEL 8
- Hardware (CPUs, Memory, GPUs): Intel Xeon Gold 36 core, 96GB RAM, 2X A4000 Nvidia GPUs
- Source data type (e.g., pod5 or fast5 - please note we always recommend converting to pod5 for optimal basecalling performance): pod5
- Source data location (on device or networked drive - NFS, etc.): local SSD
- Details about data (flow cell, kit, read lengths, number of reads, total dataset size in MB/GB/TB): ~300GB
- Dataset to reproduce, if applicable (small subset of data to share as a pod5 to reproduce the issue):
Logs
- Please provide output trace of dorado (run dorado with -v, or -vv on a small subset): I will follow up with specific errors when my machine is brought back online.
Currently unavailable, will update when possible.
Hi @ericmsmall, thanks for the detailed report.
When initializing Dorado with 2X GPUs...
Just to clarify: does the error occur before basecalling starts, i.e. during auto batch size selection?
We're continuously working on improving the auto batch size algorithm to get the best basecalling performance out of the hardware while remaining stable. I'll report your specific hardware configuration back to the team to see what we can do in a future release of Dorado.
The -v verbose output now shows much more information about the auto batch size calculation. This should help guide you to the optimal batch size for your hardware if you wish to experiment further around your known-good -b 250 value.
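For example, you could rerun a small subset of your pod5s with a few fixed batch sizes around that value and compare the verbose logs. A rough sketch, where pod5_subset/ is a hypothetical directory holding a handful of files:

# Sweep a few fixed batch sizes on a small pod5 subset and keep the verbose logs
for b in 250 320 384 448; do
  dorado basecaller [email protected] pod5_subset/ -v -b "$b" >calls_b${b}.bam 2>dorado_b${b}.log
done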
Kind regards, Rich
@HalfPhoton Here are the console outputs from dorado -v. Thanks!
SUP with batchsize 250
dorado basecaller [email protected] pod5s/ -v --modified-bases 6mA 5mC_5hmC --batchsize 250 --reference X.fasta --kit-name SQK-NBD114-96 >calls.bam
[2024-02-07 11:58:08.679] [debug] - matching modification model found: [email protected]_6mA@v2
[2024-02-07 11:58:08.680] [debug] - matching modification model found: [email protected]_5mC_5hmC@v1
[2024-02-07 11:58:08.680] [info] > Creating basecall pipeline
[2024-02-07 11:58:26.831] [debug] cuda:1 memory available: 14.54GB
[2024-02-07 11:58:26.831] [debug] Auto batchsize cuda:1: memory limit 13.54GB
[2024-02-07 11:58:26.831] [debug] Maximum safe estimated batch size for cuda:1: 512
[2024-02-07 11:58:26.831] [debug] Device cuda:1 Model memory 8.73GB
[2024-02-07 11:58:26.831] [debug] Device cuda:1 Decode memory 3.61GB
[2024-02-07 11:58:26.832] [debug] cuda:0 memory available: 14.57GB
[2024-02-07 11:58:26.832] [debug] Auto batchsize cuda:0: memory limit 13.57GB
[2024-02-07 11:58:26.832] [debug] Maximum safe estimated batch size for cuda:0: 512
[2024-02-07 11:58:26.832] [debug] Device cuda:0 Model memory 8.73GB
[2024-02-07 11:58:26.832] [debug] Device cuda:0 Decode memory 3.61GB
[2024-02-07 11:58:27.840] [warning] - set batch size for cuda:0 to 256
[2024-02-07 11:58:27.848] [warning] - set batch size for cuda:1 to 256
[2024-02-07 11:58:27.849] [debug] - adjusted chunk size to match model stride: 10000 -> 9996
[2024-02-07 11:58:29.122] [debug] > Map parameters input by user: dbg print qname=false and aln seq=false.
[2024-02-07 11:58:29.710] [debug] Creating barcoding info for kit: SQK-NBD114-96
[2024-02-07 11:58:29.710] [info] Barcode for SQK-NBD114-96
[2024-02-07 11:58:29.711] [debug] - adjusted overlap to match model stride: 500 -> 498
[2024-02-07 11:58:29.727] [debug] Load reads from file Nanno/PAS40213_fail_barcode01_cf87646e_8cd667f2_0.pod5
[2024-02-07 11:58:29.729] [debug] Load reads from file Nanno/PAS40213_fail_barcode03_cf87646e_8cd667f2_0.pod5
[2024-02-07 11:58:29.732] [debug] Load reads from file Nanno/PAS40213_fail_barcode04_cf87646e_8cd667f2_0.pod5
[2024-02-07 11:58:30.233] [debug] Load reads from file Nanno/PAS40213_fail_barcode05_cf87646e_8cd667f2_0.pod5
[2024-02-07 11:58:30.234] [debug] Load reads from file Nanno/PAS40213_fail_barcode06_cf87646e_8cd667f2_0.pod5
[2024-02-07 11:58:30.235] [debug] Load reads from file Nanno/PAS40213_fail_barcode07_cf87646e_8cd667f2_0.pod5
[2024-02-07 11:58:30.236] [debug] Load reads from file Nanno/PAS40213_fail_barcode09_cf87646e_8cd667f2_0.pod5
[2024-02-07 11:58:30.237] [debug] Load reads from file Nanno/PAS40213_fail_barcode10_cf87646e_8cd667f2_0.pod5
[2024-02-07 11:58:30.246] [debug] Load reads from file Nanno/PAS40213_fail_barcode11_cf87646e_8cd667f2_0.pod5
[2024-02-07 11:58:30.247] [debug] Load reads from file Nanno/PAS40213_fail_barcode12_cf87646e_8cd667f2_0.pod5
[2024-02-07 11:58:30.973] [debug] > Kits to evaluate: 1
[2024-02-07 11:58:35.724] [warning] Caught Torch error 'CUDA out of memory. Tried to allocate 1.68 GiB (GPU 1; 15.71 GiB total capacity; 4.89 GiB already allocated; 1.07 GiB free; 5.24 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF', clearing CUDA cache and retrying.
HAC or SUP without batchsize
dorado basecaller [email protected] pod5s/ -v --modified-bases 6mA 5mC_5hmC --reference X.fasta --kit-name SQK-NBD114-96 >calls.bam
[2024-02-07 12:00:54.633] [debug] - matching modification model found: [email protected]_6mA@v2
[2024-02-07 12:00:54.633] [debug] - matching modification model found: [email protected]_5mC_5hmC@v1
[2024-02-07 12:00:54.634] [info] > Creating basecall pipeline
[2024-02-07 12:00:56.223] [debug] cuda:0 memory available: 14.60GB
[2024-02-07 12:00:56.223] [debug] Auto batchsize cuda:0: memory limit 13.60GB
[2024-02-07 12:00:56.223] [debug] Auto batchsize cuda:0: testing up to 2048 in steps of 64
[2024-02-07 12:00:56.242] [debug] cuda:1 memory available: 14.51GB
[2024-02-07 12:00:56.242] [debug] Auto batchsize cuda:1: memory limit 13.51GB
[2024-02-07 12:00:56.242] [debug] Auto batchsize cuda:1: testing up to 1984 in steps of 64
[2024-02-07 12:00:56.561] [debug] Auto batchsize cuda:0: 64, time per chunk 2.405376 ms
[2024-02-07 12:00:56.564] [debug] Auto batchsize cuda:1: 64, time per chunk 2.414515 ms
[2024-02-07 12:00:56.880] [debug] Auto batchsize cuda:0: 128, time per chunk 1.241856 ms
[2024-02-07 12:00:56.883] [debug] Auto batchsize cuda:1: 128, time per chunk 1.242438 ms
[2024-02-07 12:00:57.192] [debug] Auto batchsize cuda:0: 192, time per chunk 0.811270 ms
[2024-02-07 12:00:57.195] [debug] Auto batchsize cuda:1: 192, time per chunk 0.809908 ms
[2024-02-07 12:00:57.512] [debug] Auto batchsize cuda:0: 256, time per chunk 0.623080 ms
[2024-02-07 12:00:57.517] [debug] Auto batchsize cuda:1: 256, time per chunk 0.627692 ms
[2024-02-07 12:00:57.834] [debug] Auto batchsize cuda:0: 320, time per chunk 0.501120 ms
[2024-02-07 12:00:57.836] [debug] Auto batchsize cuda:1: 320, time per chunk 0.498966 ms
[2024-02-07 12:00:58.156] [debug] Auto batchsize cuda:0: 384, time per chunk 0.419293 ms
[2024-02-07 12:00:58.160] [debug] Auto batchsize cuda:1: 384, time per chunk 0.421686 ms
[2024-02-07 12:00:58.475] [debug] Auto batchsize cuda:0: 448, time per chunk 0.353337 ms
[2024-02-07 12:00:58.482] [debug] Auto batchsize cuda:1: 448, time per chunk 0.357237 ms
[2024-02-07 12:00:58.792] [debug] Auto batchsize cuda:0: 512, time per chunk 0.309610 ms
[2024-02-07 12:00:58.800] [debug] Auto batchsize cuda:1: 512, time per chunk 0.310585 ms
[2024-02-07 12:00:59.123] [debug] Auto batchsize cuda:0: 576, time per chunk 0.287176 ms
[2024-02-07 12:00:59.125] [debug] Auto batchsize cuda:1: 576, time per chunk 0.282584 ms
[2024-02-07 12:00:59.451] [debug] Auto batchsize cuda:1: 640, time per chunk 0.254069 ms
[2024-02-07 12:00:59.454] [debug] Auto batchsize cuda:0: 640, time per chunk 0.258116 ms
[2024-02-07 12:00:59.780] [debug] Auto batchsize cuda:1: 704, time per chunk 0.233327 ms
[2024-02-07 12:00:59.782] [debug] Auto batchsize cuda:0: 704, time per chunk 0.233123 ms
[2024-02-07 12:01:00.109] [debug] Auto batchsize cuda:1: 768, time per chunk 0.214251 ms
[2024-02-07 12:01:00.111] [debug] Auto batchsize cuda:0: 768, time per chunk 0.213947 ms
[2024-02-07 12:01:00.437] [debug] Auto batchsize cuda:1: 832, time per chunk 0.196385 ms
[2024-02-07 12:01:00.445] [debug] Auto batchsize cuda:0: 832, time per chunk 0.199301 ms
[2024-02-07 12:01:00.766] [debug] Auto batchsize cuda:1: 896, time per chunk 0.183439 ms
[2024-02-07 12:01:00.778] [debug] Auto batchsize cuda:0: 896, time per chunk 0.186150 ms
[2024-02-07 12:01:01.097] [debug] Auto batchsize cuda:1: 960, time per chunk 0.171900 ms
[2024-02-07 12:01:01.110] [debug] Auto batchsize cuda:0: 960, time per chunk 0.172577 ms
[2024-02-07 12:01:01.433] [debug] Auto batchsize cuda:1: 1024, time per chunk 0.163234 ms
[2024-02-07 12:01:01.443] [debug] Auto batchsize cuda:0: 1024, time per chunk 0.162131 ms
[2024-02-07 12:01:01.783] [debug] Auto batchsize cuda:1: 1088, time per chunk 0.160965 ms
[2024-02-07 12:01:01.790] [debug] Auto batchsize cuda:0: 1088, time per chunk 0.157000 ms
[2024-02-07 12:01:02.137] [debug] Auto batchsize cuda:1: 1152, time per chunk 0.146707 ms
[2024-02-07 12:01:02.149] [debug] Auto batchsize cuda:0: 1152, time per chunk 0.148872 ms
[2024-02-07 12:01:02.478] [debug] Auto batchsize cuda:1: 1216, time per chunk 0.139191 ms
[2024-02-07 12:01:02.488] [debug] Auto batchsize cuda:0: 1216, time per chunk 0.138505 ms
[2024-02-07 12:01:02.817] [debug] Auto batchsize cuda:1: 1280, time per chunk 0.130977 ms
[2024-02-07 12:01:02.842] [debug] Auto batchsize cuda:0: 1280, time per chunk 0.133061 ms
[2024-02-07 12:01:03.158] [debug] Auto batchsize cuda:1: 1344, time per chunk 0.127034 ms
[2024-02-07 12:01:03.185] [debug] Auto batchsize cuda:0: 1344, time per chunk 0.127408 ms
[2024-02-07 12:01:03.506] [debug] Auto batchsize cuda:1: 1408, time per chunk 0.122362 ms
[2024-02-07 12:01:03.542] [debug] Auto batchsize cuda:0: 1408, time per chunk 0.125748 ms
[2024-02-07 12:01:03.867] [debug] Auto batchsize cuda:1: 1472, time per chunk 0.121551 ms
[2024-02-07 12:01:03.912] [debug] Auto batchsize cuda:0: 1472, time per chunk 0.124647 ms
[2024-02-07 12:01:04.239] [debug] Auto batchsize cuda:1: 1536, time per chunk 0.120089 ms
[2024-02-07 12:01:04.295] [debug] Auto batchsize cuda:0: 1536, time per chunk 0.123888 ms
[2024-02-07 12:01:04.611] [debug] Auto batchsize cuda:1: 1600, time per chunk 0.113680 ms
[2024-02-07 12:01:04.669] [debug] Auto batchsize cuda:0: 1600, time per chunk 0.116839 ms
[2024-02-07 12:01:04.983] [debug] Auto batchsize cuda:1: 1664, time per chunk 0.110649 ms
[2024-02-07 12:01:05.054] [debug] Auto batchsize cuda:0: 1664, time per chunk 0.114667 ms
[2024-02-07 12:01:05.364] [debug] Auto batchsize cuda:1: 1728, time per chunk 0.108722 ms
[2024-02-07 12:01:05.449] [debug] Auto batchsize cuda:0: 1728, time per chunk 0.113264 ms
[2024-02-07 12:01:05.757] [debug] Auto batchsize cuda:1: 1792, time per chunk 0.107558 ms
[2024-02-07 12:01:05.853] [debug] Auto batchsize cuda:0: 1792, time per chunk 0.111525 ms
[2024-02-07 12:01:06.156] [debug] Auto batchsize cuda:1: 1856, time per chunk 0.106708 ms
[2024-02-07 12:01:06.275] [debug] Auto batchsize cuda:0: 1856, time per chunk 0.110296 ms
[2024-02-07 12:01:06.557] [debug] Auto batchsize cuda:1: 1920, time per chunk 0.104293 ms
[2024-02-07 12:01:06.693] [debug] Auto batchsize cuda:0: 1920, time per chunk 0.108546 ms
[2024-02-07 12:01:06.965] [debug] Auto batchsize cuda:1: 1984, time per chunk 0.102632 ms
[2024-02-07 12:01:06.965] [debug] Device cuda:1 Model memory 9.31GB
[2024-02-07 12:01:06.965] [debug] Device cuda:1 Decode memory 3.84GB
[2024-02-07 12:01:07.621] [info] - set batch size for cuda:0 to 2048
[2024-02-07 12:01:07.682] [info] - set batch size for cuda:1 to 1984
[2024-02-07 12:01:07.683] [debug] - adjusted chunk size to match model stride: 10000 -> 9996
[2024-02-07 12:01:08.523] [debug] > Map parameters input by user: dbg print qname=false and aln seq=false.
[2024-02-07 12:01:09.075] [debug] Creating barcoding info for kit: SQK-NBD114-96
[2024-02-07 12:01:09.075] [info] Barcode for SQK-NBD114-96
[2024-02-07 12:01:09.075] [debug] - adjusted overlap to match model stride: 500 -> 498
[2024-02-07 12:01:09.092] [debug] Load reads from file Nanno/PAS40213_fail_barcode01_cf87646e_8cd667f2_0.pod5
[2024-02-07 12:01:09.094] [debug] Load reads from file Nanno/PAS40213_fail_barcode03_cf87646e_8cd667f2_0.pod5
[2024-02-07 12:01:09.096] [debug] Load reads from file Nanno/PAS40213_fail_barcode04_cf87646e_8cd667f2_0.pod5
[2024-02-07 12:01:09.205] [warning] Caught Torch error 'CUDA out of memory. Tried to allocate 8.95 GiB (GPU 0; 15.73 GiB total capacity; 175.37 MiB already allocated; 5.68 GiB free; 254.00 MiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF', clearing CUDA cache and retrying.
terminate called after throwing an instance of 'c10::OutOfMemoryError'
what(): CUDA out of memory. Tried to allocate 8.95 GiB (GPU 0; 15.73 GiB total capacity; 175.37 MiB already allocated; 5.68 GiB free; 254.00 MiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Exception raised from malloc at /pytorch/pyold/c10/cuda/CUDACachingAllocator.cpp:913 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f9724f639b7 in /home/XXXXXXX/Dorado/dorado-0.5.2-linux-x64/bin/../lib/libdorado_torch_lib.so)
frame #1:
Aborted (core dumped)
Those debug traces suggest that you might have other processes running on your GPUs. These can have a negative impact on the auto batch size calculation. Here we see that your ~16 GiB card has only ~6 GiB free:
[2024-02-07 12:01:09.205] [warning] Caught Torch error 'CUDA out of memory. Tried to allocate 8.95 GiB (GPU 0; 15.73 GiB total capacity; ... 5.68 GiB free;
You should be able to see what else is running on your GPUs by running nvidia-smi.
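For example, polling per-GPU memory once a second with standard nvidia-smi query options while Dorado starts up in another terminal will show whether anything else is holding memory:

# Print index, used memory, and total memory for each GPU every second
nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv -l 1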
At first I thought that might be the case as well, but until Dorado is started the GPUs are idle and no memory is in use, like this:
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05 Driver Version: 535.154.05 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA RTX A4000 Off | 00000000:65:00.0 Off | Off |
| 0% 37C P0 7W / 140W | 0MiB / 16376MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA RTX A4000 Off | 00000000:B3:00.0 Off | Off |
| 0% 39C P0 8W / 140W | 0MiB / 16376MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
Hi @ericmsmall
Can you try setting the following environment variable and see if it helps?
PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:25
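For example, in a bash shell you can set it just for the Dorado invocation (command trimmed to the essentials):

# Apply the allocator option only to this run
PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:25 \
  dorado basecaller [email protected] pod5s/ -v --device cuda:0,1 >calls.bam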
Apologies, I forgot that I had already tried this based on other folks' posts. Setting this as an environment variable, or passing it directly to PyTorch, had no effect.
A couple more things to try to narrow down the issue (see the command sketch after this list):
- Try without the mod bases to see if plain basecalling shows the same problem.
- Try one modbase at a time (instead of both 6mA and 5mC_5hmC).
- Manually set the batch size to something smaller, like -b 1536 or -b 1792.
- Increase max_split_size_mb to something larger, like 64/128.
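A rough sketch of those runs as commands, using the flag values above and the model/paths from your original command (output file names are just placeholders; other options trimmed):

# 1. Plain basecalling, no modified bases
dorado basecaller [email protected] pod5s/ -v >calls_plain.bam

# 2. One modbase at a time
dorado basecaller [email protected] pod5s/ -v --modified-bases 6mA >calls_6mA.bam
dorado basecaller [email protected] pod5s/ -v --modified-bases 5mC_5hmC >calls_5mC_5hmC.bam

# 3. Smaller manual batch size
dorado basecaller [email protected] pod5s/ -v -b 1536 >calls_b1536.bam

# 4. Larger max_split_size_mb
PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128 dorado basecaller [email protected] pod5s/ -v >calls_split128.bam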
Closing as there's been no reply.