OOM when finetuning using multi-GPUs
What is your question?
Dear authors, thanks a lot for this great work! I'm getting an OOM while finetuning avhubert on my own dataset using multiple GPUs, and the error usually happens in an epoch after the first one:
fairseq-hydra-train --config-dir /my/config --config-name myconfig.yaml hydra.run.dir=../saved_model/20220311_1 common.user_dir=`pwd` distributed_training.ddp_backend=c10d distributed_training.distributed_world_size=4 distributed_training.nprocs_per_node=4
The OOM happens randomly on one GPU:
2022-03-18 21:04:26 | WARNING | fairseq.trainer | OOM: Ran out of memory with exception: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 2; 22.38 GiB total capacity; 21.16 GiB already allocated; 19.94 MiB free; 21.54 GiB reserved in total by PyTorch)
2022-03-18 21:04:26 | WARNING | fairseq.trainer | |===========================================================================|
| PyTorch CUDA memory summary, device ID 0 |
|---------------------------------------------------------------------------|
| CUDA OOMs: 0 | cudaMalloc retries: 0 |
|===========================================================================|
| Metric | Cur Usage | Peak Usage | Tot Alloc | Tot Freed |
|---------------------------------------------------------------------------|
| Allocated memory | 0 B | 0 B | 0 B | 0 B |
| from large pool | 0 B | 0 B | 0 B | 0 B |
| from small pool | 0 B | 0 B | 0 B | 0 B |
|---------------------------------------------------------------------------|
| Active memory | 0 B | 0 B | 0 B | 0 B |
| from large pool | 0 B | 0 B | 0 B | 0 B |
| from small pool | 0 B | 0 B | 0 B | 0 B |
|---------------------------------------------------------------------------|
| GPU reserved memory | 0 B | 0 B | 0 B | 0 B |
| from large pool | 0 B | 0 B | 0 B | 0 B |
| from small pool | 0 B | 0 B | 0 B | 0 B |
|---------------------------------------------------------------------------|
| Non-releasable memory | 0 B | 0 B | 0 B | 0 B |
| from large pool | 0 B | 0 B | 0 B | 0 B |
| from small pool | 0 B | 0 B | 0 B | 0 B |
|---------------------------------------------------------------------------|
| Allocations | 0 | 0 | 0 | 0 |
| from large pool | 0 | 0 | 0 | 0 |
| from small pool | 0 | 0 | 0 | 0 |
|---------------------------------------------------------------------------|
| Active allocs | 0 | 0 | 0 | 0 |
| from large pool | 0 | 0 | 0 | 0 |
| from small pool | 0 | 0 | 0 | 0 |
|---------------------------------------------------------------------------|
| GPU reserved segments | 0 | 0 | 0 | 0 |
| from large pool | 0 | 0 | 0 | 0 |
| from small pool | 0 | 0 | 0 | 0 |
|---------------------------------------------------------------------------|
| Non-releasable allocs | 0 | 0 | 0 | 0 |
| from large pool | 0 | 0 | 0 | 0 |
| from small pool | 0 | 0 | 0 | 0 |
|===========================================================================|
2022-03-18 21:04:26 | WARNING | fairseq.trainer | |===========================================================================|
| PyTorch CUDA memory summary, device ID 1 |
|---------------------------------------------------------------------------|
| CUDA OOMs: 0 | cudaMalloc retries: 0 |
|===========================================================================|
| Metric | Cur Usage | Peak Usage | Tot Alloc | Tot Freed |
|---------------------------------------------------------------------------|
| Allocated memory | 0 B | 0 B | 0 B | 0 B |
| from large pool | 0 B | 0 B | 0 B | 0 B |
| from small pool | 0 B | 0 B | 0 B | 0 B |
|---------------------------------------------------------------------------|
| Active memory | 0 B | 0 B | 0 B | 0 B |
| from large pool | 0 B | 0 B | 0 B | 0 B |
| from small pool | 0 B | 0 B | 0 B | 0 B |
|---------------------------------------------------------------------------|
| GPU reserved memory | 0 B | 0 B | 0 B | 0 B |
| from large pool | 0 B | 0 B | 0 B | 0 B |
| from small pool | 0 B | 0 B | 0 B | 0 B |
|---------------------------------------------------------------------------|
| Non-releasable memory | 0 B | 0 B | 0 B | 0 B |
| from large pool | 0 B | 0 B | 0 B | 0 B |
| from small pool | 0 B | 0 B | 0 B | 0 B |
|---------------------------------------------------------------------------|
| Allocations | 0 | 0 | 0 | 0 |
| from large pool | 0 | 0 | 0 | 0 |
| from small pool | 0 | 0 | 0 | 0 |
|---------------------------------------------------------------------------|
| Active allocs | 0 | 0 | 0 | 0 |
| from large pool | 0 | 0 | 0 | 0 |
| from small pool | 0 | 0 | 0 | 0 |
|---------------------------------------------------------------------------|
| GPU reserved segments | 0 | 0 | 0 | 0 |
| from large pool | 0 | 0 | 0 | 0 |
| from small pool | 0 | 0 | 0 | 0 |
|---------------------------------------------------------------------------|
| Non-releasable allocs | 0 | 0 | 0 | 0 |
| from large pool | 0 | 0 | 0 | 0 |
| from small pool | 0 | 0 | 0 | 0 |
|===========================================================================|
2022-03-18 21:04:26 | WARNING | fairseq.trainer | |===========================================================================|
| PyTorch CUDA memory summary, device ID 2 |
|---------------------------------------------------------------------------|
| CUDA OOMs: 1 | cudaMalloc retries: 8 |
|===========================================================================|
| Metric | Cur Usage | Peak Usage | Tot Alloc | Tot Freed |
|---------------------------------------------------------------------------|
| Allocated memory | 21660 MB | 21678 MB | 93887 GB | 93865 GB |
| from large pool | 21640 MB | 21663 MB | 93001 GB | 92980 GB |
| from small pool | 19 MB | 19 MB | 885 GB | 885 GB |
|---------------------------------------------------------------------------|
| Active memory | 21660 MB | 21678 MB | 93887 GB | 93865 GB |
| from large pool | 21640 MB | 21663 MB | 93001 GB | 92980 GB |
| from small pool | 19 MB | 19 MB | 885 GB | 885 GB |
|---------------------------------------------------------------------------|
| GPU reserved memory | 22062 MB | 22078 MB | 61016 MB | 38954 MB |
| from large pool | 22040 MB | 22060 MB | 60488 MB | 38448 MB |
| from small pool | 22 MB | 176 MB | 528 MB | 506 MB |
|---------------------------------------------------------------------------|
| Non-releasable memory | 411642 KB | 7842 MB | 189965 GB | 189965 GB |
| from large pool | 409546 KB | 7828 MB | 188976 GB | 188976 GB |
| from small pool | 2096 KB | 14 MB | 989 GB | 989 GB |
|---------------------------------------------------------------------------|
| Allocations | 1810 | 1879 | 28459 K | 28457 K |
| from large pool | 660 | 662 | 6158 K | 6157 K |
| from small pool | 1150 | 1299 | 22300 K | 22299 K |
|---------------------------------------------------------------------------|
| Active allocs | 1810 | 1879 | 28459 K | 28457 K |
| from large pool | 660 | 662 | 6158 K | 6157 K |
| from small pool | 1150 | 1299 | 22300 K | 22299 K |
|---------------------------------------------------------------------------|
| GPU reserved segments | 173 | 244 | 572 | 399 |
| from large pool | 162 | 163 | 308 | 146 |
| from small pool | 11 | 88 | 264 | 253 |
|---------------------------------------------------------------------------|
| Non-releasable allocs | 144 | 214 | 9561 K | 9560 K |
| from large pool | 135 | 161 | 2333 K | 2333 K |
| from small pool | 9 | 58 | 7227 K | 7227 K |
|===========================================================================|
2022-03-18 21:04:26 | WARNING | fairseq.trainer | |===========================================================================|
| PyTorch CUDA memory summary, device ID 3 |
|---------------------------------------------------------------------------|
| CUDA OOMs: 0 | cudaMalloc retries: 0 |
|===========================================================================|
| Metric | Cur Usage | Peak Usage | Tot Alloc | Tot Freed |
|---------------------------------------------------------------------------|
| Allocated memory | 0 B | 0 B | 0 B | 0 B |
| from large pool | 0 B | 0 B | 0 B | 0 B |
| from small pool | 0 B | 0 B | 0 B | 0 B |
|---------------------------------------------------------------------------|
| Active memory | 0 B | 0 B | 0 B | 0 B |
| from large pool | 0 B | 0 B | 0 B | 0 B |
| from small pool | 0 B | 0 B | 0 B | 0 B |
|---------------------------------------------------------------------------|
| GPU reserved memory | 0 B | 0 B | 0 B | 0 B |
| from large pool | 0 B | 0 B | 0 B | 0 B |
| from small pool | 0 B | 0 B | 0 B | 0 B |
|---------------------------------------------------------------------------|
| Non-releasable memory | 0 B | 0 B | 0 B | 0 B |
| from large pool | 0 B | 0 B | 0 B | 0 B |
| from small pool | 0 B | 0 B | 0 B | 0 B |
|---------------------------------------------------------------------------|
| Allocations | 0 | 0 | 0 | 0 |
| from large pool | 0 | 0 | 0 | 0 |
| from small pool | 0 | 0 | 0 | 0 |
|---------------------------------------------------------------------------|
| Active allocs | 0 | 0 | 0 | 0 |
| from large pool | 0 | 0 | 0 | 0 |
| from small pool | 0 | 0 | 0 | 0 |
|---------------------------------------------------------------------------|
| GPU reserved segments | 0 | 0 | 0 | 0 |
| from large pool | 0 | 0 | 0 | 0 |
| from small pool | 0 | 0 | 0 | 0 |
|---------------------------------------------------------------------------|
| Non-releasable allocs | 0 | 0 | 0 | 0 |
| from large pool | 0 | 0 | 0 | 0 |
| from small pool | 0 | 0 | 0 | 0 |
|===========================================================================|
I have tried using no_c10d and pytorch_ddp as the ddp_backend, tried downgrading PyTorch to 1.9.1 or 1.8.0 according to this issue, and also checked my dataset (using max_tokens instead of batch_size to avoid overly long sentences), but none of these worked for me (see the sketch below for what I mean).
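For reference, these workarounds amount to hydra overrides along the lines of the following (a sketch only; the config path and run directory are the placeholders from the command above, and the max_tokens value is illustrative, not a recommendation):

# Hypothetical example: switch the DDP backend and cap batches by token count instead of sentence count.
fairseq-hydra-train --config-dir /my/config --config-name myconfig.yaml \
  hydra.run.dir=../saved_model/20220311_1 common.user_dir=`pwd` \
  distributed_training.ddp_backend=no_c10d \
  distributed_training.distributed_world_size=4 distributed_training.nprocs_per_node=4 \
  dataset.max_tokens=1000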
What's your environment?
- fairseq Version: 1.0.0a0
- PyTorch Version: 1.10.0
- OS: Ubuntu 20.04.2 LTS
- How you installed fairseq (pip, source): pip
- Python version: 3.8.12
- CUDA version: 10.1
- GPU models and configuration: NVIDIA Tesla P40 (22919 MiB) x 4
- Any other relevant information: NVIDIA Driver Version: 470.94
Thanks in advance for your comment!
All the best, An Hsu
Hi,
What dataset.max_tokens did you set? Also, what is the maximum utterance length in your dataset? You can try removing the very long utterances from training by setting task.max_sample_size, if there aren't too many of them in your data.
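In case it helps, both options can be passed as hydra overrides on the command line (or set in the finetuning YAML). This is only a sketch: the numeric values are illustrative and need to be tuned to your data and GPU memory.

# Hypothetical values: cap each batch at 1000 frames and drop utterances longer than 500 frames.
fairseq-hydra-train --config-dir /my/config --config-name myconfig.yaml \
  dataset.max_tokens=1000 \
  task.max_sample_size=500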
I hit an OOM error halfway through the fifth epoch, and resuming from the checkpoint saved before it also hits an OOM error right away. Have you resolved it?