
OOM when finetuning using multiple GPUs


What is your question?

Dear authors, thanks a lot for this great work! I'm getting OOM errors while finetuning AV-HuBERT on my own dataset using multiple GPUs, and the error usually happens on a non-initial epoch. I launch training with:

fairseq-hydra-train --config-dir /my/config --config-name myconfig.yaml \
  hydra.run.dir=../saved_model/20220311_1 \
  common.user_dir=`pwd` \
  distributed_training.ddp_backend=c10d \
  distributed_training.distributed_world_size=4 \
  distributed_training.nprocs_per_node=4
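For reference, a minimal way to watch per-GPU memory evolve across epochs while the job runs is plain nvidia-smi polling (generic NVIDIA tooling, nothing fairseq-specific):

# Poll per-GPU memory usage once per second during training.
nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv -l 1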

The OOM happens randomly on one GPU:

2022-03-18 21:04:26 | WARNING | fairseq.trainer | OOM: Ran out of memory with exception: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 2; 22.38 GiB total capacity; 21.16 GiB already allocated; 19.94 MiB free; 21.54 GiB reserved in total by PyTorch)
2022-03-18 21:04:26 | WARNING | fairseq.trainer | |===========================================================================|
|                  PyTorch CUDA memory summary, device ID 0                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 0         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |       0 B  |       0 B  |       0 B  |       0 B  |
|       from large pool |       0 B  |       0 B  |       0 B  |       0 B  |
|       from small pool |       0 B  |       0 B  |       0 B  |       0 B  |
|---------------------------------------------------------------------------|
| Active memory         |       0 B  |       0 B  |       0 B  |       0 B  |
|       from large pool |       0 B  |       0 B  |       0 B  |       0 B  |
|       from small pool |       0 B  |       0 B  |       0 B  |       0 B  |
|---------------------------------------------------------------------------|
| GPU reserved memory   |       0 B  |       0 B  |       0 B  |       0 B  |
|       from large pool |       0 B  |       0 B  |       0 B  |       0 B  |
|       from small pool |       0 B  |       0 B  |       0 B  |       0 B  |
|---------------------------------------------------------------------------|
| Non-releasable memory |       0 B  |       0 B  |       0 B  |       0 B  |
|       from large pool |       0 B  |       0 B  |       0 B  |       0 B  |
|       from small pool |       0 B  |       0 B  |       0 B  |       0 B  |
|---------------------------------------------------------------------------|
| Allocations           |       0    |       0    |       0    |       0    |
|       from large pool |       0    |       0    |       0    |       0    |
|       from small pool |       0    |       0    |       0    |       0    |
|---------------------------------------------------------------------------|
| Active allocs         |       0    |       0    |       0    |       0    |
|       from large pool |       0    |       0    |       0    |       0    |
|       from small pool |       0    |       0    |       0    |       0    |
|---------------------------------------------------------------------------|
| GPU reserved segments |       0    |       0    |       0    |       0    |
|       from large pool |       0    |       0    |       0    |       0    |
|       from small pool |       0    |       0    |       0    |       0    |
|---------------------------------------------------------------------------|
| Non-releasable allocs |       0    |       0    |       0    |       0    |
|       from large pool |       0    |       0    |       0    |       0    |
|       from small pool |       0    |       0    |       0    |       0    |
|===========================================================================|

(The memory summary for device ID 1 is an all-zero table identical to the one for device ID 0 and is omitted here.)

2022-03-18 21:04:26 | WARNING | fairseq.trainer | |===========================================================================|
|                  PyTorch CUDA memory summary, device ID 2                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 1            |        cudaMalloc retries: 8         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |   21660 MB |   21678 MB |   93887 GB |   93865 GB |
|       from large pool |   21640 MB |   21663 MB |   93001 GB |   92980 GB |
|       from small pool |      19 MB |      19 MB |     885 GB |     885 GB |
|---------------------------------------------------------------------------|
| Active memory         |   21660 MB |   21678 MB |   93887 GB |   93865 GB |
|       from large pool |   21640 MB |   21663 MB |   93001 GB |   92980 GB |
|       from small pool |      19 MB |      19 MB |     885 GB |     885 GB |
|---------------------------------------------------------------------------|
| GPU reserved memory   |   22062 MB |   22078 MB |   61016 MB |   38954 MB |
|       from large pool |   22040 MB |   22060 MB |   60488 MB |   38448 MB |
|       from small pool |      22 MB |     176 MB |     528 MB |     506 MB |
|---------------------------------------------------------------------------|
| Non-releasable memory |  411642 KB |    7842 MB |  189965 GB |  189965 GB |
|       from large pool |  409546 KB |    7828 MB |  188976 GB |  188976 GB |
|       from small pool |    2096 KB |      14 MB |     989 GB |     989 GB |
|---------------------------------------------------------------------------|
| Allocations           |    1810    |    1879    |   28459 K  |   28457 K  |
|       from large pool |     660    |     662    |    6158 K  |    6157 K  |
|       from small pool |    1150    |    1299    |   22300 K  |   22299 K  |
|---------------------------------------------------------------------------|
| Active allocs         |    1810    |    1879    |   28459 K  |   28457 K  |
|       from large pool |     660    |     662    |    6158 K  |    6157 K  |
|       from small pool |    1150    |    1299    |   22300 K  |   22299 K  |
|---------------------------------------------------------------------------|
| GPU reserved segments |     173    |     244    |     572    |     399    |
|       from large pool |     162    |     163    |     308    |     146    |
|       from small pool |      11    |      88    |     264    |     253    |
|---------------------------------------------------------------------------|
| Non-releasable allocs |     144    |     214    |    9561 K  |    9560 K  |
|       from large pool |     135    |     161    |    2333 K  |    2333 K  |
|       from small pool |       9    |      58    |    7227 K  |    7227 K  |
|===========================================================================|

(The memory summary for device ID 3 is likewise an all-zero table and is omitted here.)

I have tried both no_c10d and pytorch_ddp as the ddp_backend, tried downgrading PyTorch to 1.9.1 and 1.8.0 according to this issue, and also checked my dataset (using max_tokens instead of batch_size so that long sentences cannot inflate a batch), but none of these worked for me.
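For concreteness, the overrides tried above look roughly like this in fairseq's Hydra syntax (a sketch; the max_tokens value is an illustrative placeholder, not my actual setting):

# Switching the distributed backend (tried both):
fairseq-hydra-train ... distributed_training.ddp_backend=no_c10d
fairseq-hydra-train ... distributed_training.ddp_backend=pytorch_ddp

# Batching by token count instead of a fixed sentence count,
# so a single long utterance cannot blow up a batch:
fairseq-hydra-train ... dataset.max_tokens=1000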

What's your environment?

  • fairseq Version: 1.0.0a0
  • PyTorch Version: 1.10.0
  • OS: Ubuntu 20.04.2 LTS
  • How you installed fairseq (pip, source): pip
  • Python version: 3.8.12
  • CUDA version: 10.1
  • GPU models and configuration: 4× NVIDIA Tesla P40 (22919 MiB each)
  • Any other relevant information: NVIDIA Driver Version 470.94

Thanks in advance for your comment!

All the best, An Hsu

xuan97916, Mar 21 '22

Hi,

What dataset.max_tokens did you set? Also, what is the maximum utterance length in your dataset? If the very long utterances aren't too numerous, you can exclude them from training by setting task.max_sample_size.
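A sketch of how those two options can be passed as Hydra overrides (the numbers below are illustrative placeholders, not recommendations):

# Illustrative values only: cap batches by token count and
# exclude utterances longer than max_sample_size from training.
fairseq-hydra-train --config-dir /my/config --config-name myconfig.yaml \
  dataset.max_tokens=1000 \
  task.max_sample_size=500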

chevalierNoir, Mar 21 '22

I hit an OOM error halfway through the fifth epoch, and when I resume training from the checkpoint saved before the crash, I immediately get an OOM error again. Have you resolved this?
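For context, resuming in fairseq typically uses a checkpoint override along these lines (a sketch; the path is a placeholder, not my actual directory):

# Hypothetical resume command; the checkpoint path is a placeholder.
fairseq-hydra-train --config-dir /my/config --config-name myconfig.yaml \
  checkpoint.restore_file=checkpoints/checkpoint_last.pt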

s2485523800, Mar 24 '24