LibriSpeech Conformer Workload OOMs or NCCL Errors When Run With Multiple Trials
We consistently observe an OOM error when running one of the NAdamW baselines on LibriSpeech Conformer with multiple trials in PyTorch on 8 V100s with 16 GB each. This is run under the external ruleset. The first trial runs through successfully, but any subsequent trial will OOM.
If we instead try to resume a multi-trial run, we observe an NCCL error. This occurs even if we delete the trial_2 folder (while leaving the trial_1 folder intact).
Description
As discussed above, we observe an OOM when running LibriSpeech Conformer with the NAdamW baseline with multiple trials on 8 V100s with 16 GB each. This is an example of the OOM we observe on a subsequent trial:
I0229 11:25:52.179392 139670264227648 submission_runner.py:314] Starting training loop.
I0229 11:25:54.777722 139642405861120 logging_writer.py:48] [0] global_step=0, grad_norm=56.284924, loss=31.354658
I0229 11:25:54.783623 139670264227648 pytorch_nadamw_full_budget.py:296] 0) loss = 31.355, grad_norm = 56.285
I0229 11:25:55.234757 139670264227648 spec.py:321] Evaluating on the training split.
I0229 11:26:08.470401 139670264227648 spec.py:333] Evaluating on the validation split.
I0229 11:26:20.220777 139670264227648 spec.py:349] Evaluating on the test split.
I0229 11:26:26.181465 139670264227648 submission_runner.py:414] Time since start: 34.00s, Step: 1, {'train/ctc_loss': 30.998378480849826, 'train/wer': 1.253173088944333, 'validation/ctc_loss': 29.436887863844042, 'validation/wer': 1.157427702409115, 'validation/num_examples': 5348, 'test/ctc_loss': 29.538330745441247, 'test/wer': 1.1866837283935572, 'test/num_examples': 2472, 'score': 2.605729579925537, 'total_duration': 34.00228404998779, 'accumulated_submission_time': 2.605729579925537, 'accumulated_eval_time': 30.946528434753418, 'accumulated_logging_time': 0}
I0229 11:26:26.201718 139642405861120 logging_writer.py:48] [1] accumulated_eval_time=30.946528, accumulated_logging_time=0, accumulated_submission_time=2.605730, global_step=1, preemption_count=0, score=2.605730, test/ctc_loss=29.538331, test/num_examples=2472, test/wer=1.186684, total_duration=34.002284, train/ctc_loss=30.998378, train/wer=1.253173, validation/ctc_loss=29.436888, validation/num_examples=5348, validation/wer=1.157428
I0229 11:26:26.931042 139670264227648 checkpoint_utils.py:240] Saved checkpoint to /experiment_runs/nadamw_test/librispeech_conformer_pytorch/trial_4/checkpoint_1.
I0229 11:26:28.782347 139642397468416 logging_writer.py:48] [1] global_step=1, grad_norm=60.090302, loss=31.114534
I0229 11:26:28.785351 139670264227648 pytorch_nadamw_full_budget.py:296] 1) loss = 31.115, grad_norm = 60.090
I0229 11:26:30.031821 139642405861120 logging_writer.py:48] [2] global_step=2, grad_norm=72.279572, loss=30.439383
I0229 11:26:30.035584 139670264227648 pytorch_nadamw_full_budget.py:296] 2) loss = 30.439, grad_norm = 72.280
I0229 11:26:30.954263 139642397468416 logging_writer.py:48] [3] global_step=3, grad_norm=108.894348, loss=29.346493
I0229 11:26:30.957295 139670264227648 pytorch_nadamw_full_budget.py:296] 3) loss = 29.346, grad_norm = 108.894
Traceback (most recent call last):
File "submission_runner.py", line 697, in <module>
app.run(main)
File "/usr/local/lib/python3.8/dist-packages/absl/app.py", line 308, in run
_run_main(main, args)
File "/usr/local/lib/python3.8/dist-packages/absl/app.py", line 254, in _run_main
sys.exit(main(argv))
File "submission_runner.py", line 665, in main
score = score_submission_on_workload(
File "submission_runner.py", line 576, in score_submission_on_workload
timing, metrics = train_once(workload, workload_name,
File "submission_runner.py", line 336, in train_once
optimizer_state, model_params, model_state = update_params(
File "/algorithmic-efficiency/submissions/baseline_submission/pytorch_nadamw_full_budget.py", line 276, in update_params
loss.backward()
File "/usr/local/lib/python3.8/dist-packages/torch/_tensor.py", line 492, in backward
torch.autograd.backward(
File "/usr/local/lib/python3.8/dist-packages/torch/autograd/__init__.py", line 251, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.51 GiB. GPU 4 has a total capacty of 15.77 GiB of which 865.31 MiB is free. Process 443130 has 14.92 GiB memory in use. Of the allocated memory 6.25 GiB is allocated by PyTorch, and 7.69 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
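The error message itself points at allocator fragmentation and PYTORCH_CUDA_ALLOC_CONF. For completeness, a minimal sketch of how that setting could be tried (the max_split_size_mb value below is an arbitrary guess on our side, and we have not verified that it helps for this workload):

# Illustrative only: the allocator config has to be in place before the CUDA
# allocator is first used, so set it before any CUDA work is done (or export
# the variable in the shell before launching torchrun).
import os
os.environ.setdefault('PYTORCH_CUDA_ALLOC_CONF', 'max_split_size_mb:128')

import torch  # imported after setting the env var so the allocator picks it up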
Alternatively, if we try to resume the multi-trial run, we observe the following NCCL error:
I0225 07:05:33.506379 140182089410368 submission_runner.py:589] Timing: 61068.13836145401
I0225 07:05:33.506450 140182089410368 submission_runner.py:591] Total number of evals: 47
I0225 07:05:33.506514 140182089410368 submission_runner.py:592] ====================
I0225 07:05:33.506577 140182089410368 submission_runner.py:545] Using RNG seed 1817859550
I0225 07:05:33.507572 140182089410368 submission_runner.py:554] --- Tuning run 2/5 ---
I0225 07:05:33.507675 140182089410368 submission_runner.py:559] Creating tuning directory at /experiment_runs/nadamw_baseline/librispeech_conformer_pytorch/trial_2.
I0225 07:05:33.507940 140182089410368 logger_utils.py:92] Saving hparams to /experiment_runs/nadamw_baseline/librispeech_conformer_pytorch/trial_2/hparams.json.
I0225 07:05:33.508602 140182089410368 submission_runner.py:206] Initializing dataset.
I0225 07:05:33.508736 140182089410368 input_pipeline.py:20] Loading split = train-clean-100
I0225 07:05:33.534054 140182089410368 input_pipeline.py:20] Loading split = train-clean-360
I0225 07:05:33.866440 140182089410368 input_pipeline.py:20] Loading split = train-other-500
I0225 07:05:34.242223 140182089410368 submission_runner.py:213] Initializing model.
[E ProcessGroupNCCL.cpp:474] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=5, OpType=BROADCAST, NumelIn=4918, NumelOut=4918, Timeout(ms)=1800000) ran for 1800075 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:474] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=5, OpType=BROADCAST, NumelIn=4918, NumelOut=4918, Timeout(ms)=1800000) ran for 1800070 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:474] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=5, OpType=BROADCAST, NumelIn=4918, NumelOut=4918, Timeout(ms)=1800000) ran for 1800128 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:474] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=5, OpType=BROADCAST, NumelIn=4918, NumelOut=4918, Timeout(ms)=1800000) ran for 1800597 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:474] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=5, OpType=BROADCAST, NumelIn=4918, NumelOut=4918, Timeout(ms)=1800000) ran for 1800573 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:474] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=5, OpType=BROADCAST, NumelIn=4918, NumelOut=4918, Timeout(ms)=1800000) ran for 1800614 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:488] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:494] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:915] [Rank 2] NCCL watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=5, OpType=BROADCAST, NumelIn=4918, NumelOut=4918, Timeout(ms)=1800000) ran for 1800128 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 2] NCCL watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=5, OpType=BROADCAST, NumelIn=4918, NumelOut=4918, Timeout(ms)=1800000) ran for 1800128 milliseconds before timing out.
Fatal Python error: Aborted
cc @anana10c @mikerabbat @tsunghsienlee @yuchenhao @shintaro-iwasaki
Steps to Reproduce
In the Docker container, run:
torchrun --redirects 1:0,2:0,3:0,4:0,5:0,6:0,7:0 \
--standalone \
--nnodes=1 \
--nproc_per_node=8 \
submission_runner.py \
--framework=pytorch \
--data_dir=/data/librispeech/ \
--workload=librispeech_conformer \
--experiment_dir=/experiment_runs \
--experiment_name=librispeech_conformer_baseline \
--submission_path=reference_algorithms/paper_baselines/nadamw/pytorch/submission.py \
--tuning_search_space=reference_algorithms/paper_baselines/nadamw/tuning_search_space.json \
--librispeech_tokenizer_vocab_path=/data/librispeech/spm_model.vocab \
--num_tuning_trials=5
Source or Possible Fix
We are not aware of a possible fix for this issue. We suspect there may be a memory leak in the PyTorch LibriSpeech Conformer workload. Please let us know how to proceed. Thanks in advance!
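One way to check the memory-leak hypothesis would be to log allocator statistics at trial boundaries. A minimal diagnostic sketch (the helper name and where to call it are our suggestion, not existing code in the repo):

import logging
import torch

def log_cuda_memory(tag):
  """Log per-device allocator statistics, e.g. right after a trial finishes."""
  for device in range(torch.cuda.device_count()):
    allocated_gib = torch.cuda.memory_allocated(device) / 2**30
    reserved_gib = torch.cuda.memory_reserved(device) / 2**30
    logging.info('%s: GPU %d allocated=%.2f GiB, reserved=%.2f GiB',
                 tag, device, allocated_gib, reserved_gib)

If the memory reported here keeps growing from trial to trial, something is holding references to model or optimizer state across trials.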
What is the version of the GPU driver? Can you print the output of nvidia-smi in your environment?
We had a similar issue a few months ago https://github.com/mlcommons/algorithmic-efficiency/issues/497
Our error is deterministically reproducible.
Here is the nvidia-smi output in the Docker container:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14 Driver Version: 550.54.14 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 Tesla V100-SXM2-16GB Off | 00000000:00:17.0 Off | 0 |
| N/A 65C P0 296W / 300W | 7213MiB / 16384MiB | 81% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 Tesla V100-SXM2-16GB Off | 00000000:00:18.0 Off | 0 |
| N/A 54C P0 237W / 300W | 9027MiB / 16384MiB | 82% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 2 Tesla V100-SXM2-16GB Off | 00000000:00:19.0 Off | 0 |
| N/A 53C P0 146W / 300W | 8951MiB / 16384MiB | 95% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 3 Tesla V100-SXM2-16GB Off | 00000000:00:1A.0 Off | 0 |
| N/A 64C P0 103W / 300W | 8921MiB / 16384MiB | 88% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 4 Tesla V100-SXM2-16GB Off | 00000000:00:1B.0 Off | 0 |
| N/A 58C P0 79W / 300W | 8991MiB / 16384MiB | 79% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 5 Tesla V100-SXM2-16GB Off | 00000000:00:1C.0 Off | 0 |
| N/A 57C P0 80W / 300W | 9007MiB / 16384MiB | 73% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 6 Tesla V100-SXM2-16GB Off | 00000000:00:1D.0 Off | 0 |
| N/A 52C P0 73W / 300W | 9091MiB / 16384MiB | 63% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 7 Tesla V100-SXM2-16GB Off | 00000000:00:1E.0 Off | 0 |
| N/A 60C P0 82W / 300W | 9053MiB / 16384MiB | 93% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
+-----------------------------------------------------------------------------------------+
Ok I see, sorry I did not catch that it was consistently OOMing.
We can try to clear the cache between trials.
Could you try adding _reset_cuda_mem() in submission_runner.py at line 542 (https://github.com/mlcommons/algorithmic-efficiency/blob/main/submission_runner.py#L542) and let me know if that resolves the issue?
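For context, a reset helper of this kind typically garbage-collects Python objects and returns cached allocator blocks to the driver between trials; a rough sketch of what _reset_cuda_mem() does (the exact implementation in submission_runner.py may differ):

import gc
import torch

def _reset_cuda_mem():
  # Drop dangling Python references first, then release cached allocator
  # blocks and reset the peak-memory counters before the next tuning trial.
  gc.collect()
  torch.cuda.empty_cache()
  torch.cuda.reset_peak_memory_stats()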
@chandramouli-sastry @pomonam @msaroufim do you have any other ideas?
Thank you for the suggestion! We've confirmed that adding _reset_cuda_mem() solves this issue. However, we are now seeing that trial 5 consistently OOMs, even when run on its own (and regardless of whether _reset_cuda_mem() is added). Here's what the error looks like:
I0301 22:57:47.301353 140438874752832 submission_runner.py:314] Starting training loop.
I0301 22:57:51.364732 140412236850944 logging_writer.py:48] [0] global_step=0, grad_norm=35.902340, loss=31.931053
I0301 22:57:51.381769 140438874752832 pytorch_nadamw_full_budget.py:296] 0) loss = 31.931, grad_norm = 35.902
I0301 22:57:51.775843 140438874752832 spec.py:321] Evaluating on the training split.
I0301 22:57:51.776754 140438874752832 input_pipeline.py:20] Loading split = train-clean-100
I0301 22:57:51.803838 140438874752832 input_pipeline.py:20] Loading split = train-clean-360
I0301 22:57:51.906127 140438874752832 input_pipeline.py:20] Loading split = train-other-500
I0301 22:58:06.571050 140438874752832 spec.py:333] Evaluating on the validation split.
I0301 22:58:06.572633 140438874752832 input_pipeline.py:20] Loading split = dev-clean
I0301 22:58:06.576136 140438874752832 input_pipeline.py:20] Loading split = dev-other
I0301 22:58:17.383088 140438874752832 spec.py:349] Evaluating on the test split.
I0301 22:58:17.384322 140438874752832 input_pipeline.py:20] Loading split = test-clean
I0301 22:58:22.858831 140438874752832 submission_runner.py:414] Time since start: 35.56s, Step: 1, {'train/ctc_loss': 31.21902909067492, 'train/wer': 2.106045848546352, 'validation/ctc_loss': 30.200058752570424, 'validation/wer': 1.494539661082412, 'validation/num_examples': 5348, 'test/ctc_loss': 30.30062109410089, 'test/wer': 1.5860296955294213, 'test/num_examples': 2472, 'score': 4.081728935241699, 'total_duration': 35.55763244628906, 'accumulated_submission_time': 4.081728935241699, 'accumulated_eval_time': 31.082834243774414, 'accumulated_logging_time': 0}
I0301 22:58:22.872175 140410752063232 logging_writer.py:48] [1] accumulated_eval_time=31.082834, accumulated_logging_time=0, accumulated_submission_time=4.081729, global_step=1, preemption_count=0, score=4.081729, test/ctc_loss=30.300621, test/num_examples=2472, test/wer=1.586030, total_duration=35.557632, train/ctc_loss=31.219029, train/wer=2.106046, validation/ctc_loss=30.200059, validation/num_examples=5348, validation/wer=1.494540
I0301 22:58:23.482473 140438874752832 checkpoint_utils.py:240] Saved checkpoint to /experiment_runs/shampoo/tmp/librispeech_conformer_pytorch/trial_5/checkpoint_1.
I0301 22:58:25.199719 140410533967616 logging_writer.py:48] [1] global_step=1, grad_norm=35.834045, loss=31.748371
I0301 22:58:25.202767 140438874752832 pytorch_nadamw_full_budget.py:296] 1) loss = 31.748, grad_norm = 35.834
I0301 22:58:26.362531 140410752063232 logging_writer.py:48] [2] global_step=2, grad_norm=43.194149, loss=31.490442
I0301 22:58:26.365588 140438874752832 pytorch_nadamw_full_budget.py:296] 2) loss = 31.490, grad_norm = 43.194
Traceback (most recent call last):
File "submission_runner.py", line 698, in <module>
app.run(main)
File "/usr/local/lib/python3.8/dist-packages/absl/app.py", line 308, in run
_run_main(main, args)
File "/usr/local/lib/python3.8/dist-packages/absl/app.py", line 254, in _run_main
sys.exit(main(argv))
File "submission_runner.py", line 666, in main
score = score_submission_on_workload(
File "submission_runner.py", line 577, in score_submission_on_workload
timing, metrics = train_once(workload, workload_name,
File "submission_runner.py", line 336, in train_once
optimizer_state, model_params, model_state = update_params(
File "/algorithmic-efficiency/submissions/baseline_submission/pytorch_nadamw_full_budget.py", line 276, in update_params
loss.backward()
File "/usr/local/lib/python3.8/dist-packages/torch/_tensor.py", line 492, in backward
torch.autograd.backward(
File "/usr/local/lib/python3.8/dist-packages/torch/autograd/__init__.py", line 251, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.51 GiB. GPU 4 has a total capacty of 15.77 GiB of which 767.31 MiB is free. Process 610974 has 15.01 GiB memory in use. Of the allocated memory 6.26 GiB is allocated by PyTorch, and 7.79 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
This trial corresponds to the following hyperparameter set:
{
"dropout_rate": 0.1,
"label_smoothing": 0.0,
"learning_rate": 0.0017486387539278373,
"one_minus_beta1": 0.06733926164,
"beta2": 0.9955159689799007,
"weight_decay": 0.08121616522670176,
"warmup_factor": 0.02
}
We observed that setting dropout_rate to 0.0 avoids the OOM, as does halving the batch size. However, we haven't been able to run the baseline with this set of hyperparameters as-is.
Ok, thanks for the update. I will make a fix to reset the CUDA memory in our repo. We will try to reproduce and investigate the dropout issue on our end.
Also, I assume you're running these to produce some baseline logs? I checked in logs on our dev branch under prize_qualification_baselines/external_tuning/logs. FYI, these were produced with JAX, which is probably why we haven't noticed this issue. But I hope this will unblock you for the time being.
Thank you so much! Yes, we wanted to sanity check the baseline on our setup. Thank you for pointing us towards the logs, they will indeed be helpful :)
It seems like @pomonam was unable to reproduce this issue with dropout.
We probably have a fix, though: could you modify https://github.com/mlcommons/algorithmic-efficiency/blob/main/algorithmic_efficiency/workloads/librispeech_conformer/librispeech_pytorch/models.py and pass inplace=True in the nn.Dropout initializations? This should reduce the memory footprint of the dropout layers.
If this fixes things for you, we'll make this change as well.
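For illustration, the change amounts to passing inplace=True wherever the model constructs its dropout modules; a minimal sketch (the class below is a placeholder, not the actual code in models.py):

import torch.nn as nn

class FeedForwardBlock(nn.Module):  # placeholder module, not the real Conformer block
  def __init__(self, dim, dropout_rate):
    super().__init__()
    self.linear = nn.Linear(dim, dim)
    # inplace=True lets dropout overwrite its input activation instead of
    # allocating a second tensor, lowering peak memory during training.
    self.dropout = nn.Dropout(p=dropout_rate, inplace=True)

  def forward(self, x):
    return self.dropout(self.linear(x))

Note that inplace dropout is only safe when the tensor it overwrites is not needed elsewhere for the backward pass; that holds for the output of nn.Linear here, since Linear's backward only needs its input and weight.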
From an offline discussion with @anana10c and team, it sounds like setting inplace=True resolved the OOM issue.
@pomonam could you send over a PR with:
- a _reset_cuda_mem() call at the beginning of each trial, and
- inplace=True in the PyTorch nn.Dropout layer initializations?
Sorry for the delayed response - unfortunately, we've found that while inplace=True lets the trial 5 run continue for longer, it still ends up OOMing at around iteration 3600.