
RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling `cublasCreate(handle)`

w4-donguk opened this issue 4 years ago • 5 comments

❓ Questions & Help

Hello. While running the training code for the KsponSpeech dataset with the conformer-lstm model, I ran into the error below, so I'm posting this question.

When I first ran with the default batch size of 32 I got a CUDA out-of-memory error, so I reduced the batch size to 1, and now the error below occurs.

There were suggestions that this could be a pytorch-lightning version issue, so I tested with 1.2.7, 1.3.0, and 1.4.5, but all of them produce the same error.

I ran it with the following command:

HYDRA_FULL_ERROR=1 \
python ./openspeech_cli/hydra_train.py \
        dataset=ksponspeech \
        dataset.dataset_path=$DATASET_PATH \
        dataset.manifest_file_path=$MANIFEST_FILE_PATH \
        dataset.test_dataset_path=$DATASET_PATH \
        dataset.test_manifest_dir=$TEST_MANIFEST_FILE_PATH \
        tokenizer=kspon_character \
        model=conformer_lstm \
        audio=melspectrogram \
        lr_scheduler=warmup_reduce_lr_on_plateau \
        trainer=gpu \
        criterion=joint_ctc_cross_entropy
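
For reference, the batch-size-1 run mentioned above would presumably append one more Hydra override to this same command; the key name below is an assumption about openspeech's trainer config, not something confirmed in the thread:

        trainer.batch_size=1    # hypothetical key, added alongside trainer=gpu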

The error log is as follows:

wandb: Run data is saved locally in /home/donguk/Workspace/openspeech/outputs/2021-09-15/11-29-52/wandb/run-20210915_113001-1ic3zpv5
wandb: Run `wandb offline` to turn off syncing.


  | Name      | Type                     | Params
-------------------------------------------------------
0 | criterion | JointCTCCrossEntropyLoss | 0
1 | encoder   | ConformerEncoder         | 115 M
2 | decoder   | LSTMAttentionDecoder     | 7.3 M
-------------------------------------------------------
123 M     Trainable params
0         Non-trainable params
123 M     Total params
492.532   Total estimated model params size (MB)
Global seed set to 1
Epoch 0:   0%|                                                                                                                                                                                                                                                 | 0/1242045 [00:00<03:02, 6820.01it/s]/home/donguk/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:405: LightningDeprecationWarning: One of the returned values {'learning_rate', 'predictions', 'cross_entropy_loss', 'targets', 'logits', 'ctc_loss'} has a `grad_fn`. We will detach it automatically but this behaviour will change in v1.6. Please detach it manually: `return {'loss': ..., 'something': something.detach()}`
  warning_cache.deprecation(
Error executing job with overrides: ['dataset=ksponspeech', 'dataset.dataset_path=/home/donguk/nas0/poodle/speech_dataset/wav/aihub2019', 'dataset.manifest_file_path=/home/donguk/Workspace/openspeech/openspeech/datasets/ksponspeech/kspon_manifest', 'dataset.test_dataset_path=/home/donguk/nas0/poodle/speech_dataset/wav/aihub2019', 'dataset.test_manifest_dir=/home/donguk/Workspace/openspeech/openspeech/datasets/ksponspeech/test_manifest', 'tokenizer=kspon_character', 'model=conformer_lstm', 'audio=melspectrogram', 'lr_scheduler=warmup_reduce_lr_on_plateau', 'trainer=gpu', 'criterion=joint_ctc_cross_entropy']
Traceback (most recent call last):
  File "/home/donguk/Workspace/openspeech/./openspeech_cli/hydra_train.py", line 62, in <module>
    hydra_main()
  File "/home/donguk/miniconda3/lib/python3.9/site-packages/hydra/main.py", line 48, in decorated_main
    _run_hydra(
  File "/home/donguk/miniconda3/lib/python3.9/site-packages/hydra/_internal/utils.py", line 377, in _run_hydra
    run_and_report(
  File "/home/donguk/miniconda3/lib/python3.9/site-packages/hydra/_internal/utils.py", line 214, in run_and_report
    raise ex
  File "/home/donguk/miniconda3/lib/python3.9/site-packages/hydra/_internal/utils.py", line 211, in run_and_report
    return func()
  File "/home/donguk/miniconda3/lib/python3.9/site-packages/hydra/_internal/utils.py", line 378, in <lambda>
    lambda: hydra.run(
  File "/home/donguk/miniconda3/lib/python3.9/site-packages/hydra/_internal/hydra.py", line 111, in run
    _ = ret.return_value
  File "/home/donguk/miniconda3/lib/python3.9/site-packages/hydra/core/utils.py", line 233, in return_value
    raise self._return_value
  File "/home/donguk/miniconda3/lib/python3.9/site-packages/hydra/core/utils.py", line 160, in run_job
    ret.return_value = task_function(task_cfg)
  File "/home/donguk/Workspace/openspeech/./openspeech_cli/hydra_train.py", line 56, in hydra_main
    trainer.fit(model, data_module)
  File "/home/donguk/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 552, in fit
    self._run(model)
  File "/home/donguk/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 917, in _run
    self._dispatch()
  File "/home/donguk/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 985, in _dispatch
    self.accelerator.start_training(self)
  File "/home/donguk/miniconda3/lib/python3.9/site-packages/pytorch_lightning/accelerators/accelerator.py", line 92, in start_training
    self.training_type_plugin.start_training(trainer)
  File "/home/donguk/miniconda3/lib/python3.9/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 161, in start_training
    self._results = trainer.run_stage()
  File "/home/donguk/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 995, in run_stage
    return self._run_train()
  File "/home/donguk/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1044, in _run_train
    self.fit_loop.run()
  File "/home/donguk/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/base.py", line 111, in run
    self.advance(*args, **kwargs)
  File "/home/donguk/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/fit_loop.py", line 200, in advance
    epoch_output = self.epoch_loop.run(train_dataloader)
  File "/home/donguk/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/base.py", line 111, in run
    self.advance(*args, **kwargs)
  File "/home/donguk/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 130, in advance
    batch_output = self.batch_loop.run(batch, self.iteration_count, self._dataloader_idx)
  File "/home/donguk/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 100, in run
    super().run(batch, batch_idx, dataloader_idx)
  File "/home/donguk/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/base.py", line 111, in run
    self.advance(*args, **kwargs)
  File "/home/donguk/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 147, in advance
    result = self._run_optimization(batch_idx, split_batch, opt_idx, optimizer)
  File "/home/donguk/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 201, in _run_optimization
    self._optimizer_step(optimizer, opt_idx, batch_idx, closure)
  File "/home/donguk/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 395, in _optimizer_step
    model_ref.optimizer_step(
  File "/home/donguk/miniconda3/lib/python3.9/site-packages/pytorch_lightning/core/lightning.py", line 1618, in optimizer_step
    optimizer.step(closure=optimizer_closure)
  File "/home/donguk/miniconda3/lib/python3.9/site-packages/pytorch_lightning/core/optimizer.py", line 209, in step
    self.__optimizer_step(*args, closure=closure, profiler_name=profiler_name, **kwargs)
  File "/home/donguk/miniconda3/lib/python3.9/site-packages/pytorch_lightning/core/optimizer.py", line 129, in __optimizer_step
    trainer.accelerator.optimizer_step(optimizer, self._optimizer_idx, lambda_closure=closure, **kwargs)
  File "/home/donguk/miniconda3/lib/python3.9/site-packages/pytorch_lightning/accelerators/accelerator.py", line 296, in optimizer_step
    self.run_optimizer_step(optimizer, opt_idx, lambda_closure, **kwargs)
  File "/home/donguk/miniconda3/lib/python3.9/site-packages/pytorch_lightning/accelerators/accelerator.py", line 303, in run_optimizer_step
    self.training_type_plugin.optimizer_step(optimizer, lambda_closure=lambda_closure, **kwargs)
  File "/home/donguk/miniconda3/lib/python3.9/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 226, in optimizer_step
    optimizer.step(closure=lambda_closure, **kwargs)
  File "/home/donguk/miniconda3/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 26, in decorate_context
    return func(*args, **kwargs)
  File "/home/donguk/miniconda3/lib/python3.9/site-packages/torch/optim/adam.py", line 66, in step
    loss = closure()
  File "/home/donguk/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 235, in _training_step_and_backward_closure
    result = self.training_step_and_backward(split_batch, batch_idx, opt_idx, optimizer, hiddens)
  File "/home/donguk/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 548, in training_step_and_backward
    self.backward(result, optimizer, opt_idx)
  File "/home/donguk/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 589, in backward
    result.closure_loss = self.trainer.accelerator.backward(result.closure_loss, optimizer, *args, **kwargs)
  File "/home/donguk/miniconda3/lib/python3.9/site-packages/pytorch_lightning/accelerators/accelerator.py", line 276, in backward
    self.precision_plugin.backward(self.lightning_module, closure_loss, *args, **kwargs)
  File "/home/donguk/miniconda3/lib/python3.9/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 78, in backward
    model.backward(closure_loss, optimizer, *args, **kwargs)
  File "/home/donguk/miniconda3/lib/python3.9/site-packages/pytorch_lightning/core/lightning.py", line 1481, in backward
    loss.backward(*args, **kwargs)
  File "/home/donguk/miniconda3/lib/python3.9/site-packages/torch/tensor.py", line 221, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/donguk/miniconda3/lib/python3.9/site-packages/torch/autograd/__init__.py", line 130, in backward
    Variable._execution_engine.run_backward(
RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling `cublasCreate(handle)`

When I set CUDA_LAUNCH_BLOCKING=1 for debugging and run again, the output is as follows:
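
Concretely, that amounts to prefixing the original command with the environment variable (a reconstruction of the invocation, not copied from the report; CUDA_LAUNCH_BLOCKING=1 forces kernel launches to run synchronously, so the failing op is reported at its true call site):

CUDA_LAUNCH_BLOCKING=1 \
HYDRA_FULL_ERROR=1 \
python ./openspeech_cli/hydra_train.py \
        dataset=ksponspeech \
        dataset.dataset_path=$DATASET_PATH \
        dataset.manifest_file_path=$MANIFEST_FILE_PATH \
        dataset.test_dataset_path=$DATASET_PATH \
        dataset.test_manifest_dir=$TEST_MANIFEST_FILE_PATH \
        tokenizer=kspon_character \
        model=conformer_lstm \
        audio=melspectrogram \
        lr_scheduler=warmup_reduce_lr_on_plateau \
        trainer=gpu \
        criterion=joint_ctc_cross_entropy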

wandb: Run data is saved locally in /home/donguk/Workspace/openspeech/outputs/2021-09-15/11-09-46/wandb/run-20210915_110954-1r2k6x8s
wandb: Run `wandb offline` to turn off syncing.


  | Name      | Type                     | Params
-------------------------------------------------------
0 | criterion | JointCTCCrossEntropyLoss | 0
1 | encoder   | ConformerEncoder         | 115 M
2 | decoder   | LSTMAttentionDecoder     | 7.3 M
-------------------------------------------------------
123 M     Trainable params
0         Non-trainable params
123 M     Total params
492.532   Total estimated model params size (MB)
Global seed set to 1
Epoch 0:   0%|                                                                                                                                                                                                                                                 | 0/1242045 [00:00<03:16, 6335.81it/s]/home/donguk/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:405: LightningDeprecationWarning: One of the returned values {'predictions', 'learning_rate', 'cross_entropy_loss', 'targets', 'logits', 'ctc_loss'} has a `grad_fn`. We will detach it automatically but this behaviour will change in v1.6. Please detach it manually: `return {'loss': ..., 'something': something.detach()}`
  warning_cache.deprecation(
Error executing job with overrides: ['dataset=ksponspeech', 'dataset.dataset_path=/home/donguk/nas0/poodle/speech_dataset/wav/aihub2019', 'dataset.manifest_file_path=/home/donguk/Workspace/openspeech/openspeech/datasets/ksponspeech/kspon_manifest', 'dataset.test_dataset_path=/home/donguk/nas0/poodle/speech_dataset/wav/aihub2019', 'dataset.test_manifest_dir=/home/donguk/Workspace/openspeech/openspeech/datasets/ksponspeech/test_manifest', 'tokenizer=kspon_character', 'model=conformer_lstm', 'audio=melspectrogram', 'lr_scheduler=warmup_reduce_lr_on_plateau', 'trainer=gpu', 'criterion=joint_ctc_cross_entropy']
Traceback (most recent call last):
  File "/home/donguk/Workspace/openspeech/./openspeech_cli/hydra_train.py", line 62, in <module>
    hydra_main()
  File "/home/donguk/miniconda3/lib/python3.9/site-packages/hydra/main.py", line 48, in decorated_main
    _run_hydra(
  File "/home/donguk/miniconda3/lib/python3.9/site-packages/hydra/_internal/utils.py", line 377, in _run_hydra
    run_and_report(
  File "/home/donguk/miniconda3/lib/python3.9/site-packages/hydra/_internal/utils.py", line 214, in run_and_report
    raise ex
  File "/home/donguk/miniconda3/lib/python3.9/site-packages/hydra/_internal/utils.py", line 211, in run_and_report
    return func()
  File "/home/donguk/miniconda3/lib/python3.9/site-packages/hydra/_internal/utils.py", line 378, in <lambda>
    lambda: hydra.run(
  File "/home/donguk/miniconda3/lib/python3.9/site-packages/hydra/_internal/hydra.py", line 111, in run
    _ = ret.return_value
  File "/home/donguk/miniconda3/lib/python3.9/site-packages/hydra/core/utils.py", line 233, in return_value
    raise self._return_value
  File "/home/donguk/miniconda3/lib/python3.9/site-packages/hydra/core/utils.py", line 160, in run_job
    ret.return_value = task_function(task_cfg)
  File "/home/donguk/Workspace/openspeech/./openspeech_cli/hydra_train.py", line 56, in hydra_main
    trainer.fit(model, data_module)
  File "/home/donguk/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 552, in fit
    self._run(model)
  File "/home/donguk/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 917, in _run
    self._dispatch()
  File "/home/donguk/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 985, in _dispatch
    self.accelerator.start_training(self)
  File "/home/donguk/miniconda3/lib/python3.9/site-packages/pytorch_lightning/accelerators/accelerator.py", line 92, in start_training
    self.training_type_plugin.start_training(trainer)
  File "/home/donguk/miniconda3/lib/python3.9/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 161, in start_training
    self._results = trainer.run_stage()
  File "/home/donguk/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 995, in run_stage
    return self._run_train()
  File "/home/donguk/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1044, in _run_train
    self.fit_loop.run()
  File "/home/donguk/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/base.py", line 111, in run
    self.advance(*args, **kwargs)
  File "/home/donguk/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/fit_loop.py", line 200, in advance
    epoch_output = self.epoch_loop.run(train_dataloader)
  File "/home/donguk/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/base.py", line 111, in run
    self.advance(*args, **kwargs)
  File "/home/donguk/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 130, in advance
    batch_output = self.batch_loop.run(batch, self.iteration_count, self._dataloader_idx)
  File "/home/donguk/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 100, in run
    super().run(batch, batch_idx, dataloader_idx)
  File "/home/donguk/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/base.py", line 111, in run
    self.advance(*args, **kwargs)
  File "/home/donguk/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 147, in advance
    result = self._run_optimization(batch_idx, split_batch, opt_idx, optimizer)
  File "/home/donguk/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 201, in _run_optimization
    self._optimizer_step(optimizer, opt_idx, batch_idx, closure)
  File "/home/donguk/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 395, in _optimizer_step
    model_ref.optimizer_step(
  File "/home/donguk/miniconda3/lib/python3.9/site-packages/pytorch_lightning/core/lightning.py", line 1618, in optimizer_step
    optimizer.step(closure=optimizer_closure)
  File "/home/donguk/miniconda3/lib/python3.9/site-packages/pytorch_lightning/core/optimizer.py", line 209, in step
    self.__optimizer_step(*args, closure=closure, profiler_name=profiler_name, **kwargs)
  File "/home/donguk/miniconda3/lib/python3.9/site-packages/pytorch_lightning/core/optimizer.py", line 129, in __optimizer_step
    trainer.accelerator.optimizer_step(optimizer, self._optimizer_idx, lambda_closure=closure, **kwargs)
  File "/home/donguk/miniconda3/lib/python3.9/site-packages/pytorch_lightning/accelerators/accelerator.py", line 296, in optimizer_step
    self.run_optimizer_step(optimizer, opt_idx, lambda_closure, **kwargs)
  File "/home/donguk/miniconda3/lib/python3.9/site-packages/pytorch_lightning/accelerators/accelerator.py", line 303, in run_optimizer_step
    self.training_type_plugin.optimizer_step(optimizer, lambda_closure=lambda_closure, **kwargs)
  File "/home/donguk/miniconda3/lib/python3.9/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 226, in optimizer_step
    optimizer.step(closure=lambda_closure, **kwargs)
  File "/home/donguk/miniconda3/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 26, in decorate_context
    return func(*args, **kwargs)
  File "/home/donguk/miniconda3/lib/python3.9/site-packages/torch/optim/adam.py", line 66, in step
    loss = closure()
  File "/home/donguk/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 235, in _training_step_and_backward_closure
    result = self.training_step_and_backward(split_batch, batch_idx, opt_idx, optimizer, hiddens)
  File "/home/donguk/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 548, in training_step_and_backward
    self.backward(result, optimizer, opt_idx)
  File "/home/donguk/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 589, in backward
    result.closure_loss = self.trainer.accelerator.backward(result.closure_loss, optimizer, *args, **kwargs)
  File "/home/donguk/miniconda3/lib/python3.9/site-packages/pytorch_lightning/accelerators/accelerator.py", line 276, in backward
    self.precision_plugin.backward(self.lightning_module, closure_loss, *args, **kwargs)
  File "/home/donguk/miniconda3/lib/python3.9/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 78, in backward
    model.backward(closure_loss, optimizer, *args, **kwargs)
  File "/home/donguk/miniconda3/lib/python3.9/site-packages/pytorch_lightning/core/lightning.py", line 1481, in backward
    loss.backward(*args, **kwargs)
  File "/home/donguk/miniconda3/lib/python3.9/site-packages/torch/tensor.py", line 221, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/donguk/miniconda3/lib/python3.9/site-packages/torch/autograd/__init__.py", line 130, in backward
    Variable._execution_engine.run_backward(
RuntimeError: CUDA error: misaligned address

The dataset is KsponSpeech, about 120,000 sentences including augmentation.

My environment is an RTX 2080 Ti running Ubuntu 18.04. Could this be a problem caused by insufficient GPU memory?
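
(Whether GPU memory is the bottleneck can be checked directly while the job runs; a generic check, not from the original report:)

# poll overall GPU state once a second in another shell
watch -n 1 nvidia-smi

# or log just the memory columns
nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 1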

w4-donguk avatar Sep 16 '21 01:09 w4-donguk

Yes, the conformer-lstm model is quite large, and an RTX 2080 Ti will fall well short on memory for it. Since the RTX 2080 Ti doesn't have much memory, you will likely need to reduce the batch size substantially even with other models.

sooftware avatar Sep 16 '21 03:09 sooftware

@sooftware Which GPU did you use, or which would you recommend?

jun-danieloh avatar Sep 16 '21 05:09 jun-danieloh

@jun-danieloh Of course, bigger memory is better. However, if that is not feasible, I recommend using a smaller model (deepspeech2, LAS, etc.) or reducing the batch size.
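
For example, via the same Hydra overrides, something like the sketch below. The `model=deepspeech2`, `criterion=ctc`, and `trainer.batch_size` names are assumptions about openspeech's config registry rather than confirmed values:

python ./openspeech_cli/hydra_train.py \
        dataset=ksponspeech \
        dataset.dataset_path=$DATASET_PATH \
        dataset.manifest_file_path=$MANIFEST_FILE_PATH \
        tokenizer=kspon_character \
        model=deepspeech2 \
        audio=melspectrogram \
        trainer=gpu \
        trainer.batch_size=4 \
        criterion=ctc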

sooftware avatar Sep 16 '21 08:09 sooftware

@sooftware Which GPU did you mainly use when you were building the Openspeech GitHub repo?

jun-danieloh avatar Sep 16 '21 10:09 jun-danieloh

In the past I trained on an RTX 2080 Ti, and I have also used a V100. On the RTX 2080 Ti, I would train the LAS model with a small batch size of around 4 to 6, or exclude the longer audio samples from training.
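
Excluding long utterances can be done by filtering the manifest before training. A rough sketch, assuming a tab-separated manifest whose first column is a wav path relative to $DATASET_PATH and that soxi (from SoX) is installed; none of this is from the original thread:

# keep only utterances shorter than 20 seconds (threshold is arbitrary)
while IFS=$'\t' read -r path rest; do
    # soxi -D prints the duration in seconds; skip unreadable files
    dur=$(soxi -D "$DATASET_PATH/$path" 2>/dev/null)
    if [ -n "$dur" ] && awk -v d="$dur" 'BEGIN { exit !(d < 20) }'; then
        printf '%s\t%s\n' "$path" "$rest"
    fi
done < "$MANIFEST_FILE_PATH" > kspon_manifest.short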

sooftware avatar Sep 16 '21 13:09 sooftware