RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling `cublasCreate(handle)`
❓ Questions & Help
Hello. While running the training code for the conformer-lstm model on the KsponSpeech dataset, I ran into the error below, so I'm leaving a question.
When I first ran it with the default batch size of 32, I got a CUDA out-of-memory error; the error below occurs after reducing the batch size to 1.
There were suggestions that this is a pytorch-lightning version issue, so I tested 1.2.7, 1.3.0, and 1.4.5, but all of them give the same error.
I ran it with the following command:
HYDRA_FULL_ERROR=1 \
python ./openspeech_cli/hydra_train.py \
dataset=ksponspeech \
dataset.dataset_path=$DATASET_PATH \
dataset.manifest_file_path=$MANIFEST_FILE_PATH \
dataset.test_dataset_path=$DATASET_PATH \
dataset.test_manifest_dir=$TEST_MANIFEST_FILE_PATH \
tokenizer=kspon_character \
model=conformer_lstm \
audio=melspectrogram \
lr_scheduler=warmup_reduce_lr_on_plateau \
trainer=gpu \
criterion=joint_ctc_cross_entropy
The error log is as follows:
wandb: Run data is saved locally in /home/donguk/Workspace/openspeech/outputs/2021-09-15/11-29-52/wandb/run-20210915_113001-1ic3zpv5
wandb: Run `wandb offline` to turn off syncing.
| Name | Type | Params
-------------------------------------------------------
0 | criterion | JointCTCCrossEntropyLoss | 0
1 | encoder | ConformerEncoder | 115 M
2 | decoder | LSTMAttentionDecoder | 7.3 M
-------------------------------------------------------
123 M Trainable params
0 Non-trainable params
123 M Total params
492.532 Total estimated model params size (MB)
Global seed set to 1
Epoch 0: 0%| | 0/1242045 [00:00<03:02, 6820.01it/s]/home/donguk/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:405: LightningDeprecationWarning: One of the returned values {'learning_rate', 'predictions', 'cross_entropy_loss', 'targets', 'logits', 'ctc_loss'} has a `grad_fn`. We will detach it automatically but this behaviour will change in v1.6. Please detach it manually: `return {'loss': ..., 'something': something.detach()}`
warning_cache.deprecation(
Error executing job with overrides: ['dataset=ksponspeech', 'dataset.dataset_path=/home/donguk/nas0/poodle/speech_dataset/wav/aihub2019', 'dataset.manifest_file_path=/home/donguk/Workspace/openspeech/openspeech/datasets/ksponspeech/kspon_manifest', 'dataset.test_dataset_path=/home/donguk/nas0/poodle/speech_dataset/wav/aihub2019', 'dataset.test_manifest_dir=/home/donguk/Workspace/openspeech/openspeech/datasets/ksponspeech/test_manifest', 'tokenizer=kspon_character', 'model=conformer_lstm', 'audio=melspectrogram', 'lr_scheduler=warmup_reduce_lr_on_plateau', 'trainer=gpu', 'criterion=joint_ctc_cross_entropy']
Traceback (most recent call last):
File "/home/donguk/Workspace/openspeech/./openspeech_cli/hydra_train.py", line 62, in <module>
hydra_main()
File "/home/donguk/miniconda3/lib/python3.9/site-packages/hydra/main.py", line 48, in decorated_main
_run_hydra(
File "/home/donguk/miniconda3/lib/python3.9/site-packages/hydra/_internal/utils.py", line 377, in _run_hydra
run_and_report(
File "/home/donguk/miniconda3/lib/python3.9/site-packages/hydra/_internal/utils.py", line 214, in run_and_report
raise ex
File "/home/donguk/miniconda3/lib/python3.9/site-packages/hydra/_internal/utils.py", line 211, in run_and_report
return func()
File "/home/donguk/miniconda3/lib/python3.9/site-packages/hydra/_internal/utils.py", line 378, in <lambda>
lambda: hydra.run(
File "/home/donguk/miniconda3/lib/python3.9/site-packages/hydra/_internal/hydra.py", line 111, in run
_ = ret.return_value
File "/home/donguk/miniconda3/lib/python3.9/site-packages/hydra/core/utils.py", line 233, in return_value
raise self._return_value
File "/home/donguk/miniconda3/lib/python3.9/site-packages/hydra/core/utils.py", line 160, in run_job
ret.return_value = task_function(task_cfg)
File "/home/donguk/Workspace/openspeech/./openspeech_cli/hydra_train.py", line 56, in hydra_main
trainer.fit(model, data_module)
File "/home/donguk/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 552, in fit
self._run(model)
File "/home/donguk/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 917, in _run
self._dispatch()
File "/home/donguk/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 985, in _dispatch
self.accelerator.start_training(self)
File "/home/donguk/miniconda3/lib/python3.9/site-packages/pytorch_lightning/accelerators/accelerator.py", line 92, in start_training
self.training_type_plugin.start_training(trainer)
File "/home/donguk/miniconda3/lib/python3.9/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 161, in start_training
self._results = trainer.run_stage()
File "/home/donguk/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 995, in run_stage
return self._run_train()
File "/home/donguk/miniconda3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1044, in _run_train
self.fit_loop.run()
File "/home/donguk/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/base.py", line 111, in run
self.advance(*args, **kwargs)
File "/home/donguk/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/fit_loop.py", line 200, in advance
epoch_output = self.epoch_loop.run(train_dataloader)
File "/home/donguk/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/base.py", line 111, in run
self.advance(*args, **kwargs)
File "/home/donguk/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 130, in advance
batch_output = self.batch_loop.run(batch, self.iteration_count, self._dataloader_idx)
File "/home/donguk/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 100, in run
super().run(batch, batch_idx, dataloader_idx)
File "/home/donguk/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/base.py", line 111, in run
self.advance(*args, **kwargs)
File "/home/donguk/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 147, in advance
result = self._run_optimization(batch_idx, split_batch, opt_idx, optimizer)
File "/home/donguk/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 201, in _run_optimization
self._optimizer_step(optimizer, opt_idx, batch_idx, closure)
File "/home/donguk/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 395, in _optimizer_step
model_ref.optimizer_step(
File "/home/donguk/miniconda3/lib/python3.9/site-packages/pytorch_lightning/core/lightning.py", line 1618, in optimizer_step
optimizer.step(closure=optimizer_closure)
File "/home/donguk/miniconda3/lib/python3.9/site-packages/pytorch_lightning/core/optimizer.py", line 209, in step
self.__optimizer_step(*args, closure=closure, profiler_name=profiler_name, **kwargs)
File "/home/donguk/miniconda3/lib/python3.9/site-packages/pytorch_lightning/core/optimizer.py", line 129, in __optimizer_step
trainer.accelerator.optimizer_step(optimizer, self._optimizer_idx, lambda_closure=closure, **kwargs)
File "/home/donguk/miniconda3/lib/python3.9/site-packages/pytorch_lightning/accelerators/accelerator.py", line 296, in optimizer_step
self.run_optimizer_step(optimizer, opt_idx, lambda_closure, **kwargs)
File "/home/donguk/miniconda3/lib/python3.9/site-packages/pytorch_lightning/accelerators/accelerator.py", line 303, in run_optimizer_step
self.training_type_plugin.optimizer_step(optimizer, lambda_closure=lambda_closure, **kwargs)
File "/home/donguk/miniconda3/lib/python3.9/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 226, in optimizer_step
optimizer.step(closure=lambda_closure, **kwargs)
File "/home/donguk/miniconda3/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 26, in decorate_context
return func(*args, **kwargs)
File "/home/donguk/miniconda3/lib/python3.9/site-packages/torch/optim/adam.py", line 66, in step
loss = closure()
File "/home/donguk/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 235, in _training_step_and_backward_closure
result = self.training_step_and_backward(split_batch, batch_idx, opt_idx, optimizer, hiddens)
File "/home/donguk/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 548, in training_step_and_backward
self.backward(result, optimizer, opt_idx)
File "/home/donguk/miniconda3/lib/python3.9/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 589, in backward
result.closure_loss = self.trainer.accelerator.backward(result.closure_loss, optimizer, *args, **kwargs)
File "/home/donguk/miniconda3/lib/python3.9/site-packages/pytorch_lightning/accelerators/accelerator.py", line 276, in backward
self.precision_plugin.backward(self.lightning_module, closure_loss, *args, **kwargs)
File "/home/donguk/miniconda3/lib/python3.9/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 78, in backward
model.backward(closure_loss, optimizer, *args, **kwargs)
File "/home/donguk/miniconda3/lib/python3.9/site-packages/pytorch_lightning/core/lightning.py", line 1481, in backward
loss.backward(*args, **kwargs)
File "/home/donguk/miniconda3/lib/python3.9/site-packages/torch/tensor.py", line 221, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/home/donguk/miniconda3/lib/python3.9/site-packages/torch/autograd/__init__.py", line 130, in backward
Variable._execution_engine.run_backward(
RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling `cublasCreate(handle)`
For debugging, when I ran it with the CUDA_LAUNCH_BLOCKING=1 option, the output is as follows:
wandb: Run data is saved locally in /home/donguk/Workspace/openspeech/outputs/2021-09-15/11-09-46/wandb/run-20210915_110954-1r2k6x8s
wandb: Run `wandb offline` to turn off syncing.
[... identical model summary, deprecation warning, and Hydra/PyTorch Lightning stack trace as in the first log above, ending with ...]
RuntimeError: CUDA error: misaligned address
The dataset is KsponSpeech, about 120,000 utterances including augmentation.
My development environment is an RTX 2080 Ti on Ubuntu 18.04. Could this problem be caused by insufficient GPU memory?
Yes, the conformer-lstm model is quite large. With an RTX 2080 Ti you will be far short on memory.
The RTX 2080 Ti does not have much memory, so even if you use a different model you will probably have to reduce the batch size a lot.
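As a quick sanity check (a general diagnostic, not specific to openspeech), you can watch GPU memory fill up while the first few batches run:
# Refresh GPU memory/utilization once per second during training
watch -n 1 nvidia-smi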
@sooftware Which GPU did you use or recommend?
@jun-danieloh Of course, bigger memory is better. However, if that is not feasible, it is recommended to use a smaller model (deepspeech2, las, etc.) or to reduce the batch size.
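For reference, here is a hedged sketch of such an override on the same CLI. The batch size key is assumed to be trainer.batch_size; verify it against your installed configs (e.g. by appending --cfg job to print the composed config). Note that switching to a CTC-only model such as deepspeech2 would also require a matching criterion (e.g. criterion=ctc).
# Hedged sketch: same run, but with a much smaller batch size.
# trainer.batch_size is an assumed key name; check it with --cfg job first.
python ./openspeech_cli/hydra_train.py \
    dataset=ksponspeech \
    dataset.dataset_path=$DATASET_PATH \
    dataset.manifest_file_path=$MANIFEST_FILE_PATH \
    tokenizer=kspon_character \
    model=conformer_lstm \
    audio=melspectrogram \
    lr_scheduler=warmup_reduce_lr_on_plateau \
    trainer=gpu \
    criterion=joint_ctc_cross_entropy \
    trainer.batch_size=4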
@sooftware When you were building the Openspeech GitHub repo, which GPU did you mainly use?
In the past I trained with an RTX 2080 Ti, and I have also tried a V100. With the RTX 2080 Ti, I would train the LAS model with a small batch size of around 4 to 6, or exclude the longer audio samples from training.
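To exclude long audio, one option is to filter the manifest before training. A minimal Python sketch, assuming a tab-separated manifest whose first column is the audio path; the file names and the 15-second cap are hypothetical, so adjust them to your actual manifest format:
# Hedged sketch: write a copy of the manifest without long utterances.
# Assumes the first tab-separated column of each line is an audio path.
import soundfile as sf

MAX_SECONDS = 15.0  # hypothetical cap; tune for your GPU memory

with open("train_manifest.txt", encoding="utf-8") as src, \
        open("train_manifest_short.txt", "w", encoding="utf-8") as dst:
    for line in src:
        audio_path = line.rstrip("\n").split("\t")[0]
        if sf.info(audio_path).duration <= MAX_SECONDS:
            dst.write(line)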