TensorFlowASR
Low GPU utilization for multi-GPU training
I have trained a Conformer model on my own custom Thai dataset. However, GPU utilization seems quite low and training is quite slow (~2 s/batch); the GPUs sit at around 5-10% utilization. Is there any way to debug this problem?
For training, I simply edited the config in examples/conformer/config.yaml and ran:
$ python examples/conformer/train_conformer.py --device 0 1 2 3
Software specification:
- OS: Debian GNU/Linux 10 (buster), kernel 4.19.0-9-cloud-amd64 x86_64
- GPUs: NVIDIA Tesla V100, 16 GB
- TensorFlowASR installed by building from source
config.yaml
speech_config:
  sample_rate: 16000
  frame_ms: 25
  stride_ms: 10
  num_feature_bins: 80
  feature_type: log_mel_spectrogram
  preemphasis: 0.97
  normalize_signal: True
  normalize_feature: True
  normalize_per_feature: False

decoder_config:
  vocabulary: vocabularies/thai.characters
  target_vocab_size: 1024
  max_subword_length: 4
  blank_at_zero: True
  beam_width: 5
  norm_score: True

model_config:
  name: conformer
  subsampling:
    type: conv2d
    filters: 144
    kernel_size: 3
    strides: 2
  positional_encoding: sinusoid_concat
  dmodel: 144
  num_blocks: 16
  head_size: 36
  num_heads: 4
  mha_type: relmha
  kernel_size: 32
  fc_factor: 0.5
  dropout: 0.1
  embed_dim: 320
  embed_dropout: 0.1
  num_rnns: 1
  rnn_units: 320
  rnn_type: lstm
  layer_norm: True
  joint_dim: 320

learning_config:
  augmentations:
    after:
      time_masking:
        num_masks: 10
        mask_factor: 100
        p_upperbound: 0.05
      freq_masking:
        num_masks: 1
        mask_factor: 27
  dataset_config:
    train_paths:
      - /home/chompk/trainv1_trainscript.tsv
    eval_paths:
      - /home/chompk/valv1_trainscript.tsv
    test_paths:
      - /mnt/d/SpeechProcessing/Datasets/LibriSpeech/test-clean/transcripts.tsv
    tfrecords_dir: null
  optimizer_config:
    warmup_steps: 40000
    beta1: 0.9
    beta2: 0.98
    epsilon: 1e-9
  running_config:
    batch_size: 4
    accumulation_steps: 4
    num_epochs: 20
    outdir: /mnt/d/SpeechProcessing/Trained/local/conformer
    log_interval_steps: 300
    eval_interval_steps: 500
    save_interval_steps: 1000
GPU Utilization
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.01 Driver Version: 418.87.01 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... Off | 00000000:00:04.0 Off | 0 |
| N/A 40C P0 58W / 300W | 15752MiB / 16130MiB | 5% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... Off | 00000000:00:05.0 Off | 0 |
| N/A 38C P0 66W / 300W | 15704MiB / 16130MiB | 5% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-SXM2... Off | 00000000:00:06.0 Off | 0 |
| N/A 40C P0 66W / 300W | 15752MiB / 16130MiB | 4% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-SXM2... Off | 00000000:00:07.0 Off | 0 |
| N/A 39C P0 58W / 300W | 15704MiB / 16130MiB | 7% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 9598 C python 15741MiB |
| 1 9598 C python 15693MiB |
| 2 9598 C python 15741MiB |
| 3 9598 C python 15693MiB |
+-----------------------------------------------------------------------------+
Training Steps Example
> Start evaluation ...
[Eval] [Step 1000] |████████████████████| 4423/4423 [22:08<00:00, 3.33batch/s, transducer_loss=171.865
> End evaluation ...
[Train] [Epoch 1/20] | | 1500/796100 [1:38:28<421:04:34, 1.91s/batch, transducer_loss=159.42458]
> Start evaluation ...
[Eval] [Step 1500] |████████████████████| 4423/4423 [23:15<00:00, 3.17batch/s, transducer_loss=153.2395]
> End evaluation ...
[Train] [Epoch 1/20] | | 2000/796100 [2:18:06<456:58:56, 2.07s/batch, transducer_loss=140.7582]
> Start evaluation ...
[Eval] [Step 2000] |████████████████████| 4423/4423 [22:36<00:00, 3.26batch/s, transducer_loss=137.00543]
> End evaluation ...
[Train] [Epoch 1/20] | | 2500/796100 [2:57:05<409:56:45, 1.86s/batch, transducer_loss=126.64603]
> Start evaluation ...
[Eval] [Step 2500] |████████████████████| 4423/4423 [22:52<00:00, 3.22batch/s, transducer_loss=126.15583]
> End evaluation ...
[Train] [Epoch 1/20] | | 2648/796100 [3:23:48<506:25:46, 2.30s/batch, transducer_loss=125.96002
This is weird, try using --mxp, mixed precision is faster. Anyway, I've tested on an RTX 2080 Ti and GPU usage is around 30-70%.
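For context, this is the generic TensorFlow mechanism behind mixed precision (not necessarily exactly what the --mxp flag does internally): computations run in float16 while variables stay in float32, which lets the V100's Tensor Cores handle the matmuls. A minimal sketch, assuming TF >= 2.4 where the API is no longer experimental:

import tensorflow as tf

# Compute in float16, keep variables in float32.
tf.keras.mixed_precision.set_global_policy("mixed_float16")
print(tf.keras.mixed_precision.global_policy())  # <Policy "mixed_float16">

# In a custom training loop, wrap the optimizer so the loss is scaled
# before the backward pass to avoid float16 underflow.
opt = tf.keras.mixed_precision.LossScaleOptimizer(tf.keras.optimizers.Adam(1e-4))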
You could also try caching the dataset using the cache flag if you have enough RAM. After the first epoch I had at least 2 batch/s, if I remember correctly, on a T4 with high GPU utilization.
Sorry for the stupid question, but how do I use the cache flag?
@tann9949 Sure, just pass --cache in your training call. You can use TFRecords too: pass --tfrecords and specify a directory for them to be stored in via tfrecords_dir: inside your config.yml. The records will be created for you the first time you run the training script if they don't exist yet.
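Putting those flags together (assuming they are accepted by the script exactly as described above), the earlier training call would look something like:
$ python examples/conformer/train_conformer.py --device 0 1 2 3 --mxp --cache --tfrecords
with tfrecords_dir: in config.yaml pointing at a writable directory instead of null.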
@bill-kalog Thanks for the reply!
I've tried this method and it speeds things up by ~1.2 s/batch. Still, GPU utilization remains at merely 5%:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.01 Driver Version: 418.87.01 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... Off | 00000000:00:04.0 Off | 0 |
| N/A 40C P0 58W / 300W | 15752MiB / 16130MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... Off | 00000000:00:05.0 Off | 0 |
| N/A 39C P0 58W / 300W | 15704MiB / 16130MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-SXM2... Off | 00000000:00:06.0 Off | 0 |
| N/A 40C P0 57W / 300W | 15752MiB / 16130MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-SXM2... Off | 00000000:00:07.0 Off | 0 |
| N/A 39C P0 57W / 300W | 15704MiB / 16130MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 18286 C python 15741MiB |
| 1 18286 C python 15693MiB |
| 2 18286 C python 15741MiB |
| 3 18286 C python 15693MiB |
+-----------------------------------------------------------------------------+
I'm not sure if it's related to the sequence lengths of my audio files; my maximum sequence length is around 30 seconds. I have no idea how to debug this problem.
@tann9949 I noticed that from step 2500 the speed drops from 3.1 batch/s to 2 s/batch. Was the usage still around 5% within those first 2500 steps?
The 2 s/batch was during training; 3.1 batch/s was during evaluation. GPU usage is around 0-5% during training and 20-40% during eval.
@tann9949 My bad, so the problem might lie in optimizer.apply_gradients or tape.gradient. Can you try running with train_ga_conformer.py? It uses gradient accumulation training.
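For anyone unfamiliar with the term: gradient accumulation sums the gradients of several small batches and applies them in a single optimizer step, trading step frequency for a larger effective batch size. A minimal generic sketch of the idea (toy model and data as placeholders; this is not the actual train_ga_conformer.py code):

import tensorflow as tf

# Toy stand-ins so the sketch runs on its own; swap in the real model/data.
model = tf.keras.Sequential([tf.keras.layers.Dense(8), tf.keras.layers.Dense(1)])
model.build((None, 16))
optimizer = tf.keras.optimizers.Adam(1e-3)
loss_fn = tf.keras.losses.MeanSquaredError()
dataset = tf.data.Dataset.from_tensor_slices(
    (tf.random.normal([64, 16]), tf.random.normal([64, 1]))).batch(4)

ACCUM_STEPS = 4  # like accumulation_steps: 4 in the config above
accum = [tf.Variable(tf.zeros_like(v), trainable=False)
         for v in model.trainable_variables]

for step, (x, y) in enumerate(dataset):
    with tf.GradientTape() as tape:
        loss = loss_fn(y, model(x, training=True))
    grads = tape.gradient(loss, model.trainable_variables)
    for acc, g in zip(accum, grads):
        acc.assign_add(g / ACCUM_STEPS)        # average over the window
    if (step + 1) % ACCUM_STEPS == 0:          # one optimizer step per window
        optimizer.apply_gradients(zip(accum, model.trainable_variables))
        for acc in accum:
            acc.assign(tf.zeros_like(acc))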
For some reason, this gets even worse
@tann9949 Yeah, as expected, because it runs with a larger effective batch size. Can you train on LibriSpeech, so I can see whether it's the GPU, the code, or the data?
@usimarit I've run model training on LibriSpeech with this configuration:
config.yaml
speech_config:
  sample_rate: 16000
  frame_ms: 25
  stride_ms: 10
  num_feature_bins: 80
  feature_type: log_mel_spectrogram
  preemphasis: 0.97
  normalize_signal: True
  normalize_feature: True
  normalize_per_feature: False

decoder_config:
  vocabulary: null
  target_vocab_size: 1024
  max_subword_length: 4
  blank_at_zero: True
  beam_width: 5
  norm_score: True

model_config:
  name: conformer
  subsampling:
    type: conv2d
    filters: 144
    kernel_size: 3
    strides: 2
  positional_encoding: sinusoid_concat
  dmodel: 144
  num_blocks: 16
  head_size: 36
  num_heads: 4
  mha_type: relmha
  kernel_size: 32
  fc_factor: 0.5
  dropout: 0.1
  embed_dim: 320
  embed_dropout: 0.1
  num_rnns: 1
  rnn_units: 320
  rnn_type: lstm
  layer_norm: True
  joint_dim: 320

learning_config:
  augmentations:
    after:
      time_masking:
        num_masks: 10
        mask_factor: 100
        p_upperbound: 0.05
      freq_masking:
        num_masks: 1
        mask_factor: 27
  dataset_config:
    train_paths:
      - /home/chompk/librispeech/LibriSpeech/train-clean-100/transcript.tsv
    eval_paths:
      - /home/chompk/librispeech/LibriSpeech/dev-clean/transcript.tsv
      - /home/chompk/librispeech/LibriSpeech/dev-other/transcript.tsv
    test_paths:
      - /home/chompk/librispeech/LibriSpeech/test-clean/transcript.tsv
    tfrecords_dir: /home/chompk/tfrecords_data
  optimizer_config:
    warmup_steps: 40000
    beta1: 0.9
    beta2: 0.98
    epsilon: 1e-9
  running_config:
    batch_size: 4
    accumulation_steps: 4
    num_epochs: 20
    outdir: /home/chompk/conformer_libri
    log_interval_steps: 300
    eval_interval_steps: 500
    save_interval_steps: 1000
Still, GPU utilization is around 0%
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.01 Driver Version: 418.87.01 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... Off | 00000000:00:04.0 Off | 0 |
| N/A 39C P0 63W / 300W | 15704MiB / 16130MiB | 9% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... Off | 00000000:00:05.0 Off | 0 |
| N/A 40C P0 69W / 300W | 15752MiB / 16130MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-SXM2... Off | 00000000:00:06.0 Off | 0 |
| N/A 39C P0 65W / 300W | 15752MiB / 16130MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-SXM2... Off | 00000000:00:07.0 Off | 0 |
| N/A 40C P0 60W / 300W | 15704MiB / 16130MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 9007 C python 15693MiB |
| 1 9007 C python 15741MiB |
| 2 9007 C python 15741MiB |
| 3 9007 C python 15693MiB |
+-----------------------------------------------------------------------------+
Training ran at around 2.06 s/batch (without gradient accumulation):
[Train] [Epoch 1/20] |▏ | 258/35660 [12:16<22:02:21, 2.24s/batch, transducer_loss=1021.1291]
I've also tried LibriSpeech training with train_ga_conformer.py. It's still slower, but GPU utilization is better:
[Train] [Epoch 1/20] | | 8/8900 [10:58<141:30:15, 57.29s/batch, transducer_loss=1505.1484]
nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.01 Driver Version: 418.87.01 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... Off | 00000000:00:04.0 Off | 0 |
| N/A 39C P0 94W / 300W | 15626MiB / 16130MiB | 15% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... Off | 00000000:00:05.0 Off | 0 |
| N/A 41C P0 69W / 300W | 15626MiB / 16130MiB | 21% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-SXM2... Off | 00000000:00:06.0 Off | 0 |
| N/A 39C P0 66W / 300W | 15626MiB / 16130MiB | 32% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-SXM2... Off | 00000000:00:07.0 Off | 0 |
| N/A 40C P0 60W / 300W | 15626MiB / 16130MiB | 25% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 15032 C python 15615MiB |
| 1 15032 C python 15615MiB |
| 2 15032 C python 15615MiB |
| 3 15032 C python 15615MiB |
+-----------------------------------------------------------------------------+
@tann9949 So the problem is not the dataset; it might be TensorFlow and this V100 GPU.
@tann9949 How is the CPU utilization? Is it at 100%?
I've used 8 vCPUs and 30 GB of memory. Each CPU core was around 40-50% utilized. I'm not sure whether feature extraction is the bottleneck, but from what I've tried, using TFRecords speeds things up the most (from ~2.5 s/batch to ~1.4 s/batch). Using --mxp and --cache doesn't help that much.
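One way to check whether feature extraction / the input pipeline is the bottleneck is to time the tf.data pipeline on its own, with no model attached: if it only barely exceeds the training throughput, the GPUs are being starved by the CPU side. A generic sketch (the dataset here is a dummy placeholder; swap in the actual training dataset object):

import time
import tensorflow as tf

def benchmark(dataset, num_batches=200):
    it = iter(dataset)
    next(it)                              # exclude one-off setup cost
    start = time.perf_counter()
    for _ in range(num_batches):
        next(it)
    elapsed = time.perf_counter() - start
    print(f"{num_batches / elapsed:.2f} batches/s from the pipeline alone")

# Dummy stand-in: 80-dim "features", batch size 4, with prefetching.
dummy = (tf.data.Dataset.from_tensor_slices(tf.zeros([10000, 80]))
         .batch(4)
         .prefetch(tf.data.experimental.AUTOTUNE))
benchmark(dummy)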
I think you can reproduce my error by training on LibriSpeech on a Google Cloud VM using the pytorch-1-4-cu101 image.
@tann9949 Please use https://www.tensorflow.org/tensorboard/tensorboard_profiling_keras to trace the reason :)))
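Concretely, that guide boils down to either wrapping a few training steps with the profiler API or letting the TensorBoard callback capture a batch; a short sketch (the log directory is arbitrary, and viewing the Profile tab requires the TensorBoard profiler plugin to be installed):

import tensorflow as tf

# Option 1: profile an arbitrary span of code (e.g. a few training steps).
tf.profiler.experimental.start("/tmp/tb_logs")
# ... run a handful of training steps here ...
tf.profiler.experimental.stop()

# Option 2: when training via model.fit(), profile a specific batch.
tb_cb = tf.keras.callbacks.TensorBoard(log_dir="/tmp/tb_logs", profile_batch=2)
# model.fit(..., callbacks=[tb_cb])

# Inspect the "Profile" tab with: tensorboard --logdir /tmp/tb_logs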
I'm also suffering from low GPU utilization, even with a single GPU. See the graph below. Details:
- TensorFlowASR v0.7.1
- train_ga_conformer
- config: see below

If I try to increase the batch size, it fails with OOM instantly, so this is the best I could get.
Config:
speech_config:
  sample_rate: 16000
  frame_ms: 25
  stride_ms: 10
  num_feature_bins: 80
  feature_type: log_mel_spectrogram
  preemphasis: 0.97
  normalize_signal: True
  normalize_feature: True
  normalize_per_feature: False

decoder_config:
  vocabulary: vocabularies/lithuanian.characters
  target_vocab_size: 4096
  max_subword_length: 4
  blank_at_zero: True
  beam_width: 5
  norm_score: True

model_config:
  name: conformer
  encoder_subsampling:
    type: conv2d
    filters: 144
    kernel_size: 3
    strides: 2
  encoder_positional_encoding: sinusoid_concat
  encoder_dmodel: 144
  encoder_num_blocks: 16
  encoder_head_size: 36
  encoder_num_heads: 4
  encoder_mha_type: relmha
  encoder_kernel_size: 32
  encoder_fc_factor: 0.5
  encoder_dropout: 0.1
  prediction_embed_dim: 320
  prediction_embed_dropout: 0
  prediction_num_rnns: 1
  prediction_rnn_units: 320
  prediction_rnn_type: lstm
  prediction_rnn_implementation: 2
  prediction_layer_norm: False
  prediction_projection_units: 0
  joint_dim: 640
  joint_activation: tanh

learning_config:
  train_dataset_config:
    use_tf: True
    augmentation_config:
      after:
        time_masking:
          num_masks: 10
          mask_factor: 100
          p_upperbound: 0.05
        freq_masking:
          num_masks: 1
          mask_factor: 27
    data_paths:
      - /tf_asr/manifests/cc_manifest_train.tsv
    tfrecords_dir: /tf_asr/tfrecords/cc/tfrecords-train
    shuffle: True
    cache: True
    buffer_size: 100
    drop_remainder: True
  eval_dataset_config:
    use_tf: True
    data_paths:
      - /tf_asr/manifests/cc_manifest_eval.tsv
    tfrecords_dir: /tf_asr/tfrecords/cc/tfrecords-eval
    shuffle: False
    cache: True
    buffer_size: 100
    drop_remainder: True
  test_dataset_config:
    use_tf: True
    data_paths:
      - /tf_asr/manifests/cc_manifest_test.tsv
    tfrecords_dir: /tf_asr/tfrecords/cc/tfrecords-test
    shuffle: False
    cache: True
    buffer_size: 100
    drop_remainder: True
  optimizer_config:
    warmup_steps: 40000
    beta1: 0.9
    beta2: 0.98
    epsilon: 1e-9
  running_config:
    batch_size: 8
    accumulation_steps: 16
    num_epochs: 20
    outdir: /tf_asr/models
    log_interval_steps: 300
    eval_interval_steps: 500
    save_interval_steps: 1000
    checkpoint:
      filepath: /tf_asr/models/checkpoints/{epoch:02d}.h5
      save_best_only: True
      save_weights_only: False
      save_freq: epoch
    states_dir: /tf_asr/models/states
    tensorboard:
      log_dir: /tf_asr/models/tensorboard
      histogram_freq: 1
      write_graph: True
      write_images: True
      update_freq: 'epoch'
      profile_batch: 2
I've been testing some models and I see that models without an RNN reach around 90-100% GPU utilization, while models with at least one RNN sit at around 25-70%. Do you guys have any ideas for improving GPU utilization with RNNs?
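One general TensorFlow detail that may matter here (not verified against this repo's layers): tf.keras.layers.LSTM only dispatches to the fused cuDNN kernel when it keeps the default tanh/sigmoid activations, use_bias=True, recurrent_dropout=0 and unroll=False; any other combination falls back to a much slower generic while-loop implementation, which shows up exactly as low GPU utilization. A sketch of a cuDNN-eligible layer:

import tensorflow as tf

# These arguments keep the cuDNN fast path on GPU.
fast_lstm = tf.keras.layers.LSTM(
    units=320,                      # matches rnn_units in the configs above
    activation="tanh",
    recurrent_activation="sigmoid",
    recurrent_dropout=0.0,
    use_bias=True,
    unroll=False,
    return_sequences=True,
)

x = tf.random.normal([8, 100, 144])  # (batch, time, features)
y = fast_lstm(x)                     # uses the cuDNN kernel when a GPU is present
print(y.shape)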
I also suffered from low and only occasional GPU utilization. Switching to Keras-based training helped.
Do you guys still have this issue?
I solved the problem by installing the CUDA and cuDNN drivers system-wide via sudo on Linux; for some reason, this problem happens when installing them through Anaconda.
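In case it helps anyone hitting the same thing, a quick way to confirm which CUDA/cuDNN build TensorFlow actually picked up (the build-info call assumes TF >= 2.3; the device listing works on older 2.x as well):

import tensorflow as tf

print("GPUs visible to TF:", tf.config.list_physical_devices("GPU"))
info = tf.sysconfig.get_build_info()   # TF >= 2.3
print("built with CUDA", info.get("cuda_version"),
      "and cuDNN", info.get("cudnn_version"))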