seq2seq
GPU utilization
I'm training a standard NMT model on a single GeForce GTX 1080 Ti (11 GB).
While the model is training, nvidia-smi shows that volatile GPU utilization stays below 25% while all CPU cores are busy. Why?
My guess is that the program spends most of its time on communication between the CPU and the GPU, which is pretty bad (see the input-throughput sketch below, after my configuration).
Has anybody else run into this issue?
The following is my configuration:
python3 -m bin.train \
  --config_paths="
      ./example_configs/nmt_small.yml,
      ./example_configs/train_seq2seq.yml,
      ./example_configs/text_metrics_bpe.yml" \
  --model_params "
      vocab_source: $VOCAB_SOURCE
      vocab_target: $VOCAB_TARGET
      source.max_seq_len: 5
      target.max_seq_len: 30" \
  --input_pipeline_train "
    class: ParallelTextInputPipeline
    params:
      source_files:
        - $TRAIN_SOURCES
      target_files:
        - $TRAIN_TARGETS" \
  --input_pipeline_dev "
    class: ParallelTextInputPipeline
    params:
      source_files:
        - $DEV_SOURCES
      target_files:
        - $DEV_TARGETS" \
  --batch_size 128 \
  --train_steps $TRAIN_STEPS \
  --output_dir $MODEL_DIR \
  --buckets 5,10 \
  --save_checkpoints_steps 900 \
  --eval_every_n_steps 1000
Also, I'm using TF 1.1 compiled from source, CUDA 8, and cuDNN 5.
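In case it helps, here is the input-throughput sketch I mentioned above: a minimal, standalone TF 1.x queue-based reader that only times how fast batches can be produced. It does not use the project's ParallelTextInputPipeline; the file name and reader-thread count are placeholders, and the batch size just mirrors my run. If this input-only throughput is much higher than the training steps/sec, the bottleneck is the model or session; if it is similar, the input pipeline is starving the GPU.

```python
import time
import tensorflow as tf

# Hypothetical source file; substitute your own $TRAIN_SOURCES path.
source_files = ["train.sources.txt"]

# Queue-based line reader, as in TF 1.x input pipelines.
filename_queue = tf.train.string_input_producer(source_files)
reader = tf.TextLineReader()
_, line = reader.read(filename_queue)
# Same batch size as the training run; 4 reader threads are a guess.
batch = tf.train.batch([line], batch_size=128, num_threads=4, capacity=1280)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    start = time.time()
    for _ in range(100):
        sess.run(batch)  # pull batches only, no model computation
    print("input-only throughput: %.1f batches/sec" % (100.0 / (time.time() - start)))
    coord.request_stop()
    coord.join(threads)
```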
@dennybritz Is it possible that the current implementation is optimized for a distributed multi-GPU environment, as in your experiments? If so, are there any tips for better utilization on a single GPU?
Which data is this on? Toy data or NMT data? I'm not too surprised that this is true for nmt_small.yml
as this really is a small model that may not make full use of the GPU. Also, what is the vocabulary size?
I have my own dataset; the vocabulary size is 10,000 BPE tokens.
Switching to the nmt_large model does not solve the problem: with the large model, GPU utilization stays under 30-35%.
Enlarging the model further (increasing the vocabulary size to 40K, adding more buckets, and increasing the sequence lengths) leads to better GPU utilization (80-90%). Thank you. But why does GPU utilization fluctuate between 5% and 90% during training? I'm using a similar model in Theano, and there GPU utilization is always at 100%. Is something different in TF?
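To see where the idle gaps in a step come from, a generic TF 1.x tracing sketch may help. It is not wired into bin.train; the 2048x2048 matmul is just a stand-in for the real training op. Running one step with full tracing and loading the resulting timeline.json in chrome://tracing shows whether that step is dominated by GPU kernels or by CPU-side input and embedding ops, and variation from step to step there usually explains utilization swings.

```python
import tensorflow as tf
from tensorflow.python.client import timeline

# Dummy op standing in for the real train_op.
a = tf.random_normal([2048, 2048])
b = tf.random_normal([2048, 2048])
step = tf.matmul(a, b)

# Request a full trace of one session.run call.
run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()

with tf.Session() as sess:
    sess.run(step, options=run_options, run_metadata=run_metadata)

# Write a Chrome-trace timeline for inspection in chrome://tracing.
trace = timeline.Timeline(step_stats=run_metadata.step_stats)
with open("timeline.json", "w") as f:
    f.write(trace.generate_chrome_trace_format())
```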
According to the TF documentation:

"Another simple way to check if a GPU is underutilized is to run watch nvidia-smi, and if GPU utilization is not approaching 100% then the GPU is not getting data fast enough."

It seems that there is a problem with the input queues. @dennybritz, is there any chance you could check this on a single GPU?
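For the record, a rough utilization logger (assuming nvidia-smi is on the PATH; the 60-sample window is arbitrary) that samples GPU utilization once per second, which makes the 5-90% swings mentioned above easy to log over a full run instead of watching nvidia-smi by eye:

```python
import subprocess
import time

# Poll GPU utilization once per second for 60 samples.
for _ in range(60):
    out = subprocess.check_output(
        ["nvidia-smi",
         "--query-gpu=utilization.gpu",
         "--format=csv,noheader,nounits"])
    print("GPU util: %s%%" % out.strip().decode("utf-8"))
    time.sleep(1)
```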
I had the same problem with seq2seq with an attention mechanism. I tried it on a 1060 and a P100 and saw no performance difference. It looks like the algorithm does not make full use of the CUDA cores.