seq2seq
GPU utilization
I'm training a standard NMT model on a single GeForce GTX 1080 Ti (11 GB).
While the model is training, nvidia-smi shows that volatile GPU utilization stays below 25% while all CPU cores are busy. Why?
My guess is that the program spends most of its time on communication between the CPU and the GPU, which is pretty bad (see the input-throughput sketch below, after my configuration).
Has anybody else run into this issue?
The following is my configuration:
python3 -m bin.train \
  --config_paths="
      ./example_configs/nmt_small.yml,
      ./example_configs/train_seq2seq.yml,
      ./example_configs/text_metrics_bpe.yml" \
  --model_params "
      vocab_source: $VOCAB_SOURCE
      vocab_target: $VOCAB_TARGET
      source.max_seq_len: 5
      target.max_seq_len: 30" \
  --input_pipeline_train "
    class: ParallelTextInputPipeline
    params:
      source_files:
        - $TRAIN_SOURCES
      target_files:
        - $TRAIN_TARGETS" \
  --input_pipeline_dev "
    class: ParallelTextInputPipeline
    params:
      source_files:
        - $DEV_SOURCES
      target_files:
        - $DEV_TARGETS" \
  --batch_size 128 \
  --train_steps $TRAIN_STEPS \
  --output_dir $MODEL_DIR \
  --buckets 5,10 \
  --save_checkpoints_steps 900 \
  --eval_every_n_steps 1000
Also, I'm using TF 1.1 compiled from source, CUDA 8, and cuDNN 5.
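In case it helps, here is the input-throughput sketch I mentioned above: a minimal, standalone TF 1.x queue-based reader that only times how fast batches can be produced. It does not use the project's ParallelTextInputPipeline; the file name and reader-thread count are placeholders, and the batch size just mirrors my run. If this input-only throughput is much higher than the training steps/sec, the bottleneck is the model or session; if it is similar, the input pipeline is starving the GPU.

```python
import time
import tensorflow as tf

# Hypothetical source file; substitute your own $TRAIN_SOURCES path.
source_files = ["train.sources.txt"]

# Queue-based line reader, as in TF 1.x input pipelines.
filename_queue = tf.train.string_input_producer(source_files)
reader = tf.TextLineReader()
_, line = reader.read(filename_queue)
# Same batch size as the training run; 4 reader threads are a guess.
batch = tf.train.batch([line], batch_size=128, num_threads=4, capacity=1280)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    start = time.time()
    for _ in range(100):
        sess.run(batch)  # pull batches only, no model computation
    print("input-only throughput: %.1f batches/sec" % (100.0 / (time.time() - start)))
    coord.request_stop()
    coord.join(threads)
```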
@dennybritz Is it possible that the current implementation is optimized for a distributed multi-GPU environment, as in your experiments? If so, are there any tips for better utilization on a single GPU?
Which data is this on? Toy data or NMT data? I'm not too surprised that this is true for nmt_small.yml
as this really is a small model that may not make full use of the GPU. Also, what is the vocabulary size?
I have my own dataset; the vocabulary size is 10,000 BPE tokens.
Switching to the nmt_large model does not solve the problem: with the large model, GPU utilization stays under 30-35%.
Enlarging the model further (increasing the vocabulary size to 40K, adding more buckets, and increasing the sequence lengths) leads to better GPU utilization (80-90%). Thank you. But why does GPU utilization fluctuate between 5% and 90% during training? I'm using a similar model in Theano, and there GPU utilization is always at 100%. Is something different in TF?
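To see where the idle gaps in a step come from, a generic TF 1.x tracing sketch may help. It is not wired into bin.train; the 2048x2048 matmul is just a stand-in for the real training op. Running one step with full tracing and loading the resulting timeline.json in chrome://tracing shows whether that step is dominated by GPU kernels or by CPU-side input and embedding ops, and variation from step to step there usually explains utilization swings.

```python
import tensorflow as tf
from tensorflow.python.client import timeline

# Dummy op standing in for the real train_op.
a = tf.random_normal([2048, 2048])
b = tf.random_normal([2048, 2048])
step = tf.matmul(a, b)

# Request a full trace of one session.run call.
run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()

with tf.Session() as sess:
    sess.run(step, options=run_options, run_metadata=run_metadata)

# Write a Chrome-trace timeline for inspection in chrome://tracing.
trace = timeline.Timeline(step_stats=run_metadata.step_stats)
with open("timeline.json", "w") as f:
    f.write(trace.generate_chrome_trace_format())
```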
According to the TF documentation:

"Another simple way to check if a GPU is underutilized is to run watch nvidia-smi, and if GPU utilization is not approaching 100% then the GPU is not getting data fast enough."

It seems that there is a problem with the input queues. @dennybritz, is there any chance you could check this on a single GPU?
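For the record, a rough utilization logger (assuming nvidia-smi is on the PATH; the 60-sample window is arbitrary) that samples GPU utilization once per second, which makes the 5-90% swings mentioned above easy to log over a full run instead of watching nvidia-smi by eye:

```python
import subprocess
import time

# Poll GPU utilization once per second for 60 samples.
for _ in range(60):
    out = subprocess.check_output(
        ["nvidia-smi",
         "--query-gpu=utilization.gpu",
         "--format=csv,noheader,nounits"])
    print("GPU util: %s%%" % out.strip().decode("utf-8"))
    time.sleep(1)
```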
I had the same problem with seq2seq with an attention mechanism. I tried it on a 1060 and a P100 and saw no performance difference. It looks like the algorithm does not make full use of the CUDA cores.