
Multiple GPUs

Open jroakes opened this issue 8 years ago • 6 comments

Training takes some time at 40 time steps. I'm interested in seeing whether this can run on multiple GPUs on an AWS P2 instance. Do you have any sense of how complicated that would be to implement with the current code?

jroakes avatar Oct 31 '16 11:10 jroakes

For now, the program does not support multi-GPU training. Here is an example of multi-GPU training if you need to implement it: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/models/image/cifar10/cifar10_multi_gpu_train.py
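
To make the idea behind that kind of example concrete, here is a minimal sketch of data-parallel training in plain Python, with no TensorFlow involved. It assumes a toy linear model and simulates each GPU as one shard of the batch; the function names (`gradient`, `split`, `data_parallel_gradient`) are hypothetical and not part of DeepQA or TensorFlow. Each simulated device computes gradients on its own shard, and the results are averaged before the shared weights would be updated.

```python
def gradient(w, b, batch):
    """Mean-squared-error gradient of y = w*x + b over one batch."""
    n = len(batch)
    gw = sum(2 * (w * x + b - y) * x for x, y in batch) / n
    gb = sum(2 * (w * x + b - y) for x, y in batch) / n
    return gw, gb


def split(batch, n_devices):
    """Split a batch into equal shards, one per simulated device.

    Assumes len(batch) is divisible by n_devices.
    """
    size = len(batch) // n_devices
    return [batch[i * size:(i + 1) * size] for i in range(n_devices)]


def data_parallel_gradient(w, b, batch, n_devices):
    """Compute one gradient per shard, then average across devices."""
    grads = [gradient(w, b, shard) for shard in split(batch, n_devices)]
    gw = sum(g[0] for g in grads) / n_devices
    gb = sum(g[1] for g in grads) / n_devices
    return gw, gb


# Toy data drawn from y = 3x + 1 (illustrative values only).
batch = [(float(x), 3.0 * x + 1.0) for x in range(8)]
single = gradient(0.5, 0.0, batch)
multi = data_parallel_gradient(0.5, 0.0, batch, n_devices=4)
print(single, multi)
```

With equal-sized shards, the averaged gradient matches the single-device gradient on the full batch, which is why this scheme leaves the training result unchanged while spreading the work.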

Conchylicultor avatar Oct 31 '16 15:10 Conchylicultor

Unfortunately, it seems distributed seq2seq is not straightforward. The only approach seems to be splitting the LSTM hidden layers across GPUs, but doing so can leave one GPU waiting for the result of the previous one, so the pipeline does not really get faster.
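
How much waiting there is depends on the schedule. As a rough, idealized sketch (assuming one GPU per layer, every LSTM cell costing one time unit, and free transfers between GPUs, none of which holds exactly in practice), the dependency structure can be simulated in plain Python. The function name `wavefront_finish_time` and the layer count are hypothetical; the 40 time steps come from the original question.

```python
def wavefront_finish_time(n_layers, n_steps):
    """Forward-pass finish time with one simulated device per layer.

    Cell (layer, t) can start only after cell (layer - 1, t) on the
    previous device and cell (layer, t - 1) on the same device are done.
    Every cell is assumed to cost one time unit.
    """
    finish = [[0] * n_steps for _ in range(n_layers)]
    for layer in range(n_layers):
        for t in range(n_steps):
            below = finish[layer - 1][t] if layer > 0 else 0
            prev = finish[layer][t - 1] if t > 0 else 0
            finish[layer][t] = max(below, prev) + 1
    return finish[-1][-1]


n_layers, n_steps = 2, 40  # layer count is an illustrative assumption
single_device = n_layers * n_steps  # all cells run one after another
pipelined = wavefront_finish_time(n_layers, n_steps)
print(single_device, pipelined)
```

In this toy model each device sits idle while the first results propagate up and again while the last ones drain, and real transfer costs and the backward pass would add further waiting.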

There is a long discussion on the TensorFlow Google group, but no clear solution. If someone has ideas or papers to refer to, please share. It would be nice to speed up training on large corpora like OpenSubtitles.

https://groups.google.com/a/tensorflow.org/d/msg/discuss/idWem3kxsqE/kwK_KAfHAwAJ

eschnou avatar Dec 19 '16 20:12 eschnou

Here is another interesting link on the subject, comparing data parallelism and model parallelism for seq2seq. It seems model parallelism is really the way to go, but it is challenging to implement in TensorFlow (see the comment above).

http://www.linhaibin.com/mxnet/

eschnou avatar Dec 20 '16 20:12 eschnou

There will be some major changes to the seq2seq API with TensorFlow 1.0. If multi-GPU support is implemented someday, I don't think it would be pertinent to add it before the TF 1.0 release. As you say, it's apparently non-trivial (also, the first link you posted isn't working for me).

Conchylicultor avatar Dec 20 '16 23:12 Conchylicultor

Wrong copy paste. Link is updated.

eschnou avatar Dec 21 '16 07:12 eschnou

I wonder how tensorflow-gpu uses the GPU after I uninstall tensorflow (1.2.0). In my environment, tensorflow 1.2.0 has been removed and only tensorflow-gpu is left, but the program now fails with this error:

PS C:\Users\Administrator\PycharmProjects\DeepQA> python3 .\main.py
Welcome to DeepQA v0.1 !

Traceback (most recent call last):
  File ".\main.py", line 29, in <module>
    chatbot.main()
  File "C:\Users\Administrator\PycharmProjects\DeepQA\chatbot\chatbot.py", line 145, in main
    print('TensorFlow detected: v{}'.format(tf.__version__))
AttributeError: module 'tensorflow' has no attribute '__version__'

Obviously, DeepQA runs into a problem without the tensorflow package. I assumed tensorflow-gpu was a substitute for tensorflow, but it is not working now. Could anyone help? Thanks.
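
One plausible cause, stated as an assumption rather than a confirmed diagnosis: the `tensorflow` and `tensorflow-gpu` packages install into the same `tensorflow` directory, so uninstalling one can delete files the other needs and leave an incomplete package that still imports but has no `__version__`. A minimal, standard-library-only sketch for checking how the module resolves (the helper name `describe_module` is hypothetical):

```python
import importlib.util


def describe_module(name):
    """Return a short diagnosis of how `name` resolves on the import path."""
    try:
        spec = importlib.util.find_spec(name)
    except (ImportError, ValueError):
        spec = None
    if spec is None:
        return "missing"
    # A leftover directory without __init__.py resolves as a namespace
    # package: its origin is None (or "namespace" on older Pythons).
    if spec.origin is None or spec.origin == "namespace":
        return "namespace-only"
    return "ok"


print("tensorflow:", describe_module("tensorflow"))
```

If this reports `namespace-only`, uninstalling `tensorflow-gpu` and installing it again with pip would be the first thing to try.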

wiwengweng avatar Jan 15 '18 08:01 wiwengweng