neuralmonkey
Multi-GPU support
Hi, is there a branch where you already started working on this?
Hi, as far as I know, there is no branch dedicated to this issue yet.
I was looking at possible solutions to this problem and this one seemed like a good fit: https://www.tensorflow.org/tutorials/deep_cnn#training_a_model_using_multiple_gpu_cards
Basically, we add an additional option to the [tf_manager] (or maybe [main]) section specifying which GPU devices are available (it would be even better if we could detect them from CUDA_VISIBLE_DEVICES) and create separate graph operations for each GPU device (possibly just by modifying decorators).
The variables would be stored either on the CPU or on one of the GPUs (this should also be specified by a config option). This can probably be done by specifying the PS device on the whole graph. The device for graph operations would then be overridden in the specific sections of code (again, hopefully just by modifying decorators). Also, some changes to the way we update variables will be needed.
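For illustration, the tutorial's tower-style setup boils down to something like the following sketch (plain TF, not existing Neural Monkey code; build_tower, batches and the device helper are just placeholders):

import os
import tensorflow as tf

# Number of GPUs visible to the process; after CUDA_VISIBLE_DEVICES is
# applied, TF renumbers the devices from 0, so plain indices are enough.
n_gpus = len(os.environ.get("CUDA_VISIBLE_DEVICES", "0").split(","))

def tower_device(gpu_index):
    def _assign(op):
        # Keep variables on the CPU (the "PS device"); everything else
        # goes to this tower's GPU.
        if op.type in ("Variable", "VariableV2", "VarHandleOp"):
            return "/cpu:0"
        return "/gpu:{}".format(gpu_index)
    return _assign

def average_gradients(tower_grads):
    # Average the per-tower gradients, variable by variable.
    averaged = []
    for grads_and_var in zip(*tower_grads):
        grads = tf.stack([g for g, _ in grads_and_var])
        averaged.append((tf.reduce_mean(grads, axis=0), grads_and_var[0][1]))
    return averaged

optimizer = tf.train.AdamOptimizer()
tower_grads = []
for i in range(n_gpus):
    with tf.device(tower_device(i)), tf.variable_scope("model", reuse=(i > 0)):
        # build_tower / batches are placeholders for the per-GPU graph parts.
        loss = build_tower(batches[i])
        tower_grads.append(optimizer.compute_gradients(loss))

train_op = optimizer.apply_gradients(average_gradients(tower_grads))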
This is only a multi-GPU solution; support for fully distributed computing would probably require some more work. But the multi-GPU solution should be a good starting point.
Have a look, e.g., at this tutorial or the TF documentation. I think it looks a little bit better because the graphs run in separate processes, so they can even run on separate machines. They probably communicate using protocol buffers, so there might be some communication overhead.
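For reference, a rough sketch of such a between-graph setup in plain TF (host names, ports and build_model are made up):

import sys
import tensorflow as tf

# One process per cluster entry; each process is started with its job name
# and task index on the command line.
cluster = tf.train.ClusterSpec({
    "ps": ["machine-a:2222"],                        # holds the variables
    "worker": ["machine-a:2223", "machine-b:2222"],  # run the actual training
})

job_name, task_index = sys.argv[1], int(sys.argv[2])
server = tf.train.Server(cluster, job_name=job_name, task_index=task_index)

if job_name == "ps":
    server.join()  # the parameter server only serves variable reads/writes
else:
    # Variables are placed on the ps job, operations stay on this worker.
    with tf.device(tf.train.replica_device_setter(
            worker_device="/job:worker/task:{}".format(task_index),
            cluster=cluster)):
        loss = build_model()  # placeholder for the actual model graph
        train_op = tf.train.AdamOptimizer().minimize(loss)

    sv = tf.train.Supervisor(is_chief=(task_index == 0), logdir="/tmp/experiment")
    with sv.managed_session(server.target) as sess:
        while not sv.should_stop():
            sess.run(train_op)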
@varisd are you willing to look into this? It would be great if we finally had this.
I have already assigned the issue to myself and I plan to work on it this week (and if necessary the following weeks).
I am currently swamped by other issues (mainly debugging the ensembles branch), so I am putting this on hold. I created a branch 'multigpu' for this issue and committed my initial changes.
Mostafa H wants to help out with this, so he will keep us updated (hopefully via this thread).
Hi, so the main issue seems to be that 'tf.train.Supervisor' freezes the graph, so any later modifications, such as those in 'runtime_loss' in the decoder, fail with: RuntimeError: Graph is finalized and cannot be modified.
Yes, that's the problem I ran into. The reason why this happens is either:
- lazy building of the computation graph - we are trying to build the graph long after the tf_manager has been initialized (and tf.train.Supervisor has already frozen the graph)
- "incorrect" order of model/config building - the tf_manager is, again, initialized before the computation graph is built
I guess we need to move the tf_manager.init_supervisors() call out of the tf_manager.init(). Probably to the runner/training_loop? However, the problem might be somewhere else.
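In plain TF terms, the ordering that change would be aiming for is roughly this (build_full_graph and make_train_op are just placeholders for whatever builds the model parts and the update ops):

import tensorflow as tf

# 1. Build the complete computation graph first: all model parts, losses,
#    runners and summaries have to exist at this point.
loss = build_full_graph()        # placeholder for building the model parts
train_op = make_train_op(loss)   # placeholder for the optimizer/update ops

# 2. Only then create the Supervisor and ask it for a session; from here on
#    the graph cannot be modified anymore.
sv = tf.train.Supervisor(logdir="/tmp/experiment")
with sv.managed_session() as sess:
    while not sv.should_stop():
        sess.run(train_op)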
Yeah, I think the exact part which freezes the graph is this:
When tf_manager.initialize_model_parts is called in learning_utils, it calls tf_manager.get_sessions(), which calls sv.prepare_or_wait_for_session, and this is what freezes it, I think, not the tf_manager.init_supervisors().
So I think any call to tf_manager.get_sessions() freezes the graph, even this one:
tb_writer = tf.summary.FileWriter(
    log_directory, tf_manager.get_sessions()[0].graph)
I'm not sure how to avoid that.
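For reference, here is a minimal stand-alone reproduction of the behaviour (plain TF, no Neural Monkey involved), which might help with pinning down exactly which call finalizes the graph:

import tensorflow as tf

x = tf.Variable(0.0, name="x")

sv = tf.train.Supervisor(logdir=None)
sess = sv.prepare_or_wait_for_session()

# By this point the default graph has been finalized, so adding any new
# operation (e.g. the runtime_loss mentioned above) fails:
y = x + 1.0  # RuntimeError: Graph is finalized and cannot be modified.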