
Multi-GPU support

Open jindrahelcl opened this issue 7 years ago • 10 comments

jindrahelcl avatar Jul 07 '17 17:07 jindrahelcl

Hi, is there a branch where you already started working on this?

StoyanVenDimitrov avatar Sep 10 '17 12:09 StoyanVenDimitrov

Hi, as far as I know, there is no branch dedicated to this issue yet.

varisd avatar Oct 04 '17 13:10 varisd

I was looking into possible solutions to this problem, and this approach seemed promising: https://www.tensorflow.org/tutorials/deep_cnn#training_a_model_using_multiple_gpu_cards

Basically, we would add an option to the [tf_manager] (or maybe [main]) section specifying which GPU devices are available (it would be even better if we could detect them from CUDA_VISIBLE_DEVICES) and create separate graph operations for each GPU device (possibly just by modifying decorators).

The variables would be stored either on the CPU or on one of the GPUs (this should also be specified by a config option). This can probably be done by setting the PS device for the whole graph. The device for graph operations would then be overridden in the relevant sections of code (again, hopefully just by modifying decorators). Also, some changes to the way we update variables will be needed.

This is only a multi-GPU solution; support for fully distributed computing would probably require more work. But the multi-GPU solution should be a good starting point. A rough sketch of the tower pattern is below.
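
A minimal sketch of the tower pattern from that tutorial, assuming TF 1.x. The toy loss, the hard-coded num_gpus, and the use of replica_device_setter are illustrative, not actual neuralmonkey code:

import tensorflow as tf

num_gpus = 2  # could instead be derived from CUDA_VISIBLE_DEVICES

inputs = [tf.random_normal([32, 10]) for _ in range(num_gpus)]  # toy batches
optimizer = tf.train.GradientDescentOptimizer(0.1)

tower_grads = []
for i in range(num_gpus):
    # Variables land on the CPU (the "PS device"); all other ops on GPU i.
    device_fn = tf.train.replica_device_setter(
        ps_tasks=1, ps_device="/cpu:0", worker_device="/gpu:%d" % i)
    with tf.device(device_fn), tf.variable_scope("model", reuse=(i > 0)):
        w = tf.get_variable("w", [10, 1])
        loss = tf.reduce_mean(tf.square(tf.matmul(inputs[i], w)))
        tower_grads.append(optimizer.compute_gradients(loss))

# Average the per-tower gradients and apply a single update to the shared
# (CPU-resident) variables.
averaged = []
for gvs in zip(*tower_grads):
    grads = [g for g, _ in gvs if g is not None]
    averaged.append((tf.add_n(grads) / len(grads), gvs[0][1]))
train_op = optimizer.apply_gradients(averaged)

With ps_tasks=1 and ps_device="/cpu:0", variable ops are pinned to the CPU while each tower's compute ops go to its GPU, which is the PS-device behavior described above.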

varisd avatar Oct 10 '17 09:10 varisd

Have a look, e.g., at this tutorial or the TF documentation on distributed training. I think it looks a little better because the graphs run in separate processes, so they can even run on separate machines. They probably communicate using protocol buffers, so there might be some communication overhead.
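
For reference, the between-graph setup looks roughly like this (a sketch assuming TF 1.x; the host names, ports, and task indices are placeholders):

import tensorflow as tf

# Each process runs with its own job_name/task_index; the jobs exchange
# tensors over gRPC.
cluster = tf.train.ClusterSpec({
    "ps": ["machine-a:2222"],
    "worker": ["machine-a:2223", "machine-b:2222"],
})
server = tf.train.Server(cluster, job_name="worker", task_index=0)

# replica_device_setter places variables on the ps job, ops on this worker.
with tf.device(tf.train.replica_device_setter(
        worker_device="/job:worker/task:0", cluster=cluster)):
    w = tf.get_variable("w", [10, 1])
    # ... rest of the model would be built here ...

with tf.Session(server.target) as sess:
    sess.run(tf.global_variables_initializer())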

jlibovicky avatar Oct 10 '17 11:10 jlibovicky

@varisd are you willing to look into this? It would be great if we finally had this.

jindrahelcl avatar Oct 10 '17 12:10 jindrahelcl

I have already assigned the issue to myself and I plan to work on it this week (and if necessary the following weeks).

varisd avatar Oct 10 '17 12:10 varisd

I am currently swamped by other issues (mainly debugging the ensembles branch), so I am putting this on hold. I created a branch 'multigpu' for this issue and committed my initial changes.

Mostafa H wants to help out with this, so he will keep us updated (hopefully via this thread).

varisd avatar Oct 23 '17 14:10 varisd

Hi, so the main issue seems to be that 'tf.train.Supervisor' freezes the graph; any later modifications, such as those in 'runtime_loss' in the decoder, then cause this: RuntimeError: Graph is finalized and cannot be modified.
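
A minimal repro outside neuralmonkey (assuming TF 1.x) shows the same behavior:

import tensorflow as tf

x = tf.Variable(0, name="x")
sv = tf.train.Supervisor(logdir="/tmp/sv_demo")
with sv.managed_session() as sess:
    # The Supervisor finalized the graph while preparing the session,
    # so adding any new op now fails:
    y = x + 1  # RuntimeError: Graph is finalized and cannot be modified.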

mhany90 avatar Oct 29 '17 17:10 mhany90

Yes, that's the problem I ran into. The reason why this happens is either:

  • lazy building of the computation graph - we are trying to build the graph long after the tf_manager has been initialized (and tf.train.Supervisor has already frozen the graph)
  • "incorrect" order of model/config building - the tf_manager is again initialized before the computation graph is built

I guess we need to move the tf_manager.init_supervisors() call out of tf_manager.init(), probably to the runner/training_loop. However, the problem might also be somewhere else. The general principle is sketched below.
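
In plain TF 1.x terms (not neuralmonkey code), the idea is simply to finish building the graph before the Supervisor prepares a session:

import tensorflow as tf

# Build the complete graph first...
x = tf.Variable(0, name="x")
y = x + 1

# ...and only create the Supervisor afterwards, so finalization is harmless.
sv = tf.train.Supervisor(logdir="/tmp/sv_demo")
with sv.managed_session() as sess:
    print(sess.run(y))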

varisd avatar Oct 31 '17 09:10 varisd

Yeah, I think the exact part which freezes the graph is this:

When tf_manager.initialize_model_parts is called in learning_utils, it calls tf_manager.get_sessions(), which calls sv.prepare_or_wait_for_session, and this is what freezes it, I think, not the tf_manager.init_supervisors().

So then, I think any call to tf_manager.get_sessions() freezes the graph, even including this one:

tb_writer = tf.summary.FileWriter(
    log_directory, tf_manager.get_sessions()[0].graph)

I'm not sure how to avoid that.
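
One possible way to sidestep this particular call site (an untested suggestion, not a full fix) would be to hand the default graph to the writer directly, without asking tf_manager for a session:

tb_writer = tf.summary.FileWriter(log_directory, tf.get_default_graph())

That only avoids this one call, of course; any other get_sessions() call would still finalize the graph.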

mhany90 avatar Nov 02 '17 15:11 mhany90