
Error with make_parallel function

Open AgrawalAmey opened this issue 7 years ago • 3 comments

I got the following error while trying to use the make_parallel function:

Traceback (most recent call last):
  File "model_language2motion.py", line 1335, in <module>
    main(parser.parse_args())
  File "model_language2motion.py", line 1202, in main
    args.func(args)
  File "model_language2motion.py", line 723, in train
    train_data, valid_data, model, optimizer = prepare_for_training(output_path, args)
  File "model_language2motion.py", line 677, in prepare_for_training
    model = make_parallel(model, 8)
  File "/workspace/deepAnim/make_parallel.py", line 31, in make_parallel
    outputs = model(inputs)
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/topology.py", line 572, in __call__
    self.add_inbound_node(inbound_layers, node_indices, tensor_indices)
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/topology.py", line 635, in add_inbound_node
    Node.create_node(self, inbound_layers, node_indices, tensor_indices)
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/topology.py", line 172, in create_node
    output_tensors = to_list(outbound_layer.call(input_tensors, mask=input_masks))
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/topology.py", line 2247, in call
    output_tensors, output_masks, output_shapes = self.run_internal_graph(inputs, masks)
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/topology.py", line 2390, in run_internal_graph
    computed_mask))
  File "/usr/local/lib/python2.7/dist-packages/keras/layers/recurrent.py", line 235, in call
    constants = self.get_constants(x)
  File "/usr/local/lib/python2.7/dist-packages/keras/layers/recurrent.py", line 884, in get_constants
    ones = K.tile(ones, (1, int(input_dim)))
TypeError: int() argument must be a string or a number, not 'NoneType'

PS: The code works if the call to make_parallel is removed.
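For context, the wrapping looks roughly like the sketch below, reconstructed from the traceback using the Keras 1.x functional API. The shapes and layer sizes are placeholders, not the actual model_language2motion.py network, and this sketch is not guaranteed to reproduce the failure; it only shows where make_parallel sits in the pipeline.

```python
from keras.models import Model
from keras.layers import Input, LSTM, Dense

from make_parallel import make_parallel  # the keras-extras helper, copied into the project per the traceback

# Placeholder shapes; the real model uses different dimensions.
timesteps, features = 20, 64
inp = Input(shape=(timesteps, features))
x = LSTM(128)(inp)
out = Dense(10, activation='softmax')(x)
model = Model(input=inp, output=out)

# The call from the traceback: make_parallel.py (line 31) re-invokes the model
# on per-GPU input slices, and the recurrent layer's input_dim comes back as None.
# Removing this line makes the model train normally, per the PS above.
model = make_parallel(model, 8)
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
```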

AgrawalAmey avatar Jun 27 '17 11:06 AgrawalAmey

My two cents on the problem. I've been working on a g2.8xlarge as well and came across a similar issue. I managed to work around it by making the total number of samples divisible by the batch size. If you are multiplying your per-GPU batch size by the number of GPUs, then your sample count must be divisible by that effective batch size. For example, if you have 257000 samples and a per-GPU batch of 16 (128 for 8 GPUs), then pass the model a slice of 256000. I'm not sure if this is your case. Let me know how it goes.
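A minimal sketch of that workaround, assuming you trim the arrays yourself before calling fit (the array shapes here are arbitrary; only the trimming arithmetic matters):

```python
import numpy as np

gpus = 8
batch_per_gpu = 16
effective_batch = gpus * batch_per_gpu  # 128 samples consumed per parallel step

# Dummy data standing in for the real training set (257000 samples, as above).
x_train = np.random.rand(257000, 8).astype('float32')
y_train = np.random.rand(257000, 2).astype('float32')

# Trim to the largest multiple of the effective batch size so every batch can be
# split evenly across the GPU towers (257000 -> 256896 here; the comment above
# simply rounds further down to 256000).
n = (len(x_train) // effective_batch) * effective_batch
x_train, y_train = x_train[:n], y_train[:n]
```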

ChristianLagares avatar Jun 30 '17 09:06 ChristianLagares

My initial conclusion was wrong. I had been running different configurations on g2.8xlarge and p2.8xlarge so that the model could fit on the smaller K520 cards. But strangely, the issue seems to be somehow related to batch normalisation: the model works only when I use batch normalisation. I haven't figured out exactly how they are related yet.

AgrawalAmey avatar Jun 30 '17 10:06 AgrawalAmey

The one with the batch-norm works; the other doesn't. Digging further, it seems that only the first batch normalisation layer matters for the network to work. I tried to replace it with a linear activation layer, but that did not work either.

(Two screenshots attached, 2017-06-30 18-04-32 and 2017-06-30 18-04-52, showing the two model configurations.)
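As a rough, hypothetical stand-in for the screenshots, the comparison being described is along these lines (placeholder shapes and layer sizes, not the real network):

```python
from keras.models import Sequential
from keras.layers import Activation, BatchNormalization, Dense, LSTM

timesteps, features = 20, 64  # placeholder dimensions

# Variant that survives make_parallel: batch normalisation as the first layer.
with_bn = Sequential()
with_bn.add(BatchNormalization(input_shape=(timesteps, features)))
with_bn.add(LSTM(128))
with_bn.add(Dense(10, activation='softmax'))

# Variant that hits the TypeError: same network, with the leading batch-norm
# swapped for a linear activation (which, per the comment above, still fails,
# as does removing the layer entirely).
no_bn = Sequential()
no_bn.add(Activation('linear', input_shape=(timesteps, features)))
no_bn.add(LSTM(128))
no_bn.add(Dense(10, activation='softmax'))
```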

AgrawalAmey avatar Jun 30 '17 12:06 AgrawalAmey