keras-extras
Error with make_parallel function
I got the following error while trying to use the make_parallel function:
Traceback (most recent call last):
  File "model_language2motion.py", line 1335, in <module>
    main(parser.parse_args())
  File "model_language2motion.py", line 1202, in main
    args.func(args)
  File "model_language2motion.py", line 723, in train
    train_data, valid_data, model, optimizer = prepare_for_training(output_path, args)
  File "model_language2motion.py", line 677, in prepare_for_training
    model = make_parallel(model, 8)
  File "/workspace/deepAnim/make_parallel.py", line 31, in make_parallel
    outputs = model(inputs)
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/topology.py", line 572, in __call__
    self.add_inbound_node(inbound_layers, node_indices, tensor_indices)
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/topology.py", line 635, in add_inbound_node
    Node.create_node(self, inbound_layers, node_indices, tensor_indices)
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/topology.py", line 172, in create_node
    output_tensors = to_list(outbound_layer.call(input_tensors, mask=input_masks))
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/topology.py", line 2247, in call
    output_tensors, output_masks, output_shapes = self.run_internal_graph(inputs, masks)
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/topology.py", line 2390, in run_internal_graph
    computed_mask))
  File "/usr/local/lib/python2.7/dist-packages/keras/layers/recurrent.py", line 235, in call
    constants = self.get_constants(x)
  File "/usr/local/lib/python2.7/dist-packages/keras/layers/recurrent.py", line 884, in get_constants
    ones = K.tile(ones, (1, int(input_dim)))
TypeError: int() argument must be a string or a number, not 'NoneType'
PS: The code works if the call to make_parallel is removed.
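For anyone hitting the same trace: the `int(input_dim)` that blows up lives in the recurrent layer's `get_constants`, and `input_dim` appears to be read from the static (graph-inferred) shape of the tensor the layer is actually called on. Below is a minimal sketch (TensorFlow 1.x style; the shapes, the two-way split and the `set_shape` workaround are my own assumptions, not code from this repository) of how slicing the batch with a dynamic `size`, the way make_parallel's `get_slice` helper does, can leave the sliced tensor with an all-`None` static shape:

```python
import tensorflow as tf

parts, idx = 2, 0  # split the batch into two slices; take the first one

# Hypothetical input: batch x 30 timesteps x 72 features (shapes invented).
x = tf.placeholder(tf.float32, shape=(None, 30, 72))
print(x.get_shape())        # (?, 30, 72)

# Dynamic-size slice of the batch dimension, in the style of get_slice.
shape = tf.shape(x)
size = tf.concat([shape[:1] // parts, shape[1:]], axis=0)
stride = tf.concat([shape[:1] // parts, shape[1:] * 0], axis=0)
sliced = tf.slice(x, stride * idx, size)
print(sliced.get_shape())   # (?, ?, ?) -- the static feature dim is lost,
                            # which is the None that int(input_dim) chokes on

# One way to hand the static shape back to downstream Keras layers,
# e.g. inside the slicing Lambda:
sliced.set_shape((None, 30, 72))
print(sliced.get_shape())   # (?, 30, 72)
```

If that is indeed the cause, restoring the static shape of the sliced inputs should let the recurrent layer's dropout path compute `input_dim` again.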
My two cents on the problem: I've been working on a g2.8xlarge as well and came across a similar issue. I managed to work around it by making the total number of samples divisible by the batch size. If you are multiplying your batch size by the number of GPUs, then your sample count must be divisible by that effective batch size. For example, if you have 257,000 samples and a per-GPU batch of 16 (128 for 8 GPUs), then pass the model a slice of 256,000. I'm not sure if this is your case. Let me know how it goes.
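To make that workaround concrete, here is a small sketch (the arrays and shapes are made up) that trims the training set to a multiple of the effective batch size before handing it to the parallelised model:

```python
import numpy as np

gpu_count = 8
batch_per_gpu = 16
effective_batch = gpu_count * batch_per_gpu      # 128, as in the example above

# Hypothetical training data: 257000 samples of 72 features each.
x_train = np.random.rand(257000, 72)
y_train = np.random.rand(257000, 1)

# Trim to the largest multiple of the effective batch size (256896 here;
# any smaller multiple, such as the 256000 mentioned above, works too).
usable = (len(x_train) // effective_batch) * effective_batch
x_train, y_train = x_train[:usable], y_train[:usable]
print(x_train.shape)   # (256896, 72)
```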
My initial conclusion was wrong. I had been running different configurations on the g2.8xlarge and p2.8xlarge so that the model could fit on the smaller K520 cards. Strangely, the problem seems to be somehow related to batch normalisation: the model works only when I use batch normalisation. I haven't figured out how exactly they are related yet.
The one with batch normalisation works; the other doesn't. Digging further, it seems that only the first batch normalisation layer matters for the network to work with make_parallel. I tried replacing it with a linear activation layer, but that did not help. A sketch of the two variants follows.
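Here is a rough sketch of the comparison (Keras 1.x API; layer sizes and input shapes are invented, the real model is larger). The only difference between the two variants is whether the first layer is a BatchNormalization or the linear Activation used as a stand-in:

```python
from keras.models import Sequential
from keras.layers import Dense, Activation, GRU
from keras.layers.normalization import BatchNormalization

timesteps, features = 30, 72   # invented shapes

# Variant that trains fine when wrapped with make_parallel:
with_bn = Sequential([
    BatchNormalization(input_shape=(timesteps, features)),
    GRU(128, dropout_W=0.2),
    Dense(1),
])

# Variant that, in my runs, still fails under make_parallel: the same network
# with the first BatchNormalization swapped for a linear Activation.
without_bn = Sequential([
    Activation('linear', input_shape=(timesteps, features)),
    GRU(128, dropout_W=0.2),
    Dense(1),
])
```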