ValueError: Shapes are not compatible when training transition model after autoencoder trained

Open chqsark opened this issue 8 years ago • 26 comments

Hi,

I trained the autoencoder successfully. However, when I followed the next steps to train the transition model, I ran into the problem below. [image]

Thanks a lot for any help!

BTW, I didn't change any code.

chqsark avatar Sep 02 '16 18:09 chqsark

can someone take a look? thanks!

chqsark avatar Sep 07 '16 21:09 chqsark

Have you solved the problem? I have the same problem as you.

lxgen avatar Sep 09 '16 06:09 lxgen

this is weird, see that you have two outputs with shape (?, 64, 512) and (9, 2, 512). They should be (64, 5, 512) and (64, 9, 512). But K.rnn is messing the shapes up. I'll check what is going on.

In case you want to investigate as well, here is where the bug should be happening https://github.com/commaai/research/blob/master/models/layers.py#L359-L374

What is your keras version by the way?

EderSantana avatar Sep 13 '16 01:09 EderSantana

So, for now, the only place I can see that could be causing this problem is the consume_less RNN parameter in Keras. Try changing https://github.com/commaai/research/blob/master/models/transition.py#L41-L42 to:

model.add(DreamyRNN(output_dim=z_dim, output_length=out_leng-1, return_sequences=True,
                    activation="tanh", consume_less="not_cpu", batch_input_shape=(batch_size, time, z_dim)))

Unfortunately I can't reproduce your bug right now. But I'll give you more information as soon as I get an opportunity.

EderSantana avatar Sep 13 '16 01:09 EderSantana

@EderSantana Thanks a lot for the response. I've tried keras 1.0.6 and 1.0.8, tensorflow 0.9, and 0.10. All gave the same error. I still got the error after changing transition.py as you suggested.

I realized that comma.ai has a fork of keras. Should I use that instead of the original one? Or any specific branch of keras?

chqsark avatar Sep 13 '16 16:09 chqsark

No, I tried this code on the public Keras release. I think the problem is with the recurrent layer's consume_less parameter. But I can't test it right now :(

EderSantana avatar Sep 13 '16 16:09 EderSantana

I just tried 'cpu', 'gpu', and 'mem' for the consume_less parameter. No luck :(

My server.py output is like this:

    guan.wang@Z440SJ-243:~/ml/comma/research$ ./server.py --time 60 --batch 64
    INFO:main:server started
    INFO:dask_generator:Loading 9 hdf5 buckets.
    x 52722 | t 263583 | f 52722
    x 58993 | t 294919 | f 58993
    x 19731 | t 98719 | f 19731
    x 56166 | t 280785 | f 56166
    x 25865 | t 129344 | f 25865
    x 85296 | t 426596 | f 85296
    x 78463 | t 392182 | f 78463
    x 30538 | t 152650 | f 30538
    x 51691 | t 258571 | f 51691
    training on 436627/459465 examples
    INFO:dask_generator:camera files 9
    4296.05 ms
    X (64, 60, 3, 160, 320)
    angle (64, 60, 1)
    speed (64, 60, 1)

chqsark avatar Sep 13 '16 16:09 chqsark

@chqsark thanks for the information. I'll continue investigating.

EderSantana avatar Sep 13 '16 17:09 EderSantana

Suffering the same problem. Any thoughts?

kamal94 avatar Sep 14 '16 01:09 kamal94

I have solved the problem by changing the Keras version from 1.0.8 to 1.0.6.
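Since the fix reported here is pinning Keras to 1.0.6, a small guard at the top of a training script can catch a mismatched install early. The following is only a sketch: `parse_version` and `matches_pin` are hypothetical helper names, not part of Keras or this repo, and the pin value is simply the version reported to work in this thread.

```python
import re


def parse_version(text):
    """Turn a version string such as '1.0.6' into a tuple of ints, e.g. (1, 0, 6)."""
    return tuple(int(part) for part in re.findall(r"\d+", text)[:3])


def matches_pin(installed, pinned="1.0.6"):
    """Return True when the installed version equals the pinned one."""
    return parse_version(installed) == parse_version(pinned)


# The two versions discussed in this thread.
ok_106 = matches_pin("1.0.6")
ok_108 = matches_pin("1.0.8")
```

In a real script the argument would come from `keras.__version__`, and the script could exit with a clear message when `matches_pin` returns False.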

lxgen avatar Sep 14 '16 02:09 lxgen

I also solved it by completely removing Keras and installing version 1.0.6. Previously I tried a virtualenv with 1.0.6 and it didn't work. Maybe my package system messed it up. Now it has started running. The server side just generates the following periodically:

    Traceback (most recent call last):
      File "/home/guan.wang/ml/comma/research/dask_generator.py", line 109, in datagen
        X_batch[count] = x[i-es-time_len+1:i-es+1]
      File "/usr/lib/python2.7/dist-packages/h5py/_hl/dataset.py", line 419, in __getitem__
        selection = sel.select(self.shape, args, dsid=self.id)
      File "/usr/lib/python2.7/dist-packages/h5py/_hl/selections.py", line 91, in select
        sel[args]
      File "/usr/lib/python2.7/dist-packages/h5py/_hl/selections.py", line 258, in __getitem__
        start, count, step, scalar = _handle_simple(self.shape,args)
      File "/usr/lib/python2.7/dist-packages/h5py/_hl/selections.py", line 509, in _handle_simple
        x,y,z = _translate_slice(arg, length)
      File "/usr/lib/python2.7/dist-packages/h5py/_hl/selections.py", line 550, in _translate_slice
        raise ValueError("Reverse-order selections are not allowed")
    ValueError: Reverse-order selections are not allowed

chqsark avatar Sep 14 '16 16:09 chqsark

@EderSantana I have the same situation. After installing Keras 1.0.6 and starting the transition training, there are two kinds of errors on the server side. One is "ValueError: Reverse-order selections are not allowed":

    Traceback (most recent call last):
      File "/home/yale/research/dask_generator.py", line 109, in datagen
        X_batch[count] = x[i-es-time_len+1:i-es+1]
      File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper (/tmp/pip-4rPeHA-build/h5py/_objects.c:2684)
      File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper (/tmp/pip-4rPeHA-build/h5py/_objects.c:2642)
      File "/home/yale/anaconda2/envs/tensorflow/lib/python2.7/site-packages/h5py/_hl/dataset.py", line 462, in __getitem__
        selection = sel.select(self.shape, args, dsid=self.id)
      File "/home/yale/anaconda2/envs/tensorflow/lib/python2.7/site-packages/h5py/_hl/selections.py", line 92, in select
        sel[args]
      File "/home/yale/anaconda2/envs/tensorflow/lib/python2.7/site-packages/h5py/_hl/selections.py", line 259, in __getitem__
        start, count, step, scalar = _handle_simple(self.shape,args)
      File "/home/yale/anaconda2/envs/tensorflow/lib/python2.7/site-packages/h5py/_hl/selections.py", line 443, in _handle_simple
        x,y,z = _translate_slice(arg, length)
      File "/home/yale/anaconda2/envs/tensorflow/lib/python2.7/site-packages/h5py/_hl/selections.py", line 484, in _translate_slice
        raise ValueError("Reverse-order selections are not allowed")
    ValueError: Reverse-order selections are not allowed

The other is "could not broadcast input array from shape (5,1) into shape (60,1)":

    Traceback (most recent call last):
      File "/home/yale/research/dask_generator.py", line 112, in datagen
        angle_batch[count] = np.copy(angle[i-time_len+1:i+1])[:, None]
    ValueError: could not broadcast input array from shape (5,1) into shape (60,1)
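Both tracebacks point at window slices of the form `[i-time_len+1:i+1]`. One way such errors can arise, offered here only as a guess and not a confirmed diagnosis of dask_generator.py, is an index `i` whose window does not fit inside the data: a window that runs past the end is silently truncated, and a negative start wraps around so that the start lands after the stop. The sketch below uses plain Python lists and a hypothetical helper, `window_is_valid`, to show the bounds check.

```python
def window_is_valid(i, time_len, n):
    """Return True if the slice [i - time_len + 1 : i + 1] lies fully inside 0..n."""
    start = i - time_len + 1
    stop = i + 1
    return 0 <= start and stop <= n


# Hypothetical data: 100 samples, windows of 60 frames.
angle = list(range(100))
time_len = 60

# Window runs past the end: the slice is cut short instead of failing.
i = 104
short = angle[i - time_len + 1 : i + 1]  # 55 items instead of 60

# Window starts before index 0: the negative start wraps around,
# so the effective start (45) is after the stop (5) and the slice is empty.
i = 4
wrapped = angle[i - time_len + 1 : i + 1]
```

A generator could skip any `i` for which `window_is_valid` returns False rather than trying to copy a short or empty window into a fixed-size batch slot.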

Is it related to the different Keras version?

Yale323 avatar Sep 15 '16 12:09 Yale323

@chqsark how did you completely remove keras ? Was it out of your virtualenv or under your virtualenv or conda environment ?

@EderSantana I got this issue "/home/dev-box/anaconda2/envs/python2/lib/python2.7/site-packages/tensorflow/python/framework/common_shapes.py", line 246, in conv2d_shape padding)

File "/home/dev-box/anaconda2/envs/python2/lib/python2.7/site-packages/tensorflow/python/framework/common_shapes.py", line 184, in get2d_conv_output_size (row_stride, col_stride), padding_type)

File "/home/dev-box/anaconda2/envs/python2/lib/python2.7/site-packages/tensorflow/python/framework/common_shapes.py", line 149, in get_conv_output_size "Filter: %r Input: %r" % (filter_size, input_size))

ValueError: Filter must not be larger than the input: Filter: (8, 8) Input: (3, 160)

Keras 1.0.6 and TF 0.10, Theano 0.8.2

andrewraharjo avatar Sep 19 '16 18:09 andrewraharjo

@andrewraharjo I'm seeing the same issue "ValueError: Filter must not be larger than the input: Filter: (8, 8) Input: (3, 160)" on an AWS GPU instance with TF 0.10 and Keras 1.1.0. I don't get the issue if running locally on a MacBook Pro with TF 0.9 and Keras 1.0.8. I'll try setting up the AWS instance with TF 0.9.

jamesjackson avatar Sep 23 '16 23:09 jamesjackson

@jamesjackson It seems to be related to the current TF build. Check out this link. I haven't given up on TF yet, but the way I got training started was with the Theano 0.8.2 backend and Keras 1.1.0 with cuDNN 5.1, even though cuDNN 5.0 is recommended. If you can't keep going with TF, try changing the backend in your keras.json to theano and updating your theanorc file by changing CPU to GPU. Oh, by the way, I'm not running on AWS; I use a stationary dev box.
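The keras.json change can also be scripted. This is a minimal sketch, assuming the config is a flat JSON object with a "backend" key; `with_backend` is a hypothetical helper name and the sample config text is made up for illustration.

```python
import json


def with_backend(config_text, backend):
    """Return the Keras config JSON text with its 'backend' key set to `backend`."""
    config = json.loads(config_text)
    config["backend"] = backend
    return json.dumps(config, indent=4, sort_keys=True)


# Illustrative config contents, not copied from any real machine.
example = '{"backend": "tensorflow", "floatx": "float32"}'
updated = with_backend(example, "theano")
```

The file usually lives at ~/.keras/keras.json; read it, pass the text through the helper, and write the result back. Other keys are left untouched.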

andrewraharjo avatar Sep 23 '16 23:09 andrewraharjo

Guys, if you are using the new TensorFlow and Keras, make sure to pass unroll=True as an input parameter to RNN layers. I had this problem with other layers as well.

EderSantana avatar Sep 24 '16 03:09 EderSantana

Thanks @andrewraharjo , @EderSantana

It appears to be an odd environmental issue related to the packaging and/or Anaconda. I tried several TF/Keras versions, and they all failed in the same way. Building from source and avoiding Anaconda does work.

jamesjackson avatar Sep 24 '16 15:09 jamesjackson

@jamesjackson I was thinking along those lines earlier, and I checked with my buddy, who installed with Anaconda3 and set up a virtualenv for Python 2.7. He could run with TF, and I was confused about why Anaconda2 won't work. Did you solve this problem by building from source and avoiding Anaconda?

andrewraharjo avatar Sep 24 '16 16:09 andrewraharjo

As a note, my tensorflow was installed from source as well. (but I did use anaconda)

EderSantana avatar Sep 24 '16 17:09 EderSantana

@andrewraharjo Yeah, source-based without Anaconda is working.

jamesjackson avatar Sep 24 '16 18:09 jamesjackson

@jamesjackson Yes, source-based without Anaconda +1 @andrewraharjo Yes, I completely removed keras and reinstalled the right version.

chqsark avatar Oct 04 '16 04:10 chqsark

I got the same error when trying to run the code to train the transition model.

The error on the server side:

    Traceback (most recent call last):
      File "/home/sky/research/dask_generator.py", line 112, in datagen
        angle_batch[count] = np.copy(angle[i-time_len+1:i+1])[:, None]
    ValueError: could not broadcast input array from shape (5,1) into shape (60,1)

Does anyone know a solution?

skywong1230 avatar Oct 23 '16 16:10 skywong1230

Have you run ./view_generative_model.py transition --name transition successfully?

zhaohuaqing1993 avatar Mar 20 '17 15:03 zhaohuaqing1993

    Traceback (most recent call last):
      File "/home/deep-learning/research-master/dask_generator.py", line 112, in datagen
        angle_batch[count] = np.copy(angle[i-time_len+1:i+1])[:, None]
    ValueError: could not broadcast input array from shape (55,1) into shape (60,1)

The same problem occurred~

pandamax avatar May 07 '17 09:05 pandamax

When trying to execute view_steering_model.py, I get this error: ValueError: bad marshal data (unknown type code). Here is the output from the command prompt:

    Traceback (most recent call last):
      File "C:\Users\lenovo\Anaconda3\lib\site-packages\keras\utils\generic_utils.py", line 229, in func_load
        raw_code = codecs.decode(code.encode('ascii'), 'base64')
    UnicodeEncodeError: 'ascii' codec can't encode character '\xe0' in position 46: ordinal not in range(128)

During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "view_steering_model.py", line 94, in <module>
        model = model_from_json(json.load(jfile))
      File "C:\Users\lenovo\Anaconda3\lib\site-packages\keras\models.py", line 349, in model_from_json
        return layer_module.deserialize(config, custom_objects=custom_objects)
      File "C:\Users\lenovo\Anaconda3\lib\site-packages\keras\layers\__init__.py", line 55, in deserialize
        printable_module_name='layer')
      File "C:\Users\lenovo\Anaconda3\lib\site-packages\keras\utils\generic_utils.py", line 144, in deserialize_keras_object
        list(custom_objects.items())))
      File "C:\Users\lenovo\Anaconda3\lib\site-packages\keras\models.py", line 1349, in from_config
        layer = layer_module.deserialize(conf, custom_objects=custom_objects)
      File "C:\Users\lenovo\Anaconda3\lib\site-packages\keras\layers\__init__.py", line 55, in deserialize
        printable_module_name='layer')
      File "C:\Users\lenovo\Anaconda3\lib\site-packages\keras\utils\generic_utils.py", line 144, in deserialize_keras_object
        list(custom_objects.items())))
      File "C:\Users\lenovo\Anaconda3\lib\site-packages\keras\layers\core.py", line 711, in from_config
        function = func_load(config['function'], globs=globs)
      File "C:\Users\lenovo\Anaconda3\lib\site-packages\keras\utils\generic_utils.py", line 234, in func_load
        code = marshal.loads(raw_code)
    ValueError: bad marshal data (unknown type code)
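The last frames go through func_load and marshal.loads, which suggests the model JSON stores a function as serialized bytecode. Python's marshal format is not guaranteed to be compatible between Python versions, so bytecode written by one interpreter may fail to load in another. The sketch below shows this kind of round trip inside a single interpreter; `dump_func`, `load_func`, and `scale` are hypothetical names, and this is not Keras's actual implementation.

```python
import base64
import marshal
import types


def dump_func(func):
    """Serialize a function's bytecode to an ASCII string."""
    raw = marshal.dumps(func.__code__)
    return base64.b64encode(raw).decode("ascii")


def load_func(text, globs=None):
    """Rebuild a function from a string produced by dump_func."""
    raw = base64.b64decode(text.encode("ascii"))
    code = marshal.loads(raw)  # may fail if written by a different Python version
    return types.FunctionType(code, globs if globs is not None else globals())


def scale(x):
    # Example function to serialize; maps 0..255 to -1..1.
    return x / 127.5 - 1.0


restored = load_func(dump_func(scale))
```

The round trip works here because the same interpreter writes and reads the bytecode. If the model file was saved under a different Python version than the one loading it, re-saving the model under the loading interpreter is one thing to try.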

ahmedyahia3393 avatar Apr 19 '18 22:04 ahmedyahia3393

looks like the issue is from Keras, which version are you using?

kingxueyuf avatar Apr 19 '18 22:04 kingxueyuf