vitamin_c Cannot run on custom data

When trying to train the CVAE on data I created on my own, I get the error posted below. Note that I'm trying to run the code in Python 3.7, which is officially not supported. However, I got the same error when I was trying to run the code locally in Python 3.6. On this local machine on the other hand I do not have access to a graphics card and was running it on the CPU, which is not supported for training if I remember correctly.

As it is a re-shaping error I guess it is due to the data-format I'm using and that the training data isn't quite in the correct shape. I deduced, that your code expects the training data to be of shape (number training samples, number detectors, number samples per timeseries). My custom training-data contains only the keys ['rand_pars', 'snrs', 'x_data', 'y_data_noisefree', 'y_data_noisy', 'y_normscale'] with respective shapes [(9,), (1000,3), (1000,1,9), (1000, 3, 256), (1000, 3, 256), ()] For the test data I deduced that the code expects single samples and thus the data to be of shape (number detectors, number samples per timeseries). My test set therefore contains the same keys as the training set, which have the respective shapes [(9,), (3,), (1,9), (3, 256), (3, 256), ()].

The full error message:

WARNING:tensorflow:AutoGraph could not transform <function train.<locals>.truncnorm at 0x14c61ddb7c80> and will run it as-is.
Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: Bad argument number for Name: 4, expecting 3
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert

... Training Inference Model

Traceback (most recent call last):
  File "/work/marlin.schaefer/envs/vitamin3/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/work/marlin.schaefer/envs/vitamin3/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/work/marlin.schaefer/envs/vitamin3/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Input to reshape is a tensor with 128 values, but the requested shape has 64
	 [[{{node Reshape}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/work/marlin.schaefer/projects/collab_glasgow/vitamin_b/vitamin_b/run_vitamin.py", line 1471, in <module>
    train(params,bounds,fixed_vals)
  File "/work/marlin.schaefer/projects/collab_glasgow/vitamin_b/vitamin_b/run_vitamin.py", line 888, in train
    XS_all,snrs_test) 
  File "/work/marlin.schaefer/projects/collab_glasgow/vitamin_b/vitamin_b/models/CVAE_model.py", line 734, in train
    session.run(minimize, feed_dict={bs_ph:bs, x_ph:next_x_data, y_ph:next_y_data, ramp:rmp}) 
  File "/work/marlin.schaefer/envs/vitamin3/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 958, in run
    run_metadata_ptr)
  File "/work/marlin.schaefer/envs/vitamin3/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1181, in _run
    feed_dict_tensor, options, run_metadata)
  File "/work/marlin.schaefer/envs/vitamin3/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1359, in _do_run
    run_metadata)
  File "/work/marlin.schaefer/envs/vitamin3/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Input to reshape is a tensor with 128 values, but the requested shape has 64
	 [[node Reshape (defined at /projects/collab_glasgow/vitamin_b/vitamin_b/models/CVAE_model.py:618) ]]

Original stack trace for 'Reshape':
  File "/projects/collab_glasgow/vitamin_b/vitamin_b/run_vitamin.py", line 1471, in <module>
    train(params,bounds,fixed_vals)
  File "/projects/collab_glasgow/vitamin_b/vitamin_b/run_vitamin.py", line 888, in train
    XS_all,snrs_test)
  File "/projects/collab_glasgow/vitamin_b/vitamin_b/models/CVAE_model.py", line 618, in train
    con = tf.reshape(tf.math.reciprocal(temp_var_r2_sky),[bs_ph])   # modelling wrapped scale output as log variance - only 1 concentration parameter for all sky
  File "/envs/vitamin3/lib/python3.7/site-packages/tensorflow/python/ops/array_ops.py", line 193, in reshape
    result = gen_array_ops.reshape(tensor, shape, name)
  File "/envs/vitamin3/lib/python3.7/site-packages/tensorflow/python/ops/gen_array_ops.py", line 8087, in reshape
    "Reshape", tensor=tensor, shape=shape, name=name)
  File "/envs/vitamin3/lib/python3.7/site-packages/tensorflow/python/framework/op_def_library.py", line 744, in _apply_op_helper
    attrs=attr_protos, op_def=op_def)
  File "/envs/vitamin3/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 3327, in _create_op_internal
    op_def=op_def)
  File "/envs/vitamin3/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 1791, in __init__
    self._traceback = tf_stack.extract_stack()

Sep 02 '20 12:09 MarlinSchaefer

@MarlinSchaefer Thanks for raising the issue. I'll see if I get time today to take a look at this. Will send another message when I've figured out what's going on.

Sep 04 '20 13:09 hagabbar

@MarlinSchaefer Right so it's probably a shape issue you're dealing with. I checked my code and the I think you may have the detector and sample rate switched around. Your x_data should end up having a shape of (number of training samples, number of parameters to infer) and your y_data_noisy/y_data_noisefree arrays should have the shape (number of training samples, sample rate, number of detectors).

Also, I'm not sure if this will break the code, but it's probably safe to also make sure that your test data has an extra dimension at the beginning for the number of test samples (even if you're using just 1 test sample). i.e. (number of test samples, number of parameters to infer) for x_data_test and (number of test samples, sample rate, number of detectors) for y_data_test_noisy/y_data_test_noisefree.

Sep 04 '20 14:09 hagabbar

@hagabbar I've tested switching the channels. So for 1000 signals I now have the shape (1000, 256, 3). However, I still get the same error.

I've also tested using an additional dimension for the test-data. Doing so causes the code to crash when re-shaping the y-data (line 722 in run_vitamin). It seems to be assembling the data assuming that it is a single sample in load_data.

Sep 04 '20 15:09 MarlinSchaefer

I've looked a bit more into this, but I'm having a hard time understanding everything your code does. The error occurs on line 618 in CVAE_model.py, where you try to reshape some tensor to [batch_size]. I'm not familiar with tensorflow, so what should this code do and why use resahpe instead of flatten? Could this maybe just be a version-issue?

Sep 05 '20 15:09 MarlinSchaefer

@hagabbar I've had a closer look at the errors/code again. So I've found one problem and was able to resolve it. However, that caused further problems.

So in line 618 of CVAE_model.py there is the line: con = tf.reshape(tf.math.reciprocal(temp_var_r2_sky),[bs_ph]) I'm not sure exactly why this reshape is needed and what it is aiming to achieve. However, the shapes don't match, as temp_var_r2_sky is of shape (batch-size, number of inferred sky parameters). In my case it would be of shape (128, 2). So I would guess that the reshape should be either con = tf.reshape(tf.math.reciprocal(temp_var_r2_sky),[bs_ph, sky_len]) or con = tf.reshape(tf.math.reciprocal(temp_var_r2_sky),[bs_ph * sky_len]).

Either of them passes the reshape but crashes on line 626 reconstr_loss_sky = von_mises_fisher.log_prob(tf.math.l2_normalize(xyz_unit,axis=1)) This is due to a shape mismatch between loc_sky and scale_sky from VI_decoder_r2.py. And this is where I can't understand the shapes anymore. In my mind the scale and the loc of a normal distribution should be of the same shape. But in the code loc_sky is explicitly set to have at least 1 more dimension than scale_sky. The comments say that this is due to the 3rd sky parameter (polarization or distance, I'm guessing) but I'm not sure what the reason for this is.

Sep 15 '20 12:09 MarlinSchaefer

Marlin, I think this is because the sky parameters output from the decoder are designed to be 3D in the sense that they are modelled using the Fisher Von Mises distribution which describes a Gaussian-like blob of probability on the 2-sphere (sky). The Tensorflow probability functions model this with a single variance parameter (so a single blob-width on the 2D sky) but it uses 3 location parameters to define a 3D unit vector pointing towards the centre of the blob. I think we have the decoder output 3 numbers for the location and then we either normalise it to be a unit vector or it normalises it inside the Fisher Von Mises function itself. Does this make sense?

Sep 15 '20 16:09 chrismessenger

vitamin_c vitamin_c copied to clipboard

Cannot run on custom data

vitamin_c
vitamin_c copied to clipboard