How to solve the 'Input is not invertible' error?
I am trying to train a Glow model on a custom dataset. However, while training I frequently receive a tensorflow.python.framework.errors_impl.InvalidArgumentError: Input is not invertible error. Looking at the logs, I see that the training/validation stats have reached either inf or nan.
I then tried to simply reproduce your qualitative results for CelebA 256x256, but I still hit the same issue, and I am at a loss as to how to debug it. I downloaded the celeba-tfr dataset locally.
Command:
python train.py --problem celeba --image_size 256 --n_level 6 --depth 32 --flow_permutation 2 --flow_coupling 0 --seed 0 --learntop --lr 0.001 --n_bits_x 5 --data_dir=./celeba-tfr --verbose --epochs_full_valid=1 --epochs_full_sample=1 --n_train=30 --n_test=30
Namespace:
Namespace(anchor_size=32, beta1=0.9, category='', dal=1, data_dir='./celeba-tfr', depth=32, direct_iterator=True, epochs=1000000, epochs_full_sample=1, epochs_full_valid=1, epochs_warmup=10, flow_coupling=0, flow_permutation=2, fmap=1, full_test_its=30, gradient_checkpointing=1, image_size=256, inference=False, learntop=True, local_batch_init=4, local_batch_test=1, local_batch_train=1, logdir='./logs', lr=0.001, n_batch_init=256, n_batch_test=50, n_batch_train=64, n_bins=32.0, n_bits_x=5, n_levels=6, n_sample=1, n_test=30, n_train=30, n_y=1, optimizer='adamax', pmap=16, polyak_epochs=1, problem='celeba', restore_path='', rnd_crop=False, seed=0, test_its=1, top_shape=[4, 4, 384], train_its=1, verbose=True, weight_decay=1.0, weight_y=0.0, width=512, ycond=False)
Trace:
Starting training. Logging to /home/ubuntu/glow_/logs/
epoch n_processed n_images ips dtrain dtest dsample dtot train_results test_results msg
0 179.9140625 [2.5411766 2.5411766 0. 1. ]
1 64 1 0.0 179.9 88.8 177.1 445.7 [2.5411766 2.5411766 0. 1. ] [2.7737396 2.7737396 0. 1. ] *
64 5.25806736946106 [2.6743338 2.6743338 0. 1. ]
2 128 2 0.2 5.3 36.1 161.6 203.0 [2.6743338 2.6743338 0. 1. ] [nan nan 0. 1.]
128 4.962073087692261 [nan nan 0. 1.]
Traceback (most recent call last):
File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/python/clie
nt/session.py", line 1322, in _do_call
return fn(*args)
File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/python/clie
nt/session.py", line 1307, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/python/clie
nt/session.py", line 1409, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Input is not invertible.
[[Node: model_3/1/28/invconv/MatrixInverse = MatrixInverse[T=DT_FLOAT, adjoint=false, _devic
e="/job:localhost/replica:0/task:0/device:GPU:0"](model/1/28/invconv/W/read)]]
[[Node: model_3/5/6/f1/l_1/Shape/_79621 = _Recv[client_terminated=false, recv_device="/job:l
ocalhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0",
send_device_incarnation=1, tensor_name="edge_9830_model_3/5/6/f1/l_1/Shape", tensor_type=DT_INT32, _d
evice="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "train.py", line 414, in <module>
main(hps)
File "train.py", line 163, in main
train(sess, model, hps, logdir, visualise)
File "train.py", line 274, in train
visualise(epoch)
File "train.py", line 50, in draw_samples
x_samples.append(sample_batch(y, [.0]*n_batch))
File "train.py", line 33, in sample_batch
y[i*n_batch:i*n_batch + n_batch], eps[i*n_batch:i*n_batch + n_batch]))
File "/home/ubuntu/glow_/model.py", line 242, in sample
return m.sess.run(x_sampled, {Y: _y, m.eps_std: _eps_std})
File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 900, in run
run_metadata_ptr)
File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1135, in _run
feed_dict_tensor, options, run_metadata)
File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1316, in _do_run
run_metadata)
File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1335, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Input is not invertible.
[[Node: model_3/1/28/invconv/MatrixInverse = MatrixInverse[T=DT_FLOAT, adjoint=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](model/1/28/invconv/W/read)]]
[[Node: model_3/5/6/f1/l_1/Shape/_79621 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_9830_model_3/5/6/f1/l_1/Shape", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
Caused by op 'model_3/1/28/invconv/MatrixInverse', defined at:
File "train.py", line 414, in <module>
main(hps)
File "train.py", line 156, in main
model = model.model(sess, hps, train_iterator, test_iterator, data_init)
File "/home/ubuntu/glow_/model.py", line 239, in model
x_sampled = f_sample(Y, m.eps_std)
File "/home/ubuntu/glow_/model.py", line 232, in f_sample
z = decoder(z, eps_std=eps_std)
File "/home/ubuntu/glow_/model.py", line 97, in decoder
z, _ = revnet2d(str(i), z, 0, hps, reverse=True)
File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 183, in func_with_args
return func(*args, **current_args)
File "/home/ubuntu/glow_/model.py", line 342, in revnet2d
z, logdet = revnet2d_step(str(i), z, logdet, hps, reverse)
File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 183, in func_with_args
return func(*args, **current_args)
File "/home/ubuntu/glow_/model.py", line 411, in revnet2d_step
"invconv", z, logdet, reverse=True)
File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 183, in func_with_args
return func(*args, **current_args)
File "/home/ubuntu/glow_/model.py", line 467, in invertible_1x1_conv
_w = tf.matrix_inverse(w)
File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/python/ops/gen_linalg_ops.py", line 1049, in matrix_inverse
"MatrixInverse", input=input, adjoint=adjoint, name=name)
File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3392, in create_op
op_def=op_def)
File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1718, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
InvalidArgumentError (see above for traceback): Input is not invertible.
[[Node: model_3/1/28/invconv/MatrixInverse = MatrixInverse[T=DT_FLOAT, adjoint=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](model/1/28/invconv/W/read)]]
[[Node: model_3/5/6/f1/l_1/Shape/_79621 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_9830_model_3/5/6/f1/l_1/Shape", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
I suspected bad learning rates might be making the kernel non-invertible, so I experimented with lower LRs, but it didn't help.
I'm having this issue as well. I've been able to train on a custom dataset without conditioning on class labels. However, if I set --ycond and weight_y 0.01, the gradients start off decreasing but eventually explode to inf, causing the error.
I first ran the experiment on an AWS p2.xlarge EC2 instance, and it gave me these errors.
I later re-ran the experiment on an NVIDIA Titan X, and it ran smoothly: converging on 64x64 images, but not converging for 256x256. Exact same hyperparameters.
I doubt it can be solved. The model is prone to bad random initialisations that lead to inf and NaN errors in the search space.
How about adding "+ tf.eye(shape[3]) * 10e-4" to this line: https://github.com/openai/glow/blob/master/model.py#L451 ? Does that make any difference?
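For reference, here is a minimal sketch of what that suggestion could look like (my own illustration against the TF1-style invertible_1x1_conv in model.py; the function name and the 1e-4 jitter are assumptions, not tested settings):

import tensorflow as tf

def inverse_with_jitter(w, jitter=1e-4):
    # w: the [C, C] kernel of the invertible 1x1 conv. Adding jitter * I
    # nudges w away from singular before the reverse pass inverts it.
    c = w.get_shape().as_list()[0]
    return tf.matrix_inverse(w + tf.eye(c) * jitter)

Note that if the jitter is applied only at inversion time, the reverse pass is no longer the exact inverse of the forward 1x1 conv, so samples will be slightly biased.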
I also experienced a similar error, and avoided it with the following modification.
diff --git a/model.py b/model.py
index b918ab0..68cb3fe 100644
--- a/model.py
+++ b/model.py
@@ -373,7 +373,7 @@ def revnet2d_step(name, z, logdet, hps, reverse):
             h = f("f1", z1, hps.width, n_z)
             shift = h[:, :, :, 0::2]
             # scale = tf.exp(h[:, :, :, 1::2])
-            scale = tf.nn.sigmoid(h[:, :, :, 1::2] + 2.)
+            scale = tf.nn.sigmoid(h[:, :, :, 1::2] + 2.) + 1e-10
             z2 += shift
             z2 *= scale
             logdet += tf.reduce_sum(tf.log(scale), axis=[1, 2, 3])
@@ -393,7 +393,7 @@ def revnet2d_step(name, z, logdet, hps, reverse):
             h = f("f1", z1, hps.width, n_z)
             shift = h[:, :, :, 0::2]
             # scale = tf.exp(h[:, :, :, 1::2])
-            scale = tf.nn.sigmoid(h[:, :, :, 1::2] + 2.)
+            scale = tf.nn.sigmoid(h[:, :, :, 1::2] + 2.) + 1e-10
             z2 /= scale
             z2 -= shift
             logdet -= tf.reduce_sum(tf.log(scale), axis=[1, 2, 3])
diff --git a/tfops.py b/tfops.py
index d978419..2e7c556 100644
--- a/tfops.py
+++ b/tfops.py
@@ -449,9 +449,9 @@ def gaussian_diag(mean, logsd):
     o.sample = mean + tf.exp(logsd) * o.eps
     o.sample2 = lambda eps: mean + tf.exp(logsd) * eps
     o.logps = lambda x: -0.5 * \
-        (np.log(2 * np.pi) + 2. * logsd + (x - mean) ** 2 / tf.exp(2. * logsd))
+        (np.log(2 * np.pi) + 2. * logsd + (x - mean) ** 2 / (tf.exp(2. * logsd) + 1e-10))
     o.logp = lambda x: flatten_sum(o.logps(x))
-    o.get_eps = lambda x: (x - mean) / tf.exp(logsd)
+    o.get_eps = lambda x: (x - mean) / (tf.exp(logsd) + 1e-10)
     return o
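To see why the 1e-10 floor matters (my own minimal check, assuming TF 1.x as this repo uses): when the sigmoid underflows to 0.0 in float32, z2 /= scale divides by zero and tf.log(scale) returns -inf, which then propagates as nan through the loss.

import tensorflow as tf

# h stands in for the coupling-network output; -120 is extreme enough that
# the sigmoid underflows to exactly 0.0 in float32.
h = tf.constant([-120.0])
scale = tf.nn.sigmoid(h + 2.)
safe_scale = scale + 1e-10

with tf.Session() as sess:
    print(sess.run(tf.log(scale)))       # [-inf]
    print(sess.run(tf.log(safe_scale)))  # about [-23.03], finite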
@tatsuhiko-inoue, thanks for the suggestion. It didn't work for me, though. Those modifications sit under the elif hps.flow_coupling == 1 branch (affine coupling), while I'm following the parameters for the conditional qualitative results, for which flow_coupling is 0 (additive). For example:
python train.py --problem cifar10 --image_size 32 --n_level 3 --depth 32 --flow_permutation 2 --flow_coupling 0 --seed 0 --learntop --lr 0.001 --n_bits_x 5 --ycond --weight_y=0.01
I just ran it after changing --flow_coupling to 1, but it still results in the same problem.
@omidsakhi, that didn't solve it for me either, I'm afraid. Thanks.
When I run Glow, the gradient of "logsd" in gaussian_diag() can become NaN; once "logsd" is 45.0 or more, the gradient is NaN.
I was able to avoid the NaN gradient by computing the gradient of "x/exp(y)" as a single op, as follows, but the loss became unstable instead.
import numpy as np
import tensorflow as tf

@tf.custom_gradient
def div_by_exp(x, y):
    # Compute x / exp(y) with a hand-written gradient, so backprop never
    # multiplies an overflowed exp(y) against a zero (the 0 * inf = nan case).
    exp_y = tf.exp(y) + 1e-10
    ret = x / exp_y
    def _grad(dy):
        # d/dx = 1/exp(y);  d/dy = -x/exp(y) = -ret
        return dy / exp_y, dy * -ret
    return ret, _grad

def gaussian_diag(mean, logsd):
    :
    o.logps = lambda x: -0.5 * (np.log(2 * np.pi) + 2. * logsd + div_by_exp((x - mean) ** 2, 2 * logsd))
    :
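For what it's worth, a quick check of the difference (my own snippet, TF1-style): at logsd around 45 or more, exp(2 * logsd) overflows float32 to inf, so the stock gradient ends up evaluating 0 * inf = nan, while div_by_exp stays finite.

import tensorflow as tf

x = tf.constant(1.0)
logsd = tf.constant(50.0)

naive = x ** 2 / tf.exp(2. * logsd)    # forward value is 0.0, but its gradient is 0 * inf
safe = div_by_exp(x ** 2, 2. * logsd)  # gradient is formed without a bare exp()

with tf.Session() as sess:
    print(sess.run(tf.gradients(naive, logsd)))  # [nan]
    print(sess.run(tf.gradients(safe, logsd)))   # finite (about 0)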
I've met the same issue. How did you solve it? @tatsuhiko-inoue @nuges01 @arunpatro
Hello guys, I found a solution for this 'not invertible' problem. During training, the weights of the invertible 1x1 conv keep increasing to balance the log-determinant terms produced by the invertible 1x1 conv and the affine coupling layer/actnorm. This can be solved by adding a regularization term on the weights of the invertible 1x1 conv only; in practice I use L2 regularization. It's worth mentioning, though, that after adding the regularization term, slightly more epochs are needed to converge to the same NLL.
I discussed it in our recent publication "Generative Model with Dynamic Linear Flow", which improves the performance of flow-based methods significantly and converges faster than Glow. Our code is here.
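For readers who want to try the regularization idea without pulling in the full Dynamic Linear Flow code, a minimal sketch (my own, not the author's released implementation; the helper name, the 1e-5 coefficient, and the name filter are assumptions for illustration):

import tensorflow as tf

def invconv_l2_penalty(coeff=1e-5):
    # Hypothetical helper: collect every invertible-1x1-conv kernel in the
    # graph (their names contain "invconv/W" in this repo, as the error trace
    # above shows) and penalize its squared L2 norm, discouraging W from
    # drifting toward a singular matrix.
    ws = [v for v in tf.trainable_variables() if "invconv/W" in v.name]
    return coeff * tf.add_n([tf.reduce_sum(tf.square(w)) for w in ws])

# e.g. total_loss = nll_loss + invconv_l2_penalty()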
> + tf.eye(shape[3]) * 10e-4

I think you mean "+ tf.eye(3) * 10e-4"; shape[3] is not defined there.