Error in a training process

Open Hramchenko opened this issue 5 years ago • 0 comments

Hello. I'm trying to train Flow++ on my own data (32x32x3 images, as in CIFAR10), but the process aborts every 1-2 epochs with an error:

tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
val_bpd=4.84827 val_inverr=2.76096 num_val_examples=0001912
iter=0011700 epoch=1.09920 bpd=4.90475 gnorm=7888.22754 lr=0.00030 fps=19.27829 sps=2.40979
iter=0011800 epoch=1.20253 bpd=4.90307 gnorm=8177.17480 lr=0.00030 fps=19.20983 sps=2.40123
Traceback (most recent call last):
  File "/home/user/ML/VENV_HOROVOD/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/home/user/ML/VENV_HOROVOD/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/home/user/ML/VENV_HOROVOD/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
  (0) Invalid argument: Input is not invertible.
         [[{{node gradients/Pointwise_12_1/MatrixDeterminant_grad/MatrixInverse}}]]
         [[global_norm/global_norm/_43351]]
  (1) Invalid argument: Input is not invertible.
         [[{{node gradients/Pointwise_12_1/MatrixDeterminant_grad/MatrixInverse}}]]
0 successful operations.
0 derived errors ignored.

and the results do not look like the training samples even after 10k iterations:

individualImage

sample images:

Screenshot_20200115_122725_S

Do you know the cause of the problem? Is there any way to fix this error? P.S. I have only one GPU, so I start the program with mpiexec -n 1 python run_cifar.py --checkpoint=.... Training parameters: init_bs=16, total_bs=8. Thanks.
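
To illustrate where the error might come from: the failing node is MatrixDeterminant_grad/MatrixInverse, which suggests that the gradient of a determinant is computed through a matrix inverse, as in d det(M)/dM = det(M) * inverse(M)^T. The following is a minimal pure-Python sketch of that formula for 2x2 matrices. It is not taken from the flowpp code; the function names are hypothetical, and the assumption that the Pointwise layer takes the determinant of a learned weight matrix is based only on the node name in the traceback.

```python
def det2(m):
    """Determinant of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    return a * d - b * c


def inv2(m):
    """Inverse of a 2x2 matrix; raises if the matrix is singular."""
    (a, b), (c, d) = m
    det = det2(m)
    if det == 0:
        # Mirrors the message in the traceback; the real check lives in TensorFlow.
        raise ValueError("Input is not invertible.")
    return [[d / det, -b / det], [-c / det, a / det]]


def det_grad_via_inverse(m):
    """Gradient of det(M) with respect to M, computed as det(M) * inverse(M)^T."""
    det = det2(m)
    inv = inv2(m)
    # Entry (i, j) of the transpose of the inverse is inv[j][i].
    return [[det * inv[j][i] for j in range(2)] for i in range(2)]


# A well-conditioned matrix: the gradient is computed without problems.
well_conditioned = [[2.0, 0.0], [0.0, 0.5]]
grad_ok = det_grad_via_inverse(well_conditioned)

# A singular matrix (second row is twice the first): the inverse step fails,
# so the backward pass cannot be evaluated.
singular = [[1.0, 2.0], [2.0, 4.0]]
try:
    det_grad_via_inverse(singular)
    singular_failed = False
except ValueError:
    singular_failed = True
```

If this reading is right, the abort would happen whenever a weight matrix drifts to (numerically) singular during training, which might be related to the large gnorm values in the log above.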

Hramchenko, Jan 15 '20 07:01