flowpp
Error in a training process
Hello. I'm trying to train Flow++ on my own data (32x32x3 images, as in CIFAR-10), but the process aborts every 1-2 epochs with an error:
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
val_bpd=4.84827 val_inverr=2.76096 num_val_examples=0001912
iter=0011700 epoch=1.09920 bpd=4.90475 gnorm=7888.22754 lr=0.00030 fps=19.27829 sps=2.40979
iter=0011800 epoch=1.20253 bpd=4.90307 gnorm=8177.17480 lr=0.00030 fps=19.20983 sps=2.40123
Traceback (most recent call last):
File "/home/user/ML/VENV_HOROVOD/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
return fn(*args)
File "/home/user/ML/VENV_HOROVOD/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
target_list, run_metadata)
File "/home/user/ML/VENV_HOROVOD/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
(0) Invalid argument: Input is not invertible.
[[{{node gradients/Pointwise_12_1/MatrixDeterminant_grad/MatrixInverse}}]]
[[global_norm/global_norm/_43351]]
(1) Invalid argument: Input is not invertible.
[[{{node gradients/Pointwise_12_1/MatrixDeterminant_grad/MatrixInverse}}]]
0 successful operations.
0 derived errors ignored.
Also, the results do not look like the training samples, even after 10k iterations:
sample images:
Do you know the cause of the problem? Is there any way to fix this error?
P.S. I have only one GPU, so I start the program with mpiexec -n 1 python run_cifar.py --checkpoint=... Training parameters: init_bs=16, total_bs=8. Thanks.
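For reference, the node that fails in the trace above is the MatrixInverse inside the gradient of MatrixDeterminant. Below is a minimal NumPy sketch of why that gradient needs an inverse and why it breaks once the matrix becomes singular. This is not the flowpp code: the function name and the example matrices are made up for illustration, and I am only assuming that the Pointwise layer takes the determinant of a learned weight matrix.

```python
import numpy as np


def det_grad_via_inverse(w):
    """Gradient of det(w) with respect to w.

    Uses the identity d det(W) / dW = det(W) * inv(W)^T, which is only
    defined when W is invertible.
    """
    return np.linalg.det(w) * np.linalg.inv(w).T


# Hypothetical well-conditioned weight matrix: the gradient is well defined.
w_ok = np.array([[2.0, 0.0],
                 [0.0, 3.0]])
grad_ok = det_grad_via_inverse(w_ok)

# Hypothetical singular weight matrix (second row is twice the first):
# the inverse does not exist, so the gradient computation fails.
w_singular = np.array([[1.0, 2.0],
                       [2.0, 4.0]])
try:
    det_grad_via_inverse(w_singular)
    singular_failed = False
except np.linalg.LinAlgError:
    singular_failed = True

print(grad_ok)
print("gradient failed on singular matrix:", singular_failed)
```

If this is what happens here, the weight matrix of that layer would have to become singular (or numerically singular) at some point during training, but I have not confirmed that.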