Training on CPU: "invalid on input"
Hi, I'm trying to run the parity experiment locally on my CPU:
python exps/parity.py --seq=20
But at epoch 18 I get the error "invalid on input":
Epoch 0 Test Loss 1.3451 Err: 0.5060: 100%|████████████████████████████████| 2/2 [00:00<00:00, 4.53it/s]
TESTING SET RESULTS: Average loss: 1.3595 Err: 0.5100
Epoch 1 Train Loss 0.6839 Err: 0.4100: 100%|██████████████████████████████| 90/90 [00:05<00:00, 17.56it/s]
Epoch 1 Test Loss 0.7007 Err: 0.5060: 100%|████████████████████████████████| 2/2 [00:00<00:00, 5.67it/s]
TESTING SET RESULTS: Average loss: 0.7005 Err: 0.5100
Epoch 2 Train Loss 0.6832 Err: 0.4100: 100%|██████████████████████████████| 90/90 [00:04<00:00, 18.22it/s]
Epoch 2 Test Loss 0.7004 Err: 0.5060: 100%|████████████████████████████████| 2/2 [00:00<00:00, 5.39it/s]
TESTING SET RESULTS: Average loss: 0.6999 Err: 0.5100
Epoch 3 Train Loss 0.6813 Err: 0.4100: 100%|██████████████████████████████| 90/90 [00:05<00:00, 17.29it/s]
Epoch 3 Test Loss 0.7003 Err: 0.5060: 100%|████████████████████████████████| 2/2 [00:00<00:00, 4.58it/s]
TESTING SET RESULTS: Average loss: 0.6994 Err: 0.5100
Epoch 4 Train Loss 0.6814 Err: 0.3900: 100%|██████████████████████████████| 90/90 [00:06<00:00, 14.00it/s]
Epoch 4 Test Loss 0.7008 Err: 0.5120: 100%|████████████████████████████████| 2/2 [00:00<00:00, 4.53it/s]
TESTING SET RESULTS: Average loss: 0.7012 Err: 0.5140
Epoch 5 Train Loss 0.6823 Err: 0.4100: 100%|██████████████████████████████| 90/90 [00:06<00:00, 14.10it/s]
Epoch 5 Test Loss 0.6999 Err: 0.5060: 100%|████████████████████████████████| 2/2 [00:00<00:00, 4.49it/s]
TESTING SET RESULTS: Average loss: 0.6996 Err: 0.5100
Epoch 6 Train Loss 0.6763 Err: 0.3800: 100%|██████████████████████████████| 90/90 [00:06<00:00, 14.06it/s]
Epoch 6 Test Loss 0.7024 Err: 0.5120: 100%|████████████████████████████████| 2/2 [00:00<00:00, 4.52it/s]
TESTING SET RESULTS: Average loss: 0.7028 Err: 0.5130
Epoch 7 Train Loss 0.6836 Err: 0.4100: 100%|██████████████████████████████| 90/90 [00:06<00:00, 13.74it/s]
Epoch 7 Test Loss 0.6986 Err: 0.5060: 100%|████████████████████████████████| 2/2 [00:00<00:00, 4.26it/s]
TESTING SET RESULTS: Average loss: 0.6986 Err: 0.5100
Epoch 8 Train Loss 0.6854 Err: 0.4100: 100%|██████████████████████████████| 90/90 [00:06<00:00, 13.82it/s]
Epoch 8 Test Loss 0.6983 Err: 0.5060: 100%|████████████████████████████████| 2/2 [00:00<00:00, 4.53it/s]
TESTING SET RESULTS: Average loss: 0.6979 Err: 0.5100
Epoch 9 Train Loss 0.6882 Err: 0.4500: 100%|██████████████████████████████| 90/90 [00:06<00:00, 13.97it/s]
Epoch 9 Test Loss 0.6986 Err: 0.5060: 100%|████████████████████████████████| 2/2 [00:00<00:00, 4.12it/s]
TESTING SET RESULTS: Average loss: 0.6974 Err: 0.5100
Epoch 10 Train Loss 0.6878 Err: 0.4100: 100%|█████████████████████████████| 90/90 [00:06<00:00, 13.93it/s]
Epoch 10 Test Loss 0.6985 Err: 0.5060: 100%|███████████████████████████████| 2/2 [00:00<00:00, 4.57it/s]
TESTING SET RESULTS: Average loss: 0.6970 Err: 0.5100
Epoch 11 Train Loss 0.6875 Err: 0.4100: 100%|█████████████████████████████| 90/90 [00:06<00:00, 14.09it/s]
Epoch 11 Test Loss 0.6981 Err: 0.5060: 100%|███████████████████████████████| 2/2 [00:00<00:00, 4.25it/s]
TESTING SET RESULTS: Average loss: 0.6974 Err: 0.5100
Epoch 12 Train Loss 0.6830 Err: 0.3900: 100%|█████████████████████████████| 90/90 [00:06<00:00, 13.90it/s]
Epoch 12 Test Loss 0.6983 Err: 0.5120: 100%|███████████████████████████████| 2/2 [00:00<00:00, 4.35it/s]
TESTING SET RESULTS: Average loss: 0.6988 Err: 0.5130
Epoch 13 Train Loss 0.6857 Err: 0.4100: 100%|█████████████████████████████| 90/90 [00:06<00:00, 13.26it/s]
Epoch 13 Test Loss 0.6980 Err: 0.5060: 100%|███████████████████████████████| 2/2 [00:00<00:00, 4.44it/s]
TESTING SET RESULTS: Average loss: 0.6977 Err: 0.5100
Epoch 14 Train Loss 0.6796 Err: 0.4500: 100%|█████████████████████████████| 90/90 [00:06<00:00, 13.64it/s]
Epoch 14 Test Loss 0.6982 Err: 0.4860: 100%|███████████████████████████████| 2/2 [00:00<00:00, 4.52it/s]
TESTING SET RESULTS: Average loss: 0.6989 Err: 0.5030
Epoch 15 Train Loss 0.6886 Err: 0.4800: 100%|█████████████████████████████| 90/90 [00:06<00:00, 13.89it/s]
Epoch 15 Test Loss 0.6974 Err: 0.5060: 100%|███████████████████████████████| 2/2 [00:00<00:00, 4.44it/s]
TESTING SET RESULTS: Average loss: 0.6960 Err: 0.5100
Epoch 16 Train Loss 0.6856 Err: 0.4500: 100%|█████████████████████████████| 90/90 [00:06<00:00, 13.99it/s]
Epoch 16 Test Loss 0.6979 Err: 0.5080: 100%|███████████████████████████████| 2/2 [00:00<00:00, 4.22it/s]
TESTING SET RESULTS: Average loss: 0.6997 Err: 0.5080
Epoch 17 Train Loss 0.6784 Err: 0.3800: 100%|█████████████████████████████| 90/90 [00:06<00:00, 14.12it/s]
Epoch 17 Test Loss 0.7000 Err: 0.5120: 100%|███████████████████████████████| 2/2 [00:00<00:00, 4.32it/s]
TESTING SET RESULTS: Average loss: 0.7011 Err: 0.5130
Epoch 18 Train Loss 0.0310 Err: 0.0000: 49%|██████████████▏ | 44/90 [00:03<00:04, 11.23it/s]invalid on input
invalid on input
invalid on input
invalid on input
invalid on input
invalid on input
Epoch 18 Train Loss 0.0234 Err: 0.0000: 51%|██████████████▊ | 46/90 [00:03<00:03, 11.16it/s]invalid on input
invalid on input
invalid on input
[...]
What could be wrong?
Sorry for the delayed reply. The "invalid on input" warning (satnet_cpp:194) means that the gradient contains NaN or Inf values, which didn't happen during our tests. Could you describe the environment in which the bug occurs (CPU model, NumPy/PyTorch versions)?
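Since the warning means NaN or Inf values have reached the gradient, one way to narrow this down from the Python side is to check every parameter's gradient right after loss.backward() and before the optimizer step. This is a minimal sketch assuming a standard PyTorch training loop; nonfinite_grads is a hypothetical helper (not part of SATNet or exps/parity.py), and the small nn.Linear model only stands in for the real one.

```python
import torch
import torch.nn as nn


def nonfinite_grads(model: nn.Module) -> list:
    """Return the names of parameters whose gradient contains NaN or Inf."""
    bad = []
    for name, param in model.named_parameters():
        # Parameters that took no part in the backward pass have grad None.
        if param.grad is not None and not torch.isfinite(param.grad).all():
            bad.append(name)
    return bad


# Stand-in model: call the check after loss.backward(), before optimizer.step().
model = nn.Linear(4, 1)
x = torch.randn(8, 4)
loss = model(x).pow(2).mean()
loss.backward()
print(nonfinite_grads(model))  # [] for a healthy backward pass

# Simulate the reported failure mode by poisoning one gradient entry.
model.weight.grad[0, 0] = float("nan")
print(nonfinite_grads(model))  # ['weight']
```

Logging the epoch and batch index whenever this list is non-empty would show exactly when the first bad gradient appears. PyTorch's torch.autograd.set_detect_anomaly(True) can additionally report which backward operation produced the NaN, at a noticeable speed cost.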
No worries, sorry for my late reply now :)
I haven't found the time to try again yet. I'll report back when I do.
I finally found the time to try again. It's still the same problem, but now it happens at a later epoch.
Manjaro Linux, Linux 5.3.18-1 (running in VirtualBox)
CPU: Intel i7-8550U
Python 3.8.1
numpy 1.18.0
torch 1.3.1
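The details requested above (OS, CPU model, Python/NumPy/PyTorch versions) can also be collected with a short script, which makes reports like this easier to reproduce. A minimal sketch using only the standard library; cpu_model and environment_report are hypothetical helper names. On Linux the CPU model name is read from /proc/cpuinfo, because platform.processor() often returns only the architecture there.

```python
import importlib
import platform


def cpu_model() -> str:
    # On Linux the model name (e.g. the i7-8550U string) is listed in /proc/cpuinfo;
    # elsewhere, fall back to what the platform module reports.
    try:
        with open("/proc/cpuinfo") as f:
            for line in f:
                if line.startswith("model name"):
                    return line.split(":", 1)[1].strip()
    except OSError:
        pass
    return platform.processor() or platform.machine()


def environment_report() -> dict:
    """Collect OS, CPU and library versions into a dict."""
    report = {
        "os": f"{platform.system()} {platform.release()}",
        "cpu": cpu_model(),
        "python": platform.python_version(),
    }
    for module_name in ("numpy", "torch"):
        try:
            module = importlib.import_module(module_name)
            report[module_name] = getattr(module, "__version__", "unknown")
        except ImportError:
            report[module_name] = "not installed"
    return report


for key, value in environment_report().items():
    print(f"{key}: {value}")
```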
Tell me if you need more information.
Thanks for your help!
Sorry for the late update. I've updated the APIs to work with PyTorch 1.7.0, and I also fixed the bug in the CPU version. Could you confirm that it works on your side too?
Thank you for the update. I'll report back when I try again.