Why training hangs
@Aetf I created the relevant environment and ran embedding.py on my own computer according to your documentation. The program printed between 1 and 25 log lines and then hung without exiting. The point where it stalled was different on each run.
2018-04-01 06:01:12.024821: myglobal 1 epoch 1 step 1 loss = 21.25 (0.9 samples/sec; 1.175 sec/batch)
2018-04-01 06:01:12.354372: myglobal 2 epoch 1 step 2 loss = 17.27 (3.2 samples/sec; 0.312 sec/batch)
2018-04-01 06:01:12.787619: myglobal 3 epoch 1 step 3 loss = 10.45 (2.9 samples/sec; 0.346 sec/batch)
2018-04-01 06:01:13.477380: myglobal 4 epoch 1 step 4 loss = 17.19 (1.5 samples/sec; 0.678 sec/batch)
2018-04-01 06:01:14.020272: myglobal 5 epoch 1 step 5 loss = 17.10 (1.9 samples/sec; 0.518 sec/batch)
2018-04-01 06:01:14.258575: myglobal 6 epoch 1 step 6 loss = 10.39 (4.4 samples/sec; 0.228 sec/batch)
2018-04-01 06:01:14.698754: myglobal 7 epoch 1 step 7 loss = 26.52 (2.5 samples/sec; 0.407 sec/batch)
2018-04-01 06:01:14.965694: myglobal 8 epoch 1 step 8 loss = 15.85 (4.1 samples/sec; 0.246 sec/batch)
2018-04-01 06:01:15.259785: myglobal 9 epoch 1 step 9 loss = 17.02 (3.6 samples/sec; 0.274 sec/batch)
<------ it hangs here and does nothing forever; the position is different on the next rerun
Ctrl+C has no effect; only Ctrl+Z gets me back to the shell (it suspends the process rather than terminating it). Using the "top" command, I could see that the host's CPU and memory were idle, so the program was no longer doing any work.
My system is Ubuntu 16.04 LTS, tensorflow=1.0.0, tensorflow_fold=0.0.1, python=3.5, CPU only.
Linux ubuntu 4.13.0-37-generic #42~16.04.1-Ubuntu SMP Wed Mar 7 16:03:28 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
How do I solve this problem? Thanks very much!
Hmm, I can't think of any particular reason that could cause this problem. What was the exact command you used to run it? Also, could you run pytest in the top level folder and see if all the tests pass?
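Since the stall point moves between runs and Ctrl+C is ignored, it may help to find out where the process is actually stuck. The following is a minimal sketch using Python's standard `faulthandler` module (available in Python 3.5). The helper names `enable_hang_diagnostics` and `disable_hang_diagnostics` are hypothetical and not part of this repository; the idea would be to call the first one near the top of embedding.py.

```python
import faulthandler
import signal
import sys


def enable_hang_diagnostics(timeout_sec=60.0, stream=None):
    """Dump every thread's Python stack if the process appears stuck.

    Hypothetical helper, not part of the repository.
    """
    if stream is None:
        stream = sys.stderr
    # Every timeout_sec seconds, write the stack of all threads to `stream`.
    # The dump is produced by a watchdog thread, so it still fires while the
    # main thread is blocked.
    faulthandler.dump_traceback_later(timeout_sec, repeat=True, file=stream)
    # On Unix, `kill -USR1 <pid>` then triggers an extra dump on demand.
    if hasattr(signal, "SIGUSR1"):
        faulthandler.register(signal.SIGUSR1, file=stream, all_threads=True)


def disable_hang_diagnostics():
    """Undo enable_hang_diagnostics()."""
    faulthandler.cancel_dump_traceback_later()
    if hasattr(signal, "SIGUSR1"):
        faulthandler.unregister(signal.SIGUSR1)
```

If the hang is inside native TensorFlow code, the dump only shows the last Python frame that called into it, but that is usually enough to tell a data-loading stall from a stuck session call.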
The command I used for the output above was python embedding.py. Does this code never hang on your computer?
I found that both test_embedding.py and test_tbcnn.py are written with unittest (for example, unittest.main() and class TestEmbedding(unittest.TestCase):). When I run the pytest command in the root directory, it produces a series of errors.
I changed the default value of the word_dim argument in tbcnn/config.py from 100 to 40, and then it runs.
# parser.add_argument('--word_dim', help='dimension of node feature', type=int, default=100)
parser.add_argument('--word_dim', help='dimension of node feature', type=int, default=40)
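Since word_dim is declared through `parser.add_argument`, the value can probably also be overridden on the command line instead of editing the file. The sketch below rebuilds only that one argument in a stand-alone parser to show the behaviour. `build_parser` and `parse_word_dim` are hypothetical names, and whether embedding.py actually forwards its command-line arguments to this parser is an assumption the thread does not confirm.

```python
import argparse


def build_parser():
    """Stand-in for the parser in tbcnn/config.py (hypothetical, one argument only)."""
    parser = argparse.ArgumentParser()
    # Same declaration as in the thread, with the original default of 100.
    parser.add_argument('--word_dim', help='dimension of node feature', type=int, default=100)
    return parser


def parse_word_dim(argv):
    """Return word_dim for the given argument list."""
    return build_parser().parse_args(argv).word_dim


# No flag given: the default from the declaration applies.
default_dim = parse_word_dim([])
# Flag given: the command-line value replaces the default.
override_dim = parse_word_dim(['--word_dim', '40'])
```

If the script does read its arguments this way, `python embedding.py --word_dim 40` would have the same effect as the edit above while leaving config.py untouched.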