tensorflow-tbcnn

why training hangs

Open figurine2018 opened this issue 7 years ago • 3 comments

@Aetf I created the relevant environment and ran embedding.py on my own computer according to your documentation. The program hung after printing between 1 and 25 lines of output (the position of the stall was different each time the program was run), but it did not exit.

2018-04-01 06:01:12.024821: myglobal 1 epoch 1 step 1 loss = 21.25 (0.9 samples/sec; 1.175 sec/batch)
2018-04-01 06:01:12.354372: myglobal 2 epoch 1 step 2 loss = 17.27 (3.2 samples/sec; 0.312 sec/batch)
2018-04-01 06:01:12.787619: myglobal 3 epoch 1 step 3 loss = 10.45 (2.9 samples/sec; 0.346 sec/batch)
2018-04-01 06:01:13.477380: myglobal 4 epoch 1 step 4 loss = 17.19 (1.5 samples/sec; 0.678 sec/batch)
2018-04-01 06:01:14.020272: myglobal 5 epoch 1 step 5 loss = 17.10 (1.9 samples/sec; 0.518 sec/batch)
2018-04-01 06:01:14.258575: myglobal 6 epoch 1 step 6 loss = 10.39 (4.4 samples/sec; 0.228 sec/batch)
2018-04-01 06:01:14.698754: myglobal 7 epoch 1 step 7 loss = 26.52 (2.5 samples/sec; 0.407 sec/batch)
2018-04-01 06:01:14.965694: myglobal 8 epoch 1 step 8 loss = 15.85 (4.1 samples/sec; 0.246 sec/batch)
2018-04-01 06:01:15.259785: myglobal 9 epoch 1 step 9 loss = 17.02 (3.6 samples/sec; 0.274 sec/batch)
<------ it hangs here and does nothing forever; the position is different on each rerun

Ctrl+C does not work, although Ctrl+Z does get me out. I used the "top" command and saw that the host's CPU and memory were idle, so the process was no longer doing any work.
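One way to narrow down a stall like this is to have the interpreter dump the stack of every thread while the process is stuck. The following is only a minimal sketch using the standard-library faulthandler module (available in Python 3.5); the function names install_hang_diagnostics and remove_hang_diagnostics are hypothetical helpers, not part of this repository, and calling them near the top of embedding.py is an assumption about where they would fit.

```python
import faulthandler
import signal
import sys


def install_hang_diagnostics(timeout_s=60.0, stream=None):
    """Hypothetical helper: dump Python stack traces while a run is stalled."""
    if stream is None:
        stream = sys.stderr
    # Dump tracebacks on fatal signals such as SIGSEGV.
    faulthandler.enable(file=stream, all_threads=True)
    # Dump all thread stacks every `timeout_s` seconds until cancelled.
    # The dumps are periodic, not hang-triggered: if consecutive dumps
    # show the same frames, the process is probably stuck there.
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=stream)
    # On Unix, `kill -USR1 <pid>` from another terminal dumps on demand.
    if hasattr(signal, "SIGUSR1"):
        faulthandler.register(signal.SIGUSR1, file=stream, all_threads=True)


def remove_hang_diagnostics():
    """Undo everything install_hang_diagnostics set up."""
    faulthandler.cancel_dump_traceback_later()
    if hasattr(signal, "SIGUSR1"):
        faulthandler.unregister(signal.SIGUSR1)
    faulthandler.disable()
```

The traces contain Python frames only, so a stall inside native code would show up as the Python call that entered it. That is usually still enough to tell whether the process is waiting in a session run, a queue, or a lock.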

My system is Ubuntu 16.04 LTS, tensorflow=1.0.0, tensorflow_fold=0.0.1, python=3.5, CPU only.

Linux ubuntu 4.13.0-37-generic #42~16.04.1-Ubuntu SMP Wed Mar 7 16:03:28 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

How do I solve this problem? Thanks very much!

figurine2018, Apr 01 '18 13:04

Hmm, I can't think of any particular reason that could cause this problem. What was the exact command you used to run it? Also, could you run pytest in the top level folder and see if all the tests pass?

Aetf, Apr 01 '18 17:04

The command I used for the result above is python embedding.py. Does this code never get stuck on your computer?

I found that both test_embedding.py and test_tbcnn.py are written with unittest (for example, unittest.main() and class TestEmbedding(unittest.TestCase):). If I run the pytest command in the root directory, a series of errors is generated (I tried it, and this is indeed the case).

figurine2018, Apr 03 '18 16:04

I changed the default value of the word_dim argument in tbcnn/config.py from 100 to 40, and then it runs.

# parser.add_argument('--word_dim', help='dimension of node feature', type=int, default=100)
parser.add_argument('--word_dim', help='dimension of node feature', type=int, default=40)
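Because word_dim is declared as an argparse option, the same value can in principle be supplied on the command line instead of editing the default. The sketch below uses a stand-in parser that reproduces only the add_argument line shown above. Whether embedding.py actually parses its command line through tbcnn/config.py is an assumption this thread does not confirm.

```python
import argparse


def build_parser():
    """Stand-in for the parser in tbcnn/config.py (only word_dim is reproduced)."""
    parser = argparse.ArgumentParser(description="stand-in for tbcnn/config.py")
    parser.add_argument('--word_dim', help='dimension of node feature', type=int, default=100)
    return parser


# With no flag, the declared default applies.
default_args = build_parser().parse_args([])
# An explicit flag overrides the default without touching the source file.
override_args = build_parser().parse_args(['--word_dim', '40'])

print(default_args.word_dim)
print(override_args.word_dim)
```

If the script does read its options this way, python embedding.py --word_dim 40 would have the same effect as the edit above and would leave config.py unchanged.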

shiyy123, Sep 09 '18 08:09