
Program hangs in training stage

keishatsai opened this issue 5 years ago

Hi, I wonder whether anyone has run into the same problem I am having right now. The training code works fine at the beginning, but it gets stuck after saving a checkpoint. No error message appears. I have been waiting for hours, and it is still frozen.

I0730 11:34:55.581835 18580 time_manager.py:50] 0023481> heatmaps_mse = 0.000482749, radius_mse = 2.46166e-08
I0730 11:34:57.720099 18580 time_manager.py:50] 0023497> heatmaps_mse = 0.000477729, radius_mse = 2.61259e-08
I0730 11:35:03.389537 18580 checkpoint_manager.py:86] CheckpointManager::save_all call done

It is just stuck on the last line above.

Also, in a second situation, it gets stuck at this point: "Exiting thread preprocess"

I0730 15:46:04.360245 10444 checkpoint_manager.py:86] CheckpointManager::save_all call done
I0730 15:46:04.368223 16840 data_source.py:253] Exiting thread preprocess_UnityEyes_5
I0730 15:46:04.370217 25984 data_source.py:253] Exiting thread preprocess_UnityEyes_4
I0730 15:46:04.371214 29716 data_source.py:253] Exiting thread preprocess_UnityEyes_3
I0730 15:46:04.371214 13788 data_source.py:253] Exiting thread preprocess_UnityEyes_7
I0730 15:46:04.372213 28312 data_source.py:253] Exiting thread preprocess_UnityEyes_0
I0730 15:46:04.372213 28344 data_source.py:253] Exiting thread preprocess_UnityEyes_6
I0730 15:46:04.372213 28700 data_source.py:253] Exiting thread preprocess_UnityEyes_2
I0730 15:46:04.372213  6704 data_source.py:253] Exiting thread preprocess_UnityEyes_1

I was trying to train from scratch with the UnityEyes dataset, and my environment settings are as follows:

Windows 10, CUDA 10.0, cuDNN 7.6, tensorflow-gpu 1.14.0, opencv-python 4.1.0.25, Python 3.6

Or do I need some other dependency to run this repo? I also have problems getting elg_demo.py to run.

keishatsai avatar Jul 30 '19 05:07 keishatsai

I am not sure whether this program can run on Windows 10.

But how many UnityEyes images are you using?

After it got stuck following the checkpoint save, did you check whether the program was still running? Did you check nvidia-smi to see whether the ELG model was still using the full GPU?

The "Exiting thread" problem can be solved by killing the process.
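If you want to check whether the process is really hung or still doing work, one option (just a suggestion; py-spy is a separate tool, not a dependency of this repo) is to dump the Python thread stacks of the running process:

py-spy dump --pid ${PROCESS_ID}

That prints where each thread is currently blocked, for example whether the main thread is stuck waiting on the preprocess threads.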

WuZhuoran avatar Jul 30 '19 17:07 WuZhuoran

Hi @WuZhuoran, thanks for replying. It was using nearly the full GPU (7 GB out of 8 GB of memory) while it hung. I ran nvidia-smi every 5 seconds to check, so I know.
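For reference, I did the 5-second check with nvidia-smi's built-in loop option (just my own polling command, nothing from the repo):

nvidia-smi -l 5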

Currently, I have prepared 7524 images for training.

What do you mean by "Exiting thread problem can be solved by killing process"?

keishatsai avatar Jul 31 '19 01:07 keishatsai

I mean, if you find that you cannot stop the process, you can use this command:

kill -9 ${PROCESS_ID}

to exit the process.
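Since your environment is Windows 10, the rough equivalent there (standard Windows commands, nothing specific to this repo) would be to look up the PID and then force-kill it, replacing ${PROCESS_ID} with the PID shown by tasklist:

tasklist | findstr python
taskkill /PID ${PROCESS_ID} /F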

WuZhuoran avatar Jul 31 '19 16:07 WuZhuoran

@WuZhuoran So did you encounter this as well? Were you able to stop the process normally? Actually, I am not quite sure whether my training has finished or not. If I kill it, that means I have to restart training over and over again.

keishatsai avatar Aug 02 '19 01:08 keishatsai

@keishatsai I did encounter it before. But most of the time, I can stop the process normally.

WuZhuoran avatar Aug 02 '19 16:08 WuZhuoran