
Stuck when running "python3 train.py"

TonyTangYu opened this issue 6 years ago · 15 comments

I finished installing light head r-cnn and preparing the data. When I run 'python3 train.py -d 0', the information on the terminal is

Start data provider ipc://@dataflow-map-pipe-1c5a64ee-0
Start data provider ipc://@dataflow-map-pipe-1c5a64ee-1
Start data provider ipc://@dataflow-map-pipe-1c5a64ee-2
Start data provider ipc://@dataflow-map-pipe-1c5a64ee-3
Start data provider ipc://@dataflow-map-pipe-1c5a64ee-4
Start data provider ipc://@dataflow-map-pipe-1c5a64ee-5
Start data provider ipc://@dataflow-map-pipe-1c5a64ee-6
Start data provider ipc://@dataflow-map-pipe-1c5a64ee-7
Start data provider ipc://@dataflow-map-pipe-1c5a64ee-8
Start data provider ipc://@dataflow-map-pipe-1c5a64ee-9
Start data provider ipc://@dataflow-map-pipe-1c5a64ee-10
Start data provider ipc://@dataflow-map-pipe-1c5a64ee-11
Start data provider ipc://@dataflow-map-pipe-1c5a64ee-12
Start data provider ipc://@dataflow-map-pipe-1c5a64ee-13
Start data provider ipc://@dataflow-map-pipe-1c5a64ee-14
Start data provider ipc://@dataflow-map-pipe-1c5a64ee-15
2018-11-18 19:57:23.729985: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2018-11-18 19:57:23.994393: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:895] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-11-18 19:57:23.994853: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1105] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.683
pciBusID: 0000:01:00.0
totalMemory: 10.91GiB freeMemory: 9.63GiB
2018-11-18 19:57:23.994897: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1195] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:01:00.0, compute capability: 6.1)
WARNING:tensorflow:From /home/tangyu/softwares/anaconda3/envs/tensorflow-gpu=1.5/lib/python3.6/site-packages/tensorflow/python/util/tf_should_use.py:118: initialize_all_variables (from tensorflow.python.ops.variables) is deprecated and will be removed after 2017-03-02.
Instructions for updating:
Use tf.global_variables_initializer instead.
/home/tangyu/softwares/anaconda3/envs/tensorflow-gpu=1.5/lib/python3.6/site-packages/tensorflow/python/ops/gradients_impl.py:97: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
INFO:tensorflow:Restoring parameters from /home/tangyu/Desktop/light_head_rcnn/data/imagenet_weights/res101.ckpt

It gets stuck here. I don't know what is wrong. How can I fix it?

TonyTangYu avatar Nov 18 '18 12:11 TonyTangYu

How long is it stuck there for? Do you see a percentage bar load?

karansomaiah avatar Nov 19 '18 17:11 karansomaiah

@karansomaiah In fact, it was stuck for a long time. I didn't see the percentage bar. I had tried several times before. Today I checked which part of the code was wrong. I found that the code in /lib/utils/dpflow/prefetching_iter.py might cause this problem, because the 64th and 65th lines call a function named wait. I commented out those lines and the training process is running now. But I wonder whether that would cause a terrible result.
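For context, if I understand prefetching_iter.py correctly, the wait is part of a producer/consumer handoff built on threading.Event. Below is only a minimal sketch of that pattern (the class and variable names are mine, not the repo's), but it shows why the wait can block forever when the data providers never deliver a batch:

    import threading

    class PrefetchingIterSketch:
        """Minimal sketch of an Event-based prefetching iterator (illustrative only)."""

        def __init__(self, data_iter):
            self.data_iter = data_iter
            self.next_batch = None
            self.current_batch = None
            self.data_ready = threading.Event()   # set by the producer when a batch is ready
            self.data_taken = threading.Event()   # set by the consumer when it took the batch
            self.data_taken.set()
            self.thread = threading.Thread(target=self._producer, daemon=True)
            self.thread.start()

        def _producer(self):
            while True:
                self.data_taken.wait()            # wait until the last batch was consumed
                self.data_taken.clear()
                try:
                    self.next_batch = next(self.data_iter)
                except StopIteration:
                    self.next_batch = None
                self.data_ready.set()             # signal: a new batch is available

        def iter_next(self):
            self.data_ready.wait()                # <-- the wait that hangs if the producer
            self.data_ready.clear()               #     never delivers a batch
            self.current_batch = self.next_batch
            self.data_taken.set()
            return self.current_batch is not None

        def forward(self):
            if self.iter_next():
                return self.current_batch
            raise StopIteration

If the provider processes crash or block (for example on a bad image path), data_ready is never set and iter_next() waits forever, which would look exactly like a silent hang at the "Restoring parameters" line.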

TonyTangYu avatar Nov 20 '18 03:11 TonyTangYu

@TonyTangYu That makes total sense. The interesting thing is, it works fine on my local computer, but on the cluster I face the same problem as you. Will keep you updated.

Update: It isn't giving me any errors. I had some issues with reading in the data for training on the cluster, but resolved them. I didn't have to comment out the respective lines.

karansomaiah avatar Nov 20 '18 14:11 karansomaiah

@karansomaiah Thanks for your response and update. Could you please tell me what your issues were and how you fixed them? Thank you! Perhaps I will run into the same error as you. Perhaps there are issues with loading the data.

TonyTangYu avatar Nov 21 '18 01:11 TonyTangYu

@TonyTangYu Definitely! I was trying to read my data directly from an S3 bucket by changing train_root_folder to point to the S3 bucket, but that led to it getting stuck. After getting the data locally, it started to run. Hope this helps.

karansomaiah avatar Nov 21 '18 17:11 karansomaiah

@karansomaiah Thank you very much! You helped me a lot!

TonyTangYu avatar Nov 22 '18 04:11 TonyTangYu

@karansomaiah

Use tf.global_variables_initializer instead.
/home/liuji/anaconda3/envs/tensorflow/lib/python3.6/site-packages/tensorflow/python/ops/gradients_impl.py:97: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
INFO:tensorflow:Restoring parameters from /home/liuji/light_head_rcnn/data/imagenet_weights/res101.ckpt

^CTraceback (most recent call last):
  File "train.py", line 264, in <module>
    train(args)
  File "train.py", line 186, in train
    blobs_list = prefetch_data_layer.forward()
  File "/home/liuji/light_head_rcnn/lib/utils/dpflow/prefetching_iter.py", line 78, in forward
    if self.iter_next():
  File "/home/liuji/light_head_rcnn/lib/utils/dpflow/prefetching_iter.py", line 65, in iter_next
    e.wait()
  File "/home/liuji/anaconda3/envs/tensorflow/lib/python3.6/threading.py", line 551, in wait
    signaled = self._cond.wait(timeout)
  File "/home/liuji/anaconda3/envs/tensorflow/lib/python3.6/threading.py", line 295, in wait
    waiter.acquire()

Hello, I also meet the same problem. Could you give a more detailed solution? Thanks!

lji72 avatar Dec 03 '18 09:12 lji72


So, can you give details of the solution?

chanajianyu avatar Dec 19 '18 03:12 chanajianyu


I met the same problem as in the original post above. Can you show a more detailed solution?

chanajianyu avatar Dec 19 '18 05:12 chanajianyu

@chanajianyu I don't know whether I can help you. At that time, I found that the code in /lib/utils/dpflow/prefetching_iter.py might cause this problem, because the 64th and 65th lines call a function named wait. I commented out that code. Maybe you can print something and track where it gets stuck.
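For example, rather than deleting the wait outright, you could give it a timeout so the hang becomes a visible message. A rough sketch, assuming e is the threading.Event that lines 64 and 65 wait on (adapt it to the real variable names in the file):

    # Sketch: replace a bare e.wait() with a timed wait that reports progress.
    # Event.wait(timeout) returns True once the event is set, False on timeout.
    while not e.wait(timeout=10):
        print('still waiting for the data provider -- '
              'check your image paths and the dpflow pipes')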

TonyTangYu avatar Dec 20 '18 02:12 TonyTangYu


However, I followed your advice and a new problem occurred: it raises StopIteration in forward():

    def forward(self):
        """Get blobs and copy them into this layer's top blob vector."""
        if self.iter_next():
            return self.current_batch
        else:
            raise StopIteration

chanajianyu avatar Dec 20 '18 03:12 chanajianyu

Having the same problem with getting stuck forever at:

"Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
INFO:tensorflow:Restoring parameters from ~/light_head_rcnn/data/imagenet_weights/res101.ckpt

The problem is with the threading function, as noticed by many others: light_head_rcnn/lib/utils/dpflow/prefetching_iter.py, lines 83 and 69.

Did anyone resolve this? Commenting out the e.wait() calls does not work at all; it is not the right solution.

Using TF 1.5.0 with a single GPU (GTX 1070) in an nvidia-docker container.
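If I read prefetching_iter.py correctly, removing the wait just lets iter_next() run before the producer thread has delivered anything, so the hang turns into the StopIteration reported above. A toy sketch of that race (all names made up):

    import threading
    import time

    data_ready = threading.Event()
    batch = None

    def producer():
        # Stands in for the dpflow data provider, which may be slow or blocked.
        global batch
        time.sleep(1.0)
        batch = 'some blobs'
        data_ready.set()

    threading.Thread(target=producer, daemon=True).start()

    # The original code calls data_ready.wait() here, so the consumer blocks
    # until a batch exists. With the wait commented out, this check runs
    # immediately:
    if not data_ready.is_set():
        raise StopIteration('no batch produced yet')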

mbruchalski1 avatar Dec 23 '18 10:12 mbruchalski1

Hey guys, I ran into the same problem using custom data, and I think I know what the problem is.

First, check the 'fpath' field in your .odgt files: if in your config.py you set the repo as root_dir, 'fpath' must not include the whole (absolute) path to the picture. When I generated the .odgt files I didn't pay attention to config.py or to 'fpath', so the script was trying to load images with the wrong path.
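You can see what the loader will actually open by joining the paths yourself; a quick sketch (the folder and file names here are made up):

    import os

    train_root_folder = '/home/me/light_head_rcnn/data/MSCOCO'  # example value

    # Relative fpath: joined under the root folder, as intended.
    print(os.path.join(train_root_folder, 'train2014/img_001.jpg'))
    # /home/me/light_head_rcnn/data/MSCOCO/train2014/img_001.jpg

    # Absolute fpath: os.path.join throws the root folder away entirely.
    print(os.path.join(train_root_folder, '/somewhere/else/img_001.jpg'))
    # /somewhere/else/img_001.jpg

So an absolute 'fpath' makes the root folder from config.py irrelevant, and the data provider can end up reading paths that don't exist on the machine you train on.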

Hope it helped!

YellowKyu avatar Jan 28 '19 09:01 YellowKyu


Perfect answer!!! I solved this problem by checking config.py: in this file, the 'train_root_folder' did not have a "/" at the end, so the program couldn't find the images. Thank you!!!!
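For reference, a missing trailing "/" only bites when the path is built by plain string concatenation; os.path.join inserts the separator itself. A tiny sketch (paths made up):

    import os

    root = '/data/MSCOCO'             # note: no trailing slash
    fname = 'train2014/img.jpg'

    print(root + fname)               # /data/MSCOCOtrain2014/img.jpg  -- broken
    print(os.path.join(root, fname))  # /data/MSCOCO/train2014/img.jpg -- slash-safe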

FeiWard avatar Sep 23 '19 11:09 FeiWard

Indeed, the image paths given in "fpath" are joined from behind in light_head_rcnn/experiments/lizming/lighthead[...]/dataset.py this way:

    os.path.join(train_root_folder, record['fpath'])

where train_root_folder is specified in light_head_rcnn/experiments/lizming/lighthead[...]/config.py:

    train_root_folder = os.path.join(root_dir, 'data/MSCOCO')

Finally, root_dir is also defined in config.py as:

    root_dir = osp.abspath(osp.join(osp.dirname(__file__), '..', '..', '..'))

which by default gives the root of the repository (with 'light_head_rcnn' the last folder in the path).
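Given that, a quick sanity check before launching training is to resolve a few records the same way dataset.py does and confirm the files exist. A minimal sketch, assuming the usual one-JSON-record-per-line .odgt layout (adjust the .odgt filename and root folder to your setup):

    import json
    import os

    # Example values -- use whatever your config.py actually resolves to.
    train_root_folder = '/path/to/light_head_rcnn/data/MSCOCO'
    odgt_file = os.path.join(train_root_folder, 'odformat/coco_trainvalmini.odgt')

    with open(odgt_file) as f:
        for i, line in enumerate(f):
            record = json.loads(line)
            full_path = os.path.join(train_root_folder, record['fpath'])
            if not os.path.exists(full_path):
                print('missing:', full_path)
            if i >= 99:  # the first 100 records are usually enough to spot a bad root
                break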

masotrix avatar Nov 24 '19 16:11 masotrix