SSD-Tensorflow

eval

Open xpandi-top opened this issue 5 years ago • 36 comments

When running eval_ssd_network.py, I get the following warning, and the resulting mAP is quite low:

2018-07-24 18:48:02.126672: W ./tensorflow/core/grappler/optimizers/graph_optimizer_stage.h:233] Failed to run optimizer ArithmeticOptimizer, stage HoistCommonFactor. Error: Node average_precision_voc07/ArithmeticOptimizer/HoistCommonFactor_Add_AddN is missing output properties at position :0 (num_outputs=0)
AP_VOC07/mAP[0.20843321784099034]
AP_VOC12/mAP[0.20235189944609927]
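The grappler message itself is a warning rather than a fatal error; evaluation still completes and prints the mAP. If the warning is bothersome, disabling the arithmetic optimizer in the session config usually silences it. A minimal sketch, assuming a TF 1.x session config (this is not code from the repository):

# Sketch: disable the grappler arithmetic optimizer to silence the warning.
import tensorflow as tf
from tensorflow.core.protobuf import rewriter_config_pb2

rewrite_options = rewriter_config_pb2.RewriterConfig(
    arithmetic_optimization=rewriter_config_pb2.RewriterConfig.OFF)
config = tf.ConfigProto(
    graph_options=tf.GraphOptions(rewrite_options=rewrite_options))
# Pass this config to the evaluation session, e.g. via the session_config
# argument that slim-style evaluation loops accept.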

xpandi-top avatar Jul 24 '18 11:07 xpandi-top

+1

foamliu avatar Aug 14 '18 23:08 foamliu

I started getting this message after upgrading to tensorflow 1.10.0 (from 1.8.0). However, my custom tensorflow code still runs.

prachiAeromana avatar Aug 17 '18 22:08 prachiAeromana

same problem here

HongyiDuanmu26 avatar Aug 24 '18 20:08 HongyiDuanmu26

Same here with tensorflow 1.8.0.

ryohachiuma avatar Aug 29 '18 12:08 ryohachiuma

So, how can we solve this problem?

XuDuoBiao avatar Sep 15 '18 12:09 XuDuoBiao

Me too. Have you solved this problem?

ZhuDaQing avatar Sep 24 '18 07:09 ZhuDaQing

Same issue when running the evaluation script, and the mAP is extremely low.

kisanzxy avatar Dec 12 '18 18:12 kisanzxy

Can anyone help us, please? Has anybody solved this?

Leon924 avatar Mar 02 '19 10:03 Leon924

I got the same problem, with a result just like yours. Has anyone solved it yet? @xpandi-top @foamliu @prachiAeromana @HongyiDuanmu26 @kemangjaka Need your help, thanks a lot!

Sulince avatar Mar 07 '19 08:03 Sulince

Hi, I'm using Ubuntu 16.04 and tensorflow-gpu 1.10.0 now, and I couldn't reproduce the error; the evaluation worked fine. When I originally got the error, I was using Windows.

What is your environment?

ryohachiuma avatar Mar 07 '19 09:03 ryohachiuma

Thanks for the reply! My environment: Ubuntu 18.04 + tensorflow-gpu 1.12.0 + Python 3.6. I changed my eval_ssd_network.py and metrics.py files following #321 and ran eval_ssd_network.py successfully, but the result looks like this:

2019-03-07 09:54:28.724070: W ./tensorflow/core/grappler/optimizers/graph_optimizer_stage.h:237] Failed to run optimizer ArithmeticOptimizer, stage HoistCommonFactor. Error: Node average_precision_voc07/ArithmeticOptimizer/HoistCommonFactor_Add_AddN is missing output properties at position :0 (num_outputs=0)
AP_VOC07/mAP[3.1303382901083712e-05]
AP_VOC12/mAP[1.4904059586232956e-05]

Please help! Could you send me your eval_ssd_network.py and metrics.py files if you don't mind? My email: [email protected] Thanks a lot!! @kemangjaka

Sulince avatar Mar 07 '19 09:03 Sulince

OK, let me correct myself: I also got the same warning as you, but I didn't get such a low mAP. What dataset are you using for the evaluation? I don't think the optimizer error is the problem.

ryohachiuma avatar Mar 07 '19 09:03 ryohachiuma

The dataset I used is VOCtest_06-Nov-2007, and the model is VGG_VOC0712_SSD_300x300_iter_120000.ckpt. What is your mAP? Are your dataset and model the same as mine? @kemangjaka

Sulince avatar Mar 07 '19 10:03 Sulince

I downloaded VOCtest_06-Nov-2007 dataset, and evaluated with the VGG_VOC0712_SSD_300x300_iter_120000.ckpt model.

So, the command I typed is the following.

python eval_ssd_network.py --eval_dir=./log_2007/ --dataset_dir=./data/ --dataset_name=pascalvoc_2007 --dataset_split_name=test --model_name=ssd_300_vgg --checkpoint_path=./checkpoints/VGG_VOC0712_SSD_300x300_iter_120000.ckpt --batch_size=1

And I got the mAP below.

AP_VOC07/mAP[0.59928033284390148]
AP_VOC12/mAP[0.60921384902021813]

Still quite low but not too low I think.

BTW, I didn't do any modifications to metrics.py

ryohachiuma avatar Mar 07 '19 11:03 ryohachiuma

The command I use is the same, as are the dataset and the model, so the environment is not the problem. Could you please send your eval_ssd_network.py file to my email so I can have a try? @kemangjaka

Sulince avatar Mar 07 '19 11:03 Sulince

Well, I only changed the flatten part; nothing else changed from the original file. https://github.com/balancap/SSD-Tensorflow/issues/321#issuecomment-469188867
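For reference, the #321 workaround amounts to flattening the nested tuples of metric update ops before handing them to slim's evaluation loop. A rough sketch of that idea; the names flatten and names_to_updates are assumptions, not verbatim repository code:

# Sketch of the #321 "flatten" workaround (names are assumptions).
# slim.evaluation.evaluate_once expects a flat list of update ops, but the
# metric dict built in eval_ssd_network.py can contain nested tuples.
def flatten(ops):
    flat = []
    for op in ops:
        if isinstance(op, (list, tuple)):
            flat.extend(flatten(op))
        else:
            flat.append(op)
    return flat

# eval_op = flatten(list(names_to_updates.values()))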

Could you try with tensorflow-gpu 1.10.0?

ryohachiuma avatar Mar 07 '19 11:03 ryohachiuma

@kemangjaka @Sulince Hi, have you solved the problem? I got that problem too, and I haven't been able to figure it out for a long time.

2019-03-08 22:41:45.604947: W ./tensorflow/core/grappler/optimizers/graph_optimizer_stage.h:237] Failed to run optimizer ArithmeticOptimizer, stage HoistCommonFactor. Error: Node average_precision_voc07/ArithmeticOptimizer/HoistCommonFactor_Add_AddN is missing output properties at position :0 (num_outputs=0)
AP_VOC07/mAP[0.00010226225830017356]
AP_VOC12/mAP[2.127145489434078e-05]

Leon924 avatar Mar 08 '19 14:03 Leon924

Hi, could you tell me the version of python, tensorflow, OS, and the command you typed? And also, did you modify any code from the original one?

ryohachiuma avatar Mar 08 '19 14:03 ryohachiuma

@kemangjaka Just like you said, I only added the flatten function. My environment is tf 1.10-gpu, Python 3.6, Red Hat 4.8.5. I think my environment is OK, because I can run the tutorial example.

And this is my command:

DATASET_DIR=/export/userhome/liqiang/liqiang/Deeplearning/SSD/SSD-datasets/VOC2007/VOCtest_06-Nov-2007/VOCdevkit/VOC2007_tfrecord/

EVAL_DIR=/export/userhome/liqiang/liqiang/Deeplearning/SSD/log_files/log_VOC2007/log_eval/

CHECKPOINT_PATH=/export/userhome/liqiang/liqiang/Deeplearning/SSD/ckpt/SSD_ckpt/VGG_VOC0712_SSD_300x300_iter_120000.ckpt/

CUDA_VISIBLE_DEVICES=3 python /export/userhome/liqiang/liqiang/Deeplearning/SSD/SSD-Tensorflow/eval_ssd_network.py \
--eval_dir=${EVAL_DIR} \
--dataset_dir=${DATASET_DIR} \
--dataset_name=pascalvoc_2007 \
--dataset_split_name=test \
--model_name=ssd_300_vgg \
--checkpoint_path=${CHECKPOINT_PATH} \
--batch_size=1

Leon924 avatar Mar 08 '19 14:03 Leon924

@petit-ami Could you post all of your output?

The command is exactly the same as mine. I don't know how to reproduce your results...

ryohachiuma avatar Mar 08 '19 15:03 ryohachiuma

@kemangjaka Here it is, please take a look:

WARNING:tensorflow:From /export/userhome/liqiang/liqiang/Deeplearning/SSD/SSD-Tensorflow/eval_ssd_network.py:113: get_or_create_global_step (from tensorflow.contrib.framework.python.ops.variables) is deprecated and will be removed in a future version. Instructions for updating: Please switch to tf.train.get_or_create_global_step

===========================================================================

SSD net parameters:

===========================================================================

{'anchor_offset': 0.5, 'anchor_ratios': [[2, 0.5], [2, 0.5, 3, 0.3333333333333333], [2, 0.5, 3, 0.3333333333333333], [2, 0.5, 3, 0.3333333333333333], [2, 0.5], [2, 0.5]], 'anchor_size_bounds': [0.15, 0.9], 'anchor_sizes': [(21.0, 45.0), (45.0, 99.0), (99.0, 153.0), (153.0, 207.0), (207.0, 261.0), (261.0, 315.0)], 'anchor_steps': [8, 16, 32, 64, 100, 300], 'feat_layers': ['block4', 'block7', 'block8', 'block9', 'block10', 'block11'], 'feat_shapes': [(38, 38), (19, 19), (10, 10), (5, 5), (3, 3), (1, 1)], 'img_shape': (300, 300), 'no_annotation_label': 21, 'normalizations': [20, -1, -1, -1, -1, -1], 'num_classes': 21, 'prior_scaling': [0.1, 0.1, 0.2, 0.2]}

===========================================================================

Training | Evaluation dataset files:

===========================================================================

['/export/userhome/liqiang/liqiang/Deeplearning/SSD/SSD-datasets/VOC2007/VOCtest_06-Nov-2007/VOCdevkit/VOC2007_tfrecord/voc_2007_test_000.tfrecord', '/export/userhome/liqiang/liqiang/Deeplearning/SSD/SSD-datasets/VOC2007/VOCtest_06-Nov-2007/VOCdevkit/VOC2007_tfrecord/voc_2007_test_001.tfrecord', '/export/userhome/liqiang/liqiang/Deeplearning/SSD/SSD-datasets/VOC2007/VOCtest_06-Nov-2007/VOCdevkit/VOC2007_tfrecord/voc_2007_test_002.tfrecord', '/export/userhome/liqiang/liqiang/Deeplearning/SSD/SSD-datasets/VOC2007/VOCtest_06-Nov-2007/VOCdevkit/VOC2007_tfrecord/voc_2007_test_003.tfrecord', '/export/userhome/liqiang/liqiang/Deeplearning/SSD/SSD-datasets/VOC2007/VOCtest_06-Nov-2007/VOCdevkit/VOC2007_tfrecord/voc_2007_test_004.tfrecord', '/export/userhome/liqiang/liqiang/Deeplearning/SSD/SSD-datasets/VOC2007/VOCtest_06-Nov-2007/VOCdevkit/VOC2007_tfrecord/voc_2007_test_005.tfrecord', '/export/userhome/liqiang/liqiang/Deeplearning/SSD/SSD-datasets/VOC2007/VOCtest_06-Nov-2007/VOCdevkit/VOC2007_tfrecord/voc_2007_test_006.tfrecord', '/export/userhome/liqiang/liqiang/Deeplearning/SSD/SSD-datasets/VOC2007/VOCtest_06-Nov-2007/VOCdevkit/VOC2007_tfrecord/voc_2007_test_007.tfrecord', '/export/userhome/liqiang/liqiang/Deeplearning/SSD/SSD-datasets/VOC2007/VOCtest_06-Nov-2007/VOCdevkit/VOC2007_tfrecord/voc_2007_test_008.tfrecord', '/export/userhome/liqiang/liqiang/Deeplearning/SSD/SSD-datasets/VOC2007/VOCtest_06-Nov-2007/VOCdevkit/VOC2007_tfrecord/voc_2007_test_009.tfrecord', '/export/userhome/liqiang/liqiang/Deeplearning/SSD/SSD-datasets/VOC2007/VOCtest_06-Nov-2007/VOCdevkit/VOC2007_tfrecord/voc_2007_test_010.tfrecord', '/export/userhome/liqiang/liqiang/Deeplearning/SSD/SSD-datasets/VOC2007/VOCtest_06-Nov-2007/VOCdevkit/VOC2007_tfrecord/voc_2007_test_011.tfrecord', '/export/userhome/liqiang/liqiang/Deeplearning/SSD/SSD-datasets/VOC2007/VOCtest_06-Nov-2007/VOCdevkit/VOC2007_tfrecord/voc_2007_test_012.tfrecord', '/export/userhome/liqiang/liqiang/Deeplearning/SSD/SSD-datasets/VOC2007/VOCtest_06-Nov-2007/VOCdevkit/VOC2007_tfrecord/voc_2007_test_013.tfrecord', '/export/userhome/liqiang/liqiang/Deeplearning/SSD/SSD-datasets/VOC2007/VOCtest_06-Nov-2007/VOCdevkit/VOC2007_tfrecord/voc_2007_test_014.tfrecord', '/export/userhome/liqiang/liqiang/Deeplearning/SSD/SSD-datasets/VOC2007/VOCtest_06-Nov-2007/VOCdevkit/VOC2007_tfrecord/voc_2007_test_015.tfrecord', '/export/userhome/liqiang/liqiang/Deeplearning/SSD/SSD-datasets/VOC2007/VOCtest_06-Nov-2007/VOCdevkit/VOC2007_tfrecord/voc_2007_test_016.tfrecord', '/export/userhome/liqiang/liqiang/Deeplearning/SSD/SSD-datasets/VOC2007/VOCtest_06-Nov-2007/VOCdevkit/VOC2007_tfrecord/voc_2007_test_017.tfrecord', '/export/userhome/liqiang/liqiang/Deeplearning/SSD/SSD-datasets/VOC2007/VOCtest_06-Nov-2007/VOCdevkit/VOC2007_tfrecord/voc_2007_test_018.tfrecord', '/export/userhome/liqiang/liqiang/Deeplearning/SSD/SSD-datasets/VOC2007/VOCtest_06-Nov-2007/VOCdevkit/VOC2007_tfrecord/voc_2007_test_019.tfrecord', '/export/userhome/liqiang/liqiang/Deeplearning/SSD/SSD-datasets/VOC2007/VOCtest_06-Nov-2007/VOCdevkit/VOC2007_tfrecord/voc_2007_test_020.tfrecord', '/export/userhome/liqiang/liqiang/Deeplearning/SSD/SSD-datasets/VOC2007/VOCtest_06-Nov-2007/VOCdevkit/VOC2007_tfrecord/voc_2007_test_021.tfrecord', '/export/userhome/liqiang/liqiang/Deeplearning/SSD/SSD-datasets/VOC2007/VOCtest_06-Nov-2007/VOCdevkit/VOC2007_tfrecord/voc_2007_test_022.tfrecord', '/export/userhome/liqiang/liqiang/Deeplearning/SSD/SSD-datasets/VOC2007/VOCtest_06-Nov-2007/VOCdevkit/VOC2007_tfrecord/voc_2007_test_023.tfrecord', 
'/export/userhome/liqiang/liqiang/Deeplearning/SSD/SSD-datasets/VOC2007/VOCtest_06-Nov-2007/VOCdevkit/VOC2007_tfrecord/voc_2007_test_024.tfrecord']

WARNING:tensorflow:From /export/userhome/liqiang/liqiang/Deeplearning/SSD/SSD-Tensorflow/eval_ssd_network.py:226: streaming_mean (from tensorflow.contrib.metrics.python.ops.metric_ops) is deprecated and will be removed in a future version. Instructions for updating: Please switch to tf.metrics.mean INFO:tensorflow:Evaluating None INFO:tensorflow:Starting evaluation at 2019-03-08-14:29:14 INFO:tensorflow:Graph was finalized. 2019-03-08 22:29:14.520042: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA 2019-03-08 22:29:14.971772: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 0 with properties: name: TITAN X (Pascal) major: 6 minor: 1 memoryClockRate(GHz): 1.531 pciBusID: 0000:84:00.0 totalMemory: 11.90GiB freeMemory: 4.26GiB 2019-03-08 22:29:14.971931: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1484] Adding visible gpu devices: 0 2019-03-08 22:29:21.058607: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix: 2019-03-08 22:29:21.058689: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0 2019-03-08 22:29:21.058710: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0: N 2019-03-08 22:29:21.072137: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 1218 MB memory) -> physical GPU (device: 0, name: TITAN X (Pascal), pci bus id: 0000:84:00.0, compute capability: 6.1) INFO:tensorflow:Running local_init_op. INFO:tensorflow:Done running local_init_op. 2019-03-08 22:29:45.913166: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.54GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. 2019-03-08 22:29:46.126467: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.07GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. 2019-03-08 22:29:46.147991: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.14GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. 2019-03-08 22:29:46.211770: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.37GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. 2019-03-08 22:29:46.239826: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.18GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. 2019-03-08 22:29:46.243550: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.35GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. 2019-03-08 22:29:46.333402: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.13GiB. 
The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. INFO:tensorflow:Evaluation [495/4952] INFO:tensorflow:Evaluation [990/4952] INFO:tensorflow:Evaluation [1485/4952] INFO:tensorflow:Evaluation [1980/4952] INFO:tensorflow:Evaluation [2475/4952] INFO:tensorflow:Evaluation [2970/4952] INFO:tensorflow:Evaluation [3465/4952] INFO:tensorflow:Evaluation [3960/4952] INFO:tensorflow:Evaluation [4455/4952] INFO:tensorflow:Evaluation [4950/4952] INFO:tensorflow:Evaluation [4952/4952] 2019-03-08 22:41:45.604947: W ./tensorflow/core/grappler/optimizers/graph_optimizer_stage.h:237] Failed to run optimizer ArithmeticOptimizer, stage HoistCommonFactor. Error: Node average_precision_voc07/ArithmeticOptimizer/HoistCommonFactor_Add_AddN is missing output properties at position :0 (num_outputs=0) AP_VOC07/mAP[0.00010226225830017356] AP_VOC12/mAP[2.127145489434078e-05] INFO:tensorflow:Finished evaluation at 2019-03-08-14:43:15 Time spent : 841.545 seconds. Time spent per BATCH: 0.170 seconds.

Leon924 avatar Mar 08 '19 15:03 Leon924

I found it. In your log, it says,

INFO:tensorflow:Evaluating None

That means the trained checkpoint file could not be loaded properly, so your evaluation is being run with a randomly initialized network. Is the checkpoint path correct?

@Sulince maybe your problem is exactly the same as this one.
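In other words, when --checkpoint_path points at a directory, the script looks up the latest checkpoint inside it and restores nothing if none is found. A minimal sketch of that resolution logic, assuming the usual slim-style pattern (the exact code in eval_ssd_network.py may differ):

import tensorflow as tf

checkpoint_path = "./checkpoints/VGG_VOC0712_SSD_300x300_iter_120000.ckpt"  # example path
if tf.gfile.IsDirectory(checkpoint_path):
    # For a directory, the latest checkpoint prefix inside it is used; this
    # returns None when the directory contains no checkpoint file, which is
    # what produces the "INFO:tensorflow:Evaluating None" log line.
    checkpoint_path = tf.train.latest_checkpoint(checkpoint_path)
print("Evaluating", checkpoint_path)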

ryohachiuma avatar Mar 08 '19 15:03 ryohachiuma

ARNING:tensorflow:From /export/userhome/liqiang/liqiang/Deeplearning/SSD/SSD-Tensorflow/eval_ssd_network.py:226: streaming_mean (from tensorflow.contrib.metrics.python.ops.metric_ops) is deprecated and will be removed in a future version. Instructions for updating: Please switch to tf.metrics.mean INFO:tensorflow:Evaluating /export/userhome/liqiang/liqiang/Deeplearning/SSD/ckpt/SSD_ckpt/VGG_VOC0712_SSD_300x300_iter_120000.ckpt/VGG_VOC0712_SSD_300x300_iter_120000.ckpt INFO:tensorflow:Starting evaluation at 2019-03-08-15:25:58 INFO:tensorflow:Graph was finalized. 2019-03-08 23:25:58.450231: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA 2019-03-08 23:25:58.889497: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 0 with properties: name: TITAN X (Pascal) major: 6 minor: 1 memoryClockRate(GHz): 1.531 pciBusID: 0000:84:00.0 totalMemory: 11.90GiB freeMemory: 4.26GiB 2019-03-08 23:25:58.889608: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1484] Adding visible gpu devices: 0 2019-03-08 23:26:13.981284: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix: 2019-03-08 23:26:13.981353: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0 2019-03-08 23:26:13.981373: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0: N 2019-03-08 23:26:14.010918: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 1218 MB memory) -> physical GPU (device: 0, name: TITAN X (Pascal), pci bus id: 0000:84:00.0, compute capability: 6.1) INFO:tensorflow:Restoring parameters from /export/userhome/liqiang/liqiang/Deeplearning/SSD/ckpt/SSD_ckpt/VGG_VOC0712_SSD_300x300_iter_120000.ckpt/VGG_VOC0712_SSD_300x300_iter_120000.ckpt INFO:tensorflow:Running local_init_op. INFO:tensorflow:Done running local_init_op. 2019-03-08 23:26:38.264869: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.54GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. 2019-03-08 23:26:38.448399: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.07GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. 2019-03-08 23:26:38.469665: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.14GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. 2019-03-08 23:26:38.472249: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.07GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. 2019-03-08 23:26:38.519175: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.37GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. 2019-03-08 23:26:38.572346: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.18GiB. 
The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. 2019-03-08 23:26:38.575961: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.35GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. 2019-03-08 23:26:38.630726: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.06GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. 2019-03-08 23:26:38.652513: W tensorflow/core/common_runtime/bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.13GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. INFO:tensorflow:Evaluation [495/4952] INFO:tensorflow:Evaluation [990/4952] INFO:tensorflow:Evaluation [1485/4952] INFO:tensorflow:Evaluation [1980/4952] INFO:tensorflow:Evaluation [2475/4952] INFO:tensorflow:Evaluation [2970/4952] INFO:tensorflow:Evaluation [3465/4952] INFO:tensorflow:Evaluation [3960/4952] INFO:tensorflow:Evaluation [4455/4952] INFO:tensorflow:Evaluation [4950/4952] INFO:tensorflow:Evaluation [4952/4952] 2019-03-08 23:34:59.828743: W ./tensorflow/core/grappler/optimizers/graph_optimizer_stage.h:237] Failed to run optimizer ArithmeticOptimizer, stage HoistCommonFactor. Error: Node average_precision_voc07/ArithmeticOptimizer/HoistCommonFactor_Add_AddN is missing output properties at position :0 (num_outputs=0) AP_VOC07/mAP[0.59928033284390148] AP_VOC12/mAP[0.60921384904878606] INFO:tensorflow:Finished evaluation at 2019-03-08-15:35:12 Time spent : 554.773 seconds. Time spent per BATCH: 0.112 seconds.

Thank you!!!! That's it: I needed to append the .ckpt file name to the checkpoint path. Thanks so much.

Leon924 avatar Mar 08 '19 15:03 Leon924

I have solved this problem. The cause is just as kemangjaka said ("INFO:tensorflow:Evaluating None"): just point to another model file and it will work. I think the VGG_VOC0712_SSD_300x300_iter_120000.ckpt model in the repository has something wrong with it, so don't use it and find another one. @kemangjaka @petit-ami

Sulince avatar Mar 09 '19 02:03 Sulince

By the way, have you trained the model successfully? When I train the model on the VOC07+12 dataset, my loss stays high and oscillates, as follows:

INFO:tensorflow:Recording summary at step 62230.
INFO:tensorflow:global step 62240: loss = 40.2912 (0.496 sec/step)
INFO:tensorflow:global step 62250: loss = 40.6664 (0.493 sec/step)
INFO:tensorflow:global step 62260: loss = 40.5154 (0.502 sec/step)
INFO:tensorflow:global step 62270: loss = 23.9944 (0.487 sec/step)
INFO:tensorflow:global step 62280: loss = 21.0998 (0.501 sec/step)
INFO:tensorflow:global step 62290: loss = 39.5273 (0.505 sec/step)
INFO:tensorflow:global step 62300: loss = 28.9741 (0.522 sec/step)
INFO:tensorflow:global step 62310: loss = 33.9893 (0.504 sec/step)
INFO:tensorflow:global step 62320: loss = 31.2430 (0.517 sec/step)
INFO:tensorflow:global step 62330: loss = 50.1789 (0.500 sec/step)
INFO:tensorflow:global step 62340: loss = 16.4918 (0.493 sec/step)

Here are my parameters:

DATASET_DIR=/home/sulince/SSD_tensorflow/VOC0713/tfrecords/
TRAIN_DIR=/home/sulince/SSD_tensorflow/train_model/
CHECKPOINT_PATH=/home/sulince/SSD_tensorflow/checkpoints/vgg_16.ckpt

python3 /home/sulince/SSD_tensorflow/train_ssd_network.py \
--train_dir=${TRAIN_DIR} \
--dataset_dir=${DATASET_DIR} \
--dataset_name=pascalvoc_2007 \
--dataset_split_name=train \
--model_name=ssd_300_vgg \
--checkpoint_path=${CHECKPOINT_PATH} \
--checkpoint_model_scope=vgg_16 \
--checkpoint_exclude_scopes=ssd_300_vgg/conv6,ssd_300_vgg/conv7,ssd_300_vgg/block8,ssd_300_vgg/block9,ssd_300_vgg/block10,ssd_300_vgg/block11,ssd_300_vgg/block4_box,ssd_300_vgg/block7_box,ssd_300_vgg/block8_box,ssd_300_vgg/block9_box,ssd_300_vgg/block10_box,ssd_300_vgg/block11_box \
--trainable_scopes=ssd_300_vgg/conv6,ssd_300_vgg/conv7,ssd_300_vgg/block8,ssd_300_vgg/block9,ssd_300_vgg/block10,ssd_300_vgg/block11,ssd_300_vgg/block4_box,ssd_300_vgg/block7_box,ssd_300_vgg/block8_box,ssd_300_vgg/block9_box,ssd_300_vgg/block10_box,ssd_300_vgg/block11_box \
--save_summaries_secs=60 \
--save_interval_secs=600 \
--weight_decay=0.0005 \
--optimizer=adam \
--learning_rate=0.001 \
--learning_rate_decay_factor=0.94 \
--batch_size=16 \
--gpu_memory_fraction=0.9

What is your loss? @kemangjaka @petit-ami

Sulince avatar Mar 09 '19 02:03 Sulince

@Sulince I was using VGG_VOC0712_SSD_300x300_iter_120000.ckpt yesterday and it worked.

AP_VOC07/mAP[0.59928033284390148] AP_VOC12/mAP[0.60921384904878606]

And today I ran another one, named VGG_VOC0712_SSD_300x300_ft_iter_120000.ckpt. It also worked, and gives a higher mAP. @kemangjaka

AP_VOC07/mAP[0.74313215403145927] AP_VOC12/mAP[0.76659716498723329]

I am fine-tuning the existing SSD checkpoint VGG_VOC0712_SSD_300x300_ft_iter_120000.ckpt, but the loss cannot converge; it oscillates around 100.

DATASET_DIR=/export/userhome/liqiang/liqiang/Deeplearning/SSD/SSD-datasets/VOC2012/VOCtrainval_11-May-2012/VOCdevkit/VOC2012_tfrecord/

TRAIN_DIR=/export/userhome/liqiang/liqiang/Deeplearning/SSD/log_files/log_finetune_2012/
CHECKPOINT_PATH=/export/userhome/liqiang/liqiang/Deeplearning/SSD/log_files/log_finetune_2012/model.ckpt-40000

CUDA_VISIBLE_DEVICES=2 python /export/userhome/liqiang/liqiang/Deeplearning/SSD/SSD-Tensorflow/train_ssd_network.py \
--train_dir=${TRAIN_DIR} \
--dataset_dir=${DATASET_DIR} \
--dataset_name=pascalvoc_2012 \
--dataset_split_name=train \
--model_name=ssd_300_vgg \
--checkpoint_path=${CHECKPOINT_PATH} \
--save_summaries_secs=60 \
--save_interval_secs=600 \
--weight_decay=0.05 \
--optimizer=adam \
--learning_rate=0.00000005 \
--batch_size=32

@Sulince And I remember that the last time I trained VGG16, I also got a similar result to yours. I am working on solving it.

Leon924 avatar Mar 09 '19 02:03 Leon924

WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/python/training/input.py:187: QueueRunner.init (from tensorflow.python.training.queue_runner_impl) is deprecated and will be removed in a future version. Instructions for updating: To construct input pipelines, use the tf.data module. WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/python/training/input.py:187: add_queue_runner (from tensorflow.python.training.queue_runner_impl) is deprecated and will be removed in a future version. Instructions for updating: To construct input pipelines, use the tf.data module. WARNING:tensorflow:From eval_ssd_network.py:231: streaming_mean (from tensorflow.contrib.metrics.python.ops.metric_ops) is deprecated and will be removed in a future version. Instructions for updating: Please switch to tf.metrics.mean INFO:tensorflow:Evaluating ./aug_ckout/model.ckpt-8415 INFO:tensorflow:Starting evaluation at 2019-03-19-05:37:03 INFO:tensorflow:Graph was finalized. 2019-03-19 13:37:04.015866: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA 2019-03-19 13:37:04.093802: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero 2019-03-19 13:37:04.094138: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1411] Found device 0 with properties: name: GeForce GTX 1060 major: 6 minor: 1 memoryClockRate(GHz): 1.6705 pciBusID: 0000:01:00.0 totalMemory: 5.93GiB freeMemory: 5.35GiB 2019-03-19 13:37:04.094152: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1490] Adding visible gpu devices: 0 2019-03-19 13:37:04.282793: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] Device interconnect StreamExecutor with strength 1 edge matrix: 2019-03-19 13:37:04.282824: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] 0 2019-03-19 13:37:04.282830: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0: N 2019-03-19 13:37:04.282990: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1103] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 607 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1060, pci bus id: 0000:01:00.0, compute capability: 6.1) INFO:tensorflow:Restoring parameters from ./aug_ckout/model.ckpt-8415 INFO:tensorflow:Running local_init_op. INFO:tensorflow:Done running local_init_op. WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py:804: start_queue_runners (from tensorflow.python.training.queue_runner_impl) is deprecated and will be removed in a future version. Instructions for updating: To construct input pipelines, use the tf.data module. 2019-03-19 13:37:07.813959: W tensorflow/core/common_runtime/bfc_allocator.cc:215] Allocator (GPU_0_bfc) ran out of memory trying to allocate 828.12MiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. 2019-03-19 13:37:07.857173: W tensorflow/core/common_runtime/bfc_allocator.cc:215] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.54GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. 
2019-03-19 13:37:07.876847: W tensorflow/core/common_runtime/bfc_allocator.cc:215] Allocator (GPU_0_bfc) ran out of memory trying to allocate 610.31MiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. 2019-03-19 13:37:07.909925: W tensorflow/core/common_runtime/bfc_allocator.cc:215] Allocator (GPU_0_bfc) ran out of memory trying to allocate 814.50MiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. 2019-03-19 13:37:07.998780: W tensorflow/core/common_runtime/bfc_allocator.cc:215] Allocator (GPU_0_bfc) ran out of memory trying to allocate 550.42MiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. 2019-03-19 13:37:08.018300: W tensorflow/core/common_runtime/bfc_allocator.cc:215] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.07GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. 2019-03-19 13:37:08.042239: W tensorflow/core/common_runtime/bfc_allocator.cc:215] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.14GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. 2019-03-19 13:37:08.043062: W tensorflow/core/common_runtime/bfc_allocator.cc:215] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.07GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. 2019-03-19 13:37:08.102594: W tensorflow/core/common_runtime/bfc_allocator.cc:215] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.18GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. 2019-03-19 13:37:08.106842: W tensorflow/core/common_runtime/bfc_allocator.cc:215] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.35GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available. 
INFO:tensorflow:Evaluation [195/1952] INFO:tensorflow:Evaluation [390/1952] INFO:tensorflow:Evaluation [585/1952] INFO:tensorflow:Evaluation [780/1952] INFO:tensorflow:Evaluation [975/1952] INFO:tensorflow:Evaluation [1170/1952] INFO:tensorflow:Evaluation [1365/1952] INFO:tensorflow:Evaluation [1560/1952] Traceback (most recent call last): File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1292, in _do_call return fn(*args) File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1277, in _run_fn options, feed_dict, fetch_list, target_list, run_metadata) File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1367, in _call_tf_sessionrun run_metadata) tensorflow.python.framework.errors_impl.InvalidArgumentError: Reduction axis 0 is empty in shape [0] [[{{node bboxes_matching_batch_dict/bboxes_matching_batch_4/map/while/bboxes_matching_single/while/ArgMax}} = ArgMax[T=DT_FLOAT, Tidx=DT_INT32, output_type=DT_INT64, _device="/job:localhost/replica:0/task:0/device:CPU:0"](bboxes_matching_batch_dict/bboxes_matching_batch_4/map/while/bboxes_matching_single/while/mul, bboxes_matching_batch_dict/bboxes_matching_batch_4/map/while/bboxes_matching_single/while/bboxes_jaccard/transpose_1/Range/start)]] [[{{node ssd_losses/cross_entropy_pos/value/_524}} = _SendT=DT_FLOAT, client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_3865_ssd_losses/cross_entropy_pos/value", _device="/job:localhost/replica:0/task:0/device:GPU:0"]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last): File "eval_ssd_network.py", line 361, in tf.app.run() File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/app.py", line 125, in run _sys.exit(main(argv)) File "eval_ssd_network.py", line 325, in main session_config=config) File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/slim/python/slim/evaluation.py", line 217, in evaluate_once config=session_config) File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/evaluation.py", line 212, in _evaluate_once session.run(eval_ops, feed_dict) File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 671, in run run_metadata=run_metadata) File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 1148, in run run_metadata=run_metadata) File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 1239, in run raise six.reraise(*original_exc_info) File "/usr/lib/python3/dist-packages/six.py", line 686, in reraise raise value File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 1224, in run return self._sess.run(*args, **kwargs) File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 1296, in run run_metadata=run_metadata) File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/monitored_session.py", line 1076, in run return self._sess.run(*args, **kwargs) File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 887, in run run_metadata_ptr) File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1110, in _run feed_dict_tensor, options, run_metadata) File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1286, in _do_run run_metadata) File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1308, in _do_call raise type(e)(node_def, op, message) tensorflow.python.framework.errors_impl.InvalidArgumentError: Reduction axis 0 is empty in shape [0] [[{{node bboxes_matching_batch_dict/bboxes_matching_batch_4/map/while/bboxes_matching_single/while/ArgMax}} = ArgMax[T=DT_FLOAT, Tidx=DT_INT32, output_type=DT_INT64, _device="/job:localhost/replica:0/task:0/device:CPU:0"](bboxes_matching_batch_dict/bboxes_matching_batch_4/map/while/bboxes_matching_single/while/mul, bboxes_matching_batch_dict/bboxes_matching_batch_4/map/while/bboxes_matching_single/while/bboxes_jaccard/transpose_1/Range/start)]] [[{{node ssd_losses/cross_entropy_pos/value/_524}} = _SendT=DT_FLOAT, client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_3865_ssd_losses/cross_entropy_pos/value", _device="/job:localhost/replica:0/task:0/device:GPU:0"]]

Caused by op 'bboxes_matching_batch_dict/bboxes_matching_batch_4/map/while/bboxes_matching_single/while/ArgMax', defined at: File "eval_ssd_network.py", line 361, in tf.app.run() File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/app.py", line 125, in run _sys.exit(main(argv)) File "eval_ssd_network.py", line 212, in main matching_threshold=FLAGS.matching_threshold) File "/home/jn/SSD-Tensorflow-master/tf_extended/bboxes.py", line 363, in bboxes_matching_batch matching_threshold) File "/home/jn/SSD-Tensorflow-master/tf_extended/bboxes.py", line 379, in bboxes_matching_batch infer_shape=True) File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/functional_ops.py", line 460, in map_fn maximum_iterations=n) File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 3274, in while_loop return_same_structure) File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 2994, in BuildLoop pred, body, original_loop_vars, loop_vars, shape_invariants) File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 2929, in _BuildLoop body_result = body(*packed_vars_for_body) File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 3243, in body = lambda i, lv: (i + 1, orig_body(*lv)) File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/functional_ops.py", line 449, in compute packed_fn_values = fn(packed_values) File "/home/jn/SSD-Tensorflow-master/tf_extended/bboxes.py", line 373, in matching_threshold), File "/home/jn/SSD-Tensorflow-master/tf_extended/bboxes.py", line 322, in bboxes_matching back_prop=False) File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 3274, in while_loop return_same_structure) File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 2994, in BuildLoop pred, body, original_loop_vars, loop_vars, shape_invariants) File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 2929, in _BuildLoop body_result = body(*packed_vars_for_body) File "/home/jn/SSD-Tensorflow-master/tf_extended/bboxes.py", line 296, in m_body idxmax = tf.cast(tf.argmax(jaccard, axis=0), tf.int32) File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/util/deprecation.py", line 488, in new_func return func(*args, **kwargs) File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/math_ops.py", line 88, in argmax return gen_math_ops.arg_max(input, axis, name=name, output_type=output_type) File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/gen_math_ops.py", line 787, in arg_max name=name) File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper op_def=op_def) File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/util/deprecation.py", line 488, in new_func return func(*args, **kwargs) File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 3272, in create_op op_def=op_def) File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 1768, in init self._traceback = tf_stack.extract_stack()

InvalidArgumentError (see above for traceback): Reduction axis 0 is empty in shape [0] [[{{node bboxes_matching_batch_dict/bboxes_matching_batch_4/map/while/bboxes_matching_single/while/ArgMax}} = ArgMax[T=DT_FLOAT, Tidx=DT_INT32, output_type=DT_INT64, _device="/job:localhost/replica:0/task:0/device:CPU:0"](bboxes_matching_batch_dict/bboxes_matching_batch_4/map/while/bboxes_matching_single/while/mul, bboxes_matching_batch_dict/bboxes_matching_batch_4/map/while/bboxes_matching_single/while/bboxes_jaccard/transpose_1/Range/start)]] [[{{node ssd_losses/cross_entropy_pos/value/_524}} = _SendT=DT_FLOAT, client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_3865_ssd_losses/cross_entropy_pos/value", _device="/job:localhost/replica:0/task:0/device:GPU:0"]]
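This crash typically comes from tf.argmax being applied to an empty jaccard tensor in tf_extended/bboxes.py when an image contributes no ground-truth boxes to the matching step. One possible workaround, sketched below as an assumption rather than the repository's official fix, is to guard the argmax with tf.cond:

import tensorflow as tf

def safe_argmax_axis0(jaccard):
    # Returns the argmax over axis 0, or -1 when the tensor is empty,
    # avoiding "Reduction axis 0 is empty in shape [0]".
    return tf.cond(
        tf.greater(tf.shape(jaccard)[0], 0),
        lambda: tf.cast(tf.argmax(jaccard, axis=0), tf.int32),
        lambda: tf.constant(-1, dtype=tf.int32))

# In the bboxes_matching loop body, idxmax = safe_argmax_axis0(jaccard) would
# replace the direct tf.argmax call; downstream code must then tolerate -1.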

JiangniHIT avatar Mar 19 '19 06:03 JiangniHIT

My mAP is so low; can anyone help me? Thank you!!

JiangniHIT avatar Mar 19 '19 06:03 JiangniHIT

@petit-ami I saw that you were training and testing on PASCAL VOC 2012. I am training and evaluating on PASCAL VOC 2012 right now.

1. I trained on the 2012 trainval split (17125 items) and tested on the 2012 test split (5138 items), but the mAP is about 0.038 whether or not the training starts from ssd_300_vgg.ckpt. Did you get a decent mAP? Could you give me some help, please?

2. I also trained on 07+12 (trainval of 2007 + trainval of 2012) and on 07++12 (trainval & test of 2007 + trainval of 2012); the results are almost the same.

Sincerely

ylqi007 avatar Mar 22 '19 19:03 ylqi007

I found it. In your log, it says,

INFO:tensorflow:Evaluating None

That means the trained checkpoint file could not be loaded properly, so your evaluation is being run with a randomly initialized network. Is the checkpoint path correct?

@Sulince maybe your problem is exactly the same as this one.

Thanks, I solved my problem your way.

SunNYNO1 avatar Mar 29 '19 13:03 SunNYNO1