Minimum graph for inference
Hello. Thank you for your great work. In nets/SqueezeDet.py we have:
self._add_forward_graph()
self._add_interpretation_graph()
self._add_loss_graph()
self._add_train_graph()
self._add_viz_graph()
What is the purpose of each of these graphs, and which is the minimum needed for inference? I am guessing it is the forward_graph. In this case, what would the names of the output nodes be? In https://github.com/BichenWuUCB/squeezeDet/issues/35, @Lisandro79 uses the following output nodes:
"bbox/trimming/bbox:0",
"probability/score:0",
"probability/class_idx:0"
These are from the interpretation_graph, not the forward_graph. I am writing my own custom inference scripts, and the model is not as fast as I would like when I follow the procedure outlined in #35; SqueezeDet comes out slower than some much larger models I've tested. I suspect this is because I am not saving the correct and/or minimal graph for inference.
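For reference, my inference script loads a frozen graph and fetches those three tensors, roughly like this (a sketch: the .pb path is a placeholder, and the input tensor name image_input:0 is my assumption based on #35):

import numpy as np
import tensorflow as tf

# Load a graph frozen as in issue #35 (the path is a placeholder).
graph_def = tf.GraphDef()
with tf.gfile.GFile('frozen_squeezedet.pb', 'rb') as f:
    graph_def.ParseFromString(f.read())

graph = tf.Graph()
with graph.as_default():
    tf.import_graph_def(graph_def, name='')

# Output names are from #35; the input name is an assumption.
image_input = graph.get_tensor_by_name('image_input:0')
det_boxes = graph.get_tensor_by_name('bbox/trimming/bbox:0')
det_probs = graph.get_tensor_by_name('probability/score:0')
det_class = graph.get_tensor_by_name('probability/class_idx:0')

with tf.Session(graph=graph) as sess:
    # Dummy 512x512 input with batch size 1 (real code subtracts the mean).
    im = np.zeros((1, 512, 512, 3), dtype=np.float32)
    boxes, probs, classes = sess.run(
        [det_boxes, det_probs, det_class],
        feed_dict={image_input: im})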
I am getting 550-600ms inference speeds (this does not include NMS time) on 512x512 realtime input (i.e., batch size of 1) on an NVIDIA TX2.
[UPDATE:] My questions about forward_graph and interpretation_graph still stand, but I have since been able to get much better inference speeds by modifying my inference script. The following metrics were taken at an input resolution of 512x512, so they should not be compared directly with the paper's reported inference speeds.
With default TF (roughly 18fps):
Took 0.0535703890491277 secs to perform forward pass
Took 0.024277875083498657 secs to perform NMS
And, with XLA JIT compilation (roughly 30fps):
Took 0.03044898994266987 secs to perform forward pass
Took 0.03193185699637979 secs to perform NMS
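For completeness, those numbers are plain wall-clock timings around the session call, something like the following (filter_prediction here is a hypothetical stand-in for the repo's top-k + per-class NMS step from nn_skeleton.py, ported out of the model class for the frozen-graph setup):

import time

start = time.time()
boxes, probs, classes = sess.run(
    [det_boxes, det_probs, det_class],
    feed_dict={image_input: im})
print('Took {} secs to perform forward pass'.format(time.time() - start))

start = time.time()
# Hypothetical port of the repo's filtering/NMS step.
final_boxes, final_probs, final_classes = filter_prediction(
    boxes[0], probs[0], classes[0])
print('Took {} secs to perform NMS'.format(time.time() - start))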
Thanks for your question and discovery about the JIT compilation.
self._add_forward_graph()
self._add_interpretation_graph()
self._add_loss_graph()
self._add_train_graph()
self._add_viz_graph()
Not all of them are related to inference. The minimal set of operations you need is defined in the forward and interpretation graphs. You can try deleting the others for inference-speed purposes, but my understanding is that it should not matter: the inference outputs do not depend on those operations, so they should never be evaluated during inference.
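If you want to be certain, you can build a model object that constructs only those two graphs, something like this (a sketch against the repo's nets module; it assumes the SqueezeDet constructor does nothing else your inference path needs):

import tensorflow as tf

from nn_skeleton import ModelSkeleton
from squeezeDet import SqueezeDet


class SqueezeDetInference(SqueezeDet):
  """SqueezeDet variant that skips the loss, train, and viz graphs."""

  def __init__(self, mc, gpu_id=0):
    with tf.device('/gpu:{}'.format(gpu_id)):
      ModelSkeleton.__init__(self, mc)
      self._add_forward_graph()
      self._add_interpretation_graph()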
Besides, you can also try trimming the graph using this function.
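For example, trimming down to the interpretation outputs with TensorFlow's graph utilities could look like this (a sketch: the node names are the ones from #35, and sess is assumed to hold the restored model):

import tensorflow as tf
from tensorflow.python.framework import graph_util

output_nodes = [
    'bbox/trimming/bbox',
    'probability/score',
    'probability/class_idx',
]

# Fold variables into constants and keep only the ops needed to
# compute the listed outputs; loss/train/viz ops are dropped.
frozen = graph_util.convert_variables_to_constants(
    sess, sess.graph_def, output_nodes)

with tf.gfile.GFile('squeezedet_inference.pb', 'wb') as f:
  f.write(frozen.SerializeToString())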
Lastly, NMS is not properly optimized, so if anyone could contribute or point us to an optimized NMS implementation, that would be great.
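One direction worth trying (an untested sketch, not what the repo currently does) is moving NMS into the graph with tf.image.non_max_suppression. It is class-agnostic here for brevity, and note that it expects [y1, x1, y2, x2] corner boxes while the interpretation graph outputs [cx, cy, w, h]:

import tensorflow as tf

def nms_in_graph(boxes_cxcywh, scores, max_out=64, iou_thresh=0.4):
  # boxes_cxcywh: [N, 4] tensor of (cx, cy, w, h) boxes for one image.
  cx, cy, w, h = tf.unstack(boxes_cxcywh, axis=1)
  # Convert to the corner format tf.image.non_max_suppression expects.
  corners = tf.stack(
      [cy - h / 2.0, cx - w / 2.0, cy + h / 2.0, cx + w / 2.0], axis=1)
  keep = tf.image.non_max_suppression(
      corners, scores, max_output_size=max_out, iou_threshold=iou_thresh)
  return tf.gather(boxes_cxcywh, keep), tf.gather(scores, keep)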
@villanuevab Hi, villanuevab. 30 FPS is really amazing to me! Would you please share how you modified squeezeDet (code or a tutorial)? I would be really grateful for that.
Best
(update)
I followed the tutorial given by TensorFlow and modified some lines in demo.py,
from
with tf.Session(config= tf.ConfigProto(allow_soft_placement=True)) as sess:
to
config = tf.ConfigProto(allow_soft_placement=True)
config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1
with tf.Session(config=config) as sess:
and then detected 4 .png images in a row, recording the timing info. I get:

Total time: 1.4299, detection time: 1.0560, filter time: 0.0094
Image detection output saved to ./data/out/out_000002.png
Total time: 1.5537, detection time: 0.0169, filter time: 0.0148
Image detection output saved to ./data/out/out_000003.png
Total time: 1.6881, detection time: 0.0158, filter time: 0.0105
Image detection output saved to ./data/out/out_000001.png
Total time: 1.8189, detection time: 0.0153, filter time: 0.0160
Image detection output saved to ./data/out/out_000004.png
But the XLA JIT change does not seem to be working; the times look very close whether I enable the JIT or not. NEED HELP!
I am using a 4x TITAN X server, and the information printed during the run is pasted below.

ubuntu@ubuntu-Super-Server:~/LSJ_New/squeezeDet$ python ./src/demo.py /gpu:1
2018-03-19 21:16:17.094741: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2018-03-19 21:16:17.094794: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2018-03-19 21:16:17.094816: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2018-03-19 21:16:17.094822: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2018-03-19 21:16:17.094827: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2018-03-19 21:16:17.435892: I tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 0 with properties: name: TITAN X (Pascal) major: 6 minor: 1 memoryClockRate (GHz) 1.531 pciBusID 0000:02:00.0 Total memory: 11.90GiB Free memory: 11.25GiB
2018-03-19 21:16:17.741368: W tensorflow/stream_executor/cuda/cuda_driver.cc:523] A non-primary context 0x4b76ad0 exists before initializing the StreamExecutor. We haven't verified StreamExecutor works with that.
2018-03-19 21:16:17.742449: I tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 1 with properties: name: TITAN X (Pascal) major: 6 minor: 1 memoryClockRate (GHz) 1.531 pciBusID 0000:03:00.0 Total memory: 11.90GiB Free memory: 11.75GiB
2018-03-19 21:16:18.055598: W tensorflow/stream_executor/cuda/cuda_driver.cc:523] A non-primary context 0x4b7aff0 exists before initializing the StreamExecutor. We haven't verified StreamExecutor works with that.
2018-03-19 21:16:18.056689: I tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 2 with properties: name: TITAN X (Pascal) major: 6 minor: 1 memoryClockRate (GHz) 1.531 pciBusID 0000:82:00.0 Total memory: 11.90GiB Free memory: 11.75GiB
2018-03-19 21:16:18.395128: W tensorflow/stream_executor/cuda/cuda_driver.cc:523] A non-primary context 0x4b7f540 exists before initializing the StreamExecutor. We haven't verified StreamExecutor works with that.
2018-03-19 21:16:18.396372: I tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 3 with properties: name: TITAN X (Pascal) major: 6 minor: 1 memoryClockRate (GHz) 1.531 pciBusID 0000:83:00.0 Total memory: 11.90GiB Free memory: 11.75GiB
2018-03-19 21:16:18.397531: I tensorflow/core/common_runtime/gpu/gpu_device.cc:847] Peer access not supported between device ordinals 0 and 2
2018-03-19 21:16:18.397561: I tensorflow/core/common_runtime/gpu/gpu_device.cc:847] Peer access not supported between device ordinals 0 and 3
2018-03-19 21:16:18.397611: I tensorflow/core/common_runtime/gpu/gpu_device.cc:847] Peer access not supported between device ordinals 1 and 2
2018-03-19 21:16:18.397644: I tensorflow/core/common_runtime/gpu/gpu_device.cc:847] Peer access not supported between device ordinals 1 and 3
2018-03-19 21:16:18.397677: I tensorflow/core/common_runtime/gpu/gpu_device.cc:847] Peer access not supported between device ordinals 2 and 0
2018-03-19 21:16:18.397706: I tensorflow/core/common_runtime/gpu/gpu_device.cc:847] Peer access not supported between device ordinals 2 and 1
2018-03-19 21:16:18.398768: I tensorflow/core/common_runtime/gpu/gpu_device.cc:847] Peer access not supported between device ordinals 3 and 0
2018-03-19 21:16:18.398805: I tensorflow/core/common_runtime/gpu/gpu_device.cc:847] Peer access not supported between device ordinals 3 and 1
2018-03-19 21:16:18.398924: I tensorflow/core/common_runtime/gpu/gpu_device.cc:976] DMA: 0 1 2 3
2018-03-19 21:16:18.398937: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 0: Y Y N N
2018-03-19 21:16:18.398945: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 1: Y Y N N
2018-03-19 21:16:18.398953: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 2: N N Y Y
2018-03-19 21:16:18.398972: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 3: N N Y Y
2018-03-19 21:16:18.398986: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: TITAN X (Pascal), pci bus id: 0000:02:00.0)
2018-03-19 21:16:18.398997: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:1) -> (device: 1, name: TITAN X (Pascal), pci bus id: 0000:03:00.0)
2018-03-19 21:16:18.399006: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:2) -> (device: 2, name: TITAN X (Pascal), pci bus id: 0000:82:00.0)
2018-03-19 21:16:18.399014: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:3) -> (device: 3, name: TITAN X (Pascal), pci bus id: 0000:83:00.0)
(update) It seems I hadn't compiled TensorFlow from source with JIT support enabled. I followed the official guide, installed TensorFlow 1.0.0 built from source, and luckily it works. I see a speed-up from about 0.026s to 0.017s on one TITAN X GPU.
BTW, I found this issue; I hope it can help someone: https://github.com/tensorflow/tensorflow/issues/11730
done.