Minimum graph for inference
Hello. Thank you for your great work. In nets/SqueezeDet.py we have:
self._add_forward_graph()
self._add_interpretation_graph()
self._add_loss_graph()
self._add_train_graph()
self._add_viz_graph()
What is the purpose of each of these graphs, and which is the minimum needed for inference? I am guessing it is the forward_graph. In this case, what would the names of the output nodes be? In https://github.com/BichenWuUCB/squeezeDet/issues/35, @Lisandro79 uses the following output nodes:
"bbox/trimming/bbox:0",
"probability/score:0",
"probability/class_idx:0"
These are from the interpretation_graph, not the forward_graph. I am writing my own custom inference scripts, and the model is not as fast as I would like when I follow the procedure outlined in #35; SqueezeDet comes out slower than some much larger models I've tested. I suspect this is because I am not saving the correct and/or minimal graph for inference.
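For reference, my inference script loads a frozen graph and fetches those three tensors, roughly like this (a sketch: the .pb path is a placeholder, and the input tensor name image_input:0 is my assumption based on #35):

import numpy as np
import tensorflow as tf

# Load a graph frozen as in issue #35 (the path is a placeholder).
graph_def = tf.GraphDef()
with tf.gfile.GFile('frozen_squeezedet.pb', 'rb') as f:
    graph_def.ParseFromString(f.read())

graph = tf.Graph()
with graph.as_default():
    tf.import_graph_def(graph_def, name='')

# Output names are from #35; the input name is an assumption.
image_input = graph.get_tensor_by_name('image_input:0')
det_boxes = graph.get_tensor_by_name('bbox/trimming/bbox:0')
det_probs = graph.get_tensor_by_name('probability/score:0')
det_class = graph.get_tensor_by_name('probability/class_idx:0')

with tf.Session(graph=graph) as sess:
    # Dummy 512x512 input with batch size 1 (real code subtracts the mean).
    im = np.zeros((1, 512, 512, 3), dtype=np.float32)
    boxes, probs, classes = sess.run(
        [det_boxes, det_probs, det_class],
        feed_dict={image_input: im})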
I am getting 550-600ms inference speeds (this does not include NMS time) on 512x512 realtime input (i.e., batch size of 1) on an NVIDIA TX2.
[UPDATE:] My questions about forward_graph and interpretation_graph still stand, but I have since been able to get much better inference speeds by modifying my inference script. The following metrics were taken at an input resolution of 512x512, so they should not be compared directly with the paper's reported inference speeds.
With default TF (roughly 18fps):
Took 0.0535703890491277 secs to perform forward pass
Took 0.024277875083498657 secs to perform NMS
And, with XLA JIT compilation (roughly 30fps):
Took 0.03044898994266987 secs to perform forward pass
Took 0.03193185699637979 secs to perform NMS
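For completeness, those numbers are plain wall-clock timings around the session call, something like the following (filter_prediction here is a hypothetical stand-in for the repo's top-k + per-class NMS step from nn_skeleton.py, ported out of the model class for the frozen-graph setup):

import time

start = time.time()
boxes, probs, classes = sess.run(
    [det_boxes, det_probs, det_class],
    feed_dict={image_input: im})
print('Took {} secs to perform forward pass'.format(time.time() - start))

start = time.time()
# Hypothetical port of the repo's filtering/NMS step.
final_boxes, final_probs, final_classes = filter_prediction(
    boxes[0], probs[0], classes[0])
print('Took {} secs to perform NMS'.format(time.time() - start))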
Thanks for your question and discovery about the JIT compilation.
self._add_forward_graph()
self._add_interpretation_graph()
self._add_loss_graph()
self._add_train_graph()
self._add_viz_graph()
Not all of them are related to inference. The minimal set of operations you need is defined in the forward and interpretation graphs. You can try deleting the others for inference-speed purposes, but my understanding is that it should not matter: the inference outputs do not depend on those operations, so they should never be evaluated during inference.
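If you want to be certain, you can build a model object that constructs only those two graphs, something like this (a sketch against the repo's nets module; it assumes the SqueezeDet constructor does nothing else your inference path needs):

import tensorflow as tf

from nn_skeleton import ModelSkeleton
from squeezeDet import SqueezeDet


class SqueezeDetInference(SqueezeDet):
  """SqueezeDet variant that skips the loss, train, and viz graphs."""

  def __init__(self, mc, gpu_id=0):
    with tf.device('/gpu:{}'.format(gpu_id)):
      ModelSkeleton.__init__(self, mc)
      self._add_forward_graph()
      self._add_interpretation_graph()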
Besides, you can also try trimming the graph using this function.
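For example, trimming down to the interpretation outputs with TensorFlow's graph utilities could look like this (a sketch: the node names are the ones from #35, and sess is assumed to hold the restored model):

import tensorflow as tf
from tensorflow.python.framework import graph_util

output_nodes = [
    'bbox/trimming/bbox',
    'probability/score',
    'probability/class_idx',
]

# Fold variables into constants and keep only the ops needed to
# compute the listed outputs; loss/train/viz ops are dropped.
frozen = graph_util.convert_variables_to_constants(
    sess, sess.graph_def, output_nodes)

with tf.gfile.GFile('squeezedet_inference.pb', 'wb') as f:
  f.write(frozen.SerializeToString())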
Lastly, NMS is not properly optimized, so if anyone could contribute or point us to an optimized NMS implementation, that would be great.
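One direction worth trying (an untested sketch, not what the repo currently does) is moving NMS into the graph with tf.image.non_max_suppression. It is class-agnostic here for brevity, and note that it expects [y1, x1, y2, x2] corner boxes while the interpretation graph outputs [cx, cy, w, h]:

import tensorflow as tf

def nms_in_graph(boxes_cxcywh, scores, max_out=64, iou_thresh=0.4):
  # boxes_cxcywh: [N, 4] tensor of (cx, cy, w, h) boxes for one image.
  cx, cy, w, h = tf.unstack(boxes_cxcywh, axis=1)
  # Convert to the corner format tf.image.non_max_suppression expects.
  corners = tf.stack(
      [cy - h / 2.0, cx - w / 2.0, cy + h / 2.0, cx + w / 2.0], axis=1)
  keep = tf.image.non_max_suppression(
      corners, scores, max_output_size=max_out, iou_threshold=iou_thresh)
  return tf.gather(boxes_cxcywh, keep), tf.gather(scores, keep)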
@villanuevab Hi, villanuevab. 30 FPS is really amazing to me! Would you please share how you modified squeezeDet (code or a tutorial)? I would be really grateful for that.
Best
(update)
I followed the tutorial given by TensorFlow and modified some lines in demo.py,
from
with tf.Session(config= tf.ConfigProto(allow_soft_placement=True)) as sess:
to
config = tf.ConfigProto(allow_soft_placement=True)
config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1
with tf.Session(config=config) as sess:
and then detected 4 .png images in a row, recording the timing info. I get:

Total time: 1.4299, detection time: 1.0560, filter time: 0.0094
Image detection output saved to ./data/out/out_000002.png
Total time: 1.5537, detection time: 0.0169, filter time: 0.0148
Image detection output saved to ./data/out/out_000003.png
Total time: 1.6881, detection time: 0.0158, filter time: 0.0105
Image detection output saved to ./data/out/out_000001.png
Total time: 1.8189, detection time: 0.0153, filter time: 0.0160
Image detection output saved to ./data/out/out_000004.png
But the XLA JIT change does not seem to be working; the times look very close whether I enable the JIT or not. NEED HELP!
I am using a 4x TITAN X server, and the information printed during the run is pasted below.

ubuntu@ubuntu-Super-Server:~/LSJ_New/squeezeDet$ python ./src/demo.py /gpu:1
2018-03-19 21:16:17.094741: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2018-03-19 21:16:17.094794: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2018-03-19 21:16:17.094816: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2018-03-19 21:16:17.094822: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2018-03-19 21:16:17.094827: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2018-03-19 21:16:17.435892: I tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 0 with properties: name: TITAN X (Pascal) major: 6 minor: 1 memoryClockRate (GHz) 1.531 pciBusID 0000:02:00.0 Total memory: 11.90GiB Free memory: 11.25GiB
2018-03-19 21:16:17.741368: W tensorflow/stream_executor/cuda/cuda_driver.cc:523] A non-primary context 0x4b76ad0 exists before initializing the StreamExecutor. We haven't verified StreamExecutor works with that.
2018-03-19 21:16:17.742449: I tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 1 with properties: name: TITAN X (Pascal) major: 6 minor: 1 memoryClockRate (GHz) 1.531 pciBusID 0000:03:00.0 Total memory: 11.90GiB Free memory: 11.75GiB
2018-03-19 21:16:18.055598: W tensorflow/stream_executor/cuda/cuda_driver.cc:523] A non-primary context 0x4b7aff0 exists before initializing the StreamExecutor. We haven't verified StreamExecutor works with that.
2018-03-19 21:16:18.056689: I tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 2 with properties: name: TITAN X (Pascal) major: 6 minor: 1 memoryClockRate (GHz) 1.531 pciBusID 0000:82:00.0 Total memory: 11.90GiB Free memory: 11.75GiB
2018-03-19 21:16:18.395128: W tensorflow/stream_executor/cuda/cuda_driver.cc:523] A non-primary context 0x4b7f540 exists before initializing the StreamExecutor. We haven't verified StreamExecutor works with that.
2018-03-19 21:16:18.396372: I tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 3 with properties: name: TITAN X (Pascal) major: 6 minor: 1 memoryClockRate (GHz) 1.531 pciBusID 0000:83:00.0 Total memory: 11.90GiB Free memory: 11.75GiB
2018-03-19 21:16:18.397531: I tensorflow/core/common_runtime/gpu/gpu_device.cc:847] Peer access not supported between device ordinals 0 and 2
2018-03-19 21:16:18.397561: I tensorflow/core/common_runtime/gpu/gpu_device.cc:847] Peer access not supported between device ordinals 0 and 3
2018-03-19 21:16:18.397611: I tensorflow/core/common_runtime/gpu/gpu_device.cc:847] Peer access not supported between device ordinals 1 and 2
2018-03-19 21:16:18.397644: I tensorflow/core/common_runtime/gpu/gpu_device.cc:847] Peer access not supported between device ordinals 1 and 3
2018-03-19 21:16:18.397677: I tensorflow/core/common_runtime/gpu/gpu_device.cc:847] Peer access not supported between device ordinals 2 and 0
2018-03-19 21:16:18.397706: I tensorflow/core/common_runtime/gpu/gpu_device.cc:847] Peer access not supported between device ordinals 2 and 1
2018-03-19 21:16:18.398768: I tensorflow/core/common_runtime/gpu/gpu_device.cc:847] Peer access not supported between device ordinals 3 and 0
2018-03-19 21:16:18.398805: I tensorflow/core/common_runtime/gpu/gpu_device.cc:847] Peer access not supported between device ordinals 3 and 1
2018-03-19 21:16:18.398924: I tensorflow/core/common_runtime/gpu/gpu_device.cc:976] DMA: 0 1 2 3
2018-03-19 21:16:18.398937: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 0: Y Y N N
2018-03-19 21:16:18.398945: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 1: Y Y N N
2018-03-19 21:16:18.398953: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 2: N N Y Y
2018-03-19 21:16:18.398972: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 3: N N Y Y
2018-03-19 21:16:18.398986: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: TITAN X (Pascal), pci bus id: 0000:02:00.0)
2018-03-19 21:16:18.398997: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:1) -> (device: 1, name: TITAN X (Pascal), pci bus id: 0000:03:00.0)
2018-03-19 21:16:18.399006: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:2) -> (device: 2, name: TITAN X (Pascal), pci bus id: 0000:82:00.0)
2018-03-19 21:16:18.399014: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:3) -> (device: 3, name: TITAN X (Pascal), pci bus id: 0000:83:00.0)
(update) It seems I hadn't compiled TensorFlow from source with JIT support enabled. I followed the official guide, installed TensorFlow 1.0.0 built from source, and luckily it works. I see a speed-up from about 0.026s to 0.017s on one TITAN X GPU.
BTW, I found this issue; I hope it can help someone: https://github.com/tensorflow/tensorflow/issues/11730
done.