nebuly
Is there a way to try speedster in docker container?
Ciao Diego,
I have tried your solution in several environments, but it seems hard to keep all the package versions correct.
I finally managed to run your notebook for yolov5 in Google Colab, but I can't see any improvement using my method to measure performance, which is the following code:
```python
import numpy as np

dummy_input = torch.randn(1, 3, 640, 640, dtype=torch.float).to(device)

# INIT LOGGERS
starter, ender = torch.cuda.Event(enable_timing=True), torch.cuda.Event(enable_timing=True)
repetitions = 100
no_data_write_timings = np.zeros((repetitions, 1))

# GPU-WARM-UP
for _ in range(10):
    _ = model(dummy_input)

# MEASURE PERFORMANCE WITHOUT DATA TRANSFERS
with torch.no_grad():
    for rep in range(repetitions):
        starter.record()
        # dummy_input_on_device = dummy_input.to(device)
        outputs = model(dummy_input)
        ender.record()
        # WAIT FOR GPU SYNC
        torch.cuda.synchronize()
        curr_time = starter.elapsed_time(ender)
        no_data_write_timings[rep] = curr_time

mean_no_data_write_syn = np.sum(no_data_write_timings) / repetitions
std_no_data_write_syn = np.std(no_data_write_timings)

print('Optimized model results WITHOUT data transfers:')
print('The Optimized model mean batch inference time is:' + str(mean_no_data_write_syn))
print('The Optimized model std batch inference time is:' + str(std_no_data_write_syn))
```
Unfortunately, it has been really hard to run it locally and I am still getting several errors. For example, while trying to install your library I get:

```
2023-01-12 14:00:14 | WARNING | Unable to install tensor_rt on this platform. The compiler will be skipped.
2023-01-12 14:00:14 | INFO | Trying to install deepsparse on the platform...
```
When I try to run your way of measuring performance:

```python
times = []
for _ in range(100):
    st = time.time()
    results = model("zidane.jpg")  # imgs[0] gives the same error
    times.append((time.time() - st) * 1000)

yolo_optimized_time = sum(times) / len(times)
print(f"Average prediction time: {yolo_optimized_time} ms")
```
I get the following error:

```
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
Cell In[30], line 4
2 for _ in range(100):
3 st = time.time()
----> 4 results = model("zidane.jpg")
5 times.append((time.time() - st)*1000)
6 yolo_optimized_time = sum(times) / len(times)
File ~/.virtualenvs/speedster/lib/python3.8/site-packages/torch/nn/modules/module.py:1194, in Module._call_impl(self, *input, **kwargs)
1190 # If we don't have any hooks, we want to skip the rest of the logic in
1191 # this function, and just call forward.
1192 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
1193 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194 return forward_call(*input, **kwargs)
1195 # Do not call functions when jit is used
1196 full_backward_hooks, non_full_backward_hooks = [], []
File ~/.virtualenvs/speedster/lib/python3.8/site-packages/torch/autograd/grad_mode.py:27, in _DecoratorContextManager.__call__.<locals>.decorate_context(*args, **kwargs)
24 @functools.wraps(func)
25 def decorate_context(*args, **kwargs):
26 with self.clone():
---> 27 return func(*args, **kwargs)
File ~/.cache/torch/hub/ultralytics_yolov5_master/models/common.py:705, in AutoShape.forward(self, ims, size, augment, profile)
702 with amp.autocast(autocast):
703 # Inference
704 with dt[1]:
--> 705 y = self.model(x, augment=augment) # forward
707 # Post-process
708 with dt[2]:
File ~/.virtualenvs/speedster/lib/python3.8/site-packages/torch/nn/modules/module.py:1194, in Module._call_impl(self, *input, **kwargs)
1190 # If we don't have any hooks, we want to skip the rest of the logic in
1191 # this function, and just call forward.
1192 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
1193 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194 return forward_call(*input, **kwargs)
1195 # Do not call functions when jit is used
1196 full_backward_hooks, non_full_backward_hooks = [], []
File ~/.cache/torch/hub/ultralytics_yolov5_master/models/common.py:515, in DetectMultiBackend.forward(self, im, augment, visualize)
512 im = im.permute(0, 2, 3, 1) # torch BCHW to numpy BHWC shape(1,320,192,3)
514 if self.pt: # PyTorch
--> 515 y = self.model(im, augment=augment, visualize=visualize) if augment or visualize else self.model(im)
516 elif self.jit: # TorchScript
517 y = self.model(im)
File ~/.virtualenvs/speedster/lib/python3.8/site-packages/torch/nn/modules/module.py:1194, in Module._call_impl(self, *input, **kwargs)
1190 # If we don't have any hooks, we want to skip the rest of the logic in
1191 # this function, and just call forward.
1192 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
1193 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194 return forward_call(*input, **kwargs)
1195 # Do not call functions when jit is used
1196 full_backward_hooks, non_full_backward_hooks = [], []
Cell In[20], line 9, in OptimizedYolo.forward(self, x, *args, **kwargs)
7 def forward(self, x, *args, **kwargs):
8 x = list(self.core(x)) # it's a tuple
----> 9 return self.head(x)
File ~/.virtualenvs/speedster/lib/python3.8/site-packages/torch/nn/modules/module.py:1194, in Module._call_impl(self, *input, **kwargs)
1190 # If we don't have any hooks, we want to skip the rest of the logic in
1191 # this function, and just call forward.
1192 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
1193 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194 return forward_call(*input, **kwargs)
1195 # Do not call functions when jit is used
1196 full_backward_hooks, non_full_backward_hooks = [], []
File ~/.cache/torch/hub/ultralytics_yolov5_master/models/yolo.py:59, in Detect.forward(self, x)
57 z = [] # inference output
58 for i in range(self.nl):
---> 59 x[i] = self.m[i](x[i]) # conv
60 bs, _, ny, nx = x[i].shape # x(bs,255,20,20) to x(bs,3,20,20,85)
61 x[i] = x[i].view(bs, self.na, self.no, ny, nx).permute(0, 1, 3, 4, 2).contiguous()
File ~/.virtualenvs/speedster/lib/python3.8/site-packages/torch/nn/modules/module.py:1194, in Module._call_impl(self, *input, **kwargs)
1190 # If we don't have any hooks, we want to skip the rest of the logic in
1191 # this function, and just call forward.
1192 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
1193 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194 return forward_call(*input, **kwargs)
1195 # Do not call functions when jit is used
1196 full_backward_hooks, non_full_backward_hooks = [], []
File ~/.virtualenvs/speedster/lib/python3.8/site-packages/torch/nn/modules/conv.py:463, in Conv2d.forward(self, input)
462 def forward(self, input: Tensor) -> Tensor:
--> 463 return self._conv_forward(input, self.weight, self.bias)
File ~/.virtualenvs/speedster/lib/python3.8/site-packages/torch/nn/modules/conv.py:459, in Conv2d._conv_forward(self, input, weight, bias)
455 if self.padding_mode != 'zeros':
456 return F.conv2d(F.pad(input, self._reversed_padding_repeated_twice, mode=self.padding_mode),
457 weight, bias, self.stride,
458 _pair(0), self.dilation, self.groups)
--> 459 return F.conv2d(input, weight, bias, self.stride,
460 self.padding, self.dilation, self.groups)
RuntimeError: Input type (c10::Half) and bias type (float) should be the same
```
So there are several things here:

- I think my method is not the best way to measure performance, because I am using `torch.cuda.synchronize()`, but in Google Colab the best model is TensorRT-based (see the timing sketch after this list).
- It has been impossible to install the TensorRT-based compiler.
- Locally, the best model is TorchScript-based and it seems I cannot even run your way of measuring performance.
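As a sanity check on the measurement itself, a simpler wall-clock variant with explicit synchronization could look like the sketch below (just a sketch, reusing `model` and `dummy_input` from the snippet above; the 100-iteration count is only an example):

```python
import time

import torch

# Minimal sketch: wall-clock timing with explicit synchronization, so the
# asynchronous CUDA kernels launched by model() have actually finished before
# the clock stops.
with torch.no_grad():
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(100):
        _ = model(dummy_input)
    torch.cuda.synchronize()
    mean_ms = (time.time() - start) * 1000 / 100

print(f"Mean batch inference time: {mean_ms:.3f} ms")
```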
I think the best way is to try your optimization inside a container. Do you have a container where we can try speedster?
Thanks in advance.
I was finally able to run it locally using the docker container from ultralytics.
However, I only get an optimization (according to your library) of 1.35x.
This optimization is not faster than the standard TensorRT export provided by ultralytics:

```bash
python export.py --include engine --device 0
```

Is there something I am missing?
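For reference, this is roughly how the exported engine can be loaded for the same CUDA-event timing loop (a sketch only, assuming it runs from inside the yolov5 repo and that `yolov5s.engine` is the file produced by the export command above):

```python
import torch
from models.common import DetectMultiBackend  # yolov5's multi-backend loader

# Sketch: load the TensorRT engine produced by export.py and warm it up, so it
# can be timed with the same CUDA-event loop used for the other models.
device = torch.device("cuda:0")
trt_model = DetectMultiBackend("yolov5s.engine", device=device)
dummy_input = torch.randn(1, 3, 640, 640, device=device)

for _ in range(10):
    _ = trt_model(dummy_input)  # warm-up
```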
Hi @hdnh2006,
thank you for the contribution. Happy to assist you and accelerate your model together.
We ran some tests again on Yolov5 in this colab and got the following results:
- Original model results WITHOUT data transfers:
  - The Original model mean batch inference time is: 8.265047960281372
  - The Original model std batch inference time is: 1.0898744956447648
- Optimized model results WITHOUT data transfers:
  - The Optimized model mean batch inference time is: 4.355607032775879
  - The Optimized model std batch inference time is: 1.2045244697740418
And regarding your first message:

- We noticed that as input size you used 640x640, while in the optimization we use 384x640. We suggest always using optimization data with a shape distribution very similar to the data that will later be used at inference, since the optimization is tied to the shapes of the input data (see the sketch after this list).
- We recommend setting the metric_drop_ths to about 0.05 to allow speedster to test more optimization techniques and further increase the latency speed-up.
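As a minimal sketch of both points (variable names like `core_wrapper` follow the notebook's naming, and the list-of-samples format for `input_data` is just one accepted option, so adapt it to how you actually build your inputs):

```python
import torch
from speedster import optimize_model

# Optimization inputs with the same shape that will be used at inference time
# (1x3x384x640 here), instead of 640x640.
input_data = [((torch.randn(1, 3, 384, 640),), torch.tensor([0])) for _ in range(100)]

# A metric_drop_ths around 0.05 lets speedster also try reduced-precision
# techniques, as long as the measured metric drop stays below the threshold.
optimized_core = optimize_model(
    model=core_wrapper,
    input_data=input_data,
    optimization_time="unconstrained",
    metric_drop_ths=0.05,
)
```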
About local testing:
- It would help us a lot to know the hardware you are running on (which CPU and GPU) and the operating system, so that we can try to replicate the issue and propose solutions.
- Can you also share with us the snippet where you are using the `optimize_model` function?
Thanks @diegofiori for your cooperation.
This notebook does not compare the optimized model after applying the following code:
```python
class OptimizedYolo(torch.nn.Module):
    def __init__(self, optimized_core, head_layer):
        super().__init__()
        self.core = optimized_core
        self.head = head_layer

    def forward(self, x, *args, **kwargs):
        x = list(self.core(x))  # it's a tuple
        return self.head(x)

final_core = OptimizedYolo(model_optimized, last_layer)
model.model.model = final_core
```
Why is it like this? Maybe this is where my mistake is, because I am applying my measurement code after these lines:
```python
import numpy as np

dummy_input = torch.randn(1, 3, 384, 640, dtype=torch.float).to(device)

# INIT LOGGERS
starter, ender = torch.cuda.Event(enable_timing=True), torch.cuda.Event(enable_timing=True)
repetitions = 100
no_data_write_timings = np.zeros((repetitions, 1))

# GPU-WARM-UP
for _ in range(10):
    _ = model(dummy_input)

# MEASURE PERFORMANCE WITHOUT DATA TRANSFERS
with torch.no_grad():
    for rep in range(repetitions):
        starter.record()
        # dummy_input_on_device = dummy_input.to(device)
        outputs = model(dummy_input)
        ender.record()
        # WAIT FOR GPU SYNC
        torch.cuda.synchronize()
        curr_time = starter.elapsed_time(ender)
        no_data_write_timings[rep] = curr_time

mean_no_data_write_syn = np.sum(no_data_write_timings) / repetitions
std_no_data_write_syn = np.std(no_data_write_timings)

print('Optimized model results WITHOUT data transfers:')
print('The Optimized model mean batch inference time is:' + str(mean_no_data_write_syn))
print('The Optimized model std batch inference time is:' + str(std_no_data_write_syn))
```
Then, what is the final model? I mean, I want to replace this line of code with your optimized model: https://github.com/ultralytics/yolov5/blob/cdd804d39ff84b413bde36a84006f51769b6043b/detect.py#L98
What should I put in its place?
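For reference, this is roughly how I am currently swapping in the optimized core (just a sketch following the wrapping pattern above; `OptimizedYolo`, `model_optimized` and `last_layer` are the objects defined earlier):

```python
import torch

# Sketch: load the hub model, replace its inner core with the optimized
# wrapper, and keep using the same high-level inference call as before.
model = torch.hub.load("ultralytics/yolov5", "yolov5s")
final_core = OptimizedYolo(model_optimized, last_layer)
model.model.model = final_core

results = model("zidane.jpg")  # same call as before the swap
```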
Tried locally with an RTX 2060, and these are the results I got:

TensorRT optimization by Ultralytics:

```
Optimized model by Ultralytics results WITHOUT data transfers:
The Optimized model by Ultralytics mean batch inference time is:1.3865331208705902
The Optimized model by Ultralytics std batch inference time is:0.011859980193208422
```

Nebullvm optimization:

```
Optimized model results WITHOUT data transfers:
The Optimized model mean batch inference time is:3.081862072944641
The Optimized model std batch inference time is:0.14880756351818578
```
😞😞😞😞😞
I tried again with your modified Google Colab notebook, and it is true that it is 2x faster than the original PyTorch version, but it is still slower than the TensorRT model provided by Ultralytics:

PyTorch model yolov5s:

```
Original model results WITHOUT data transfers:
The Original model mean batch inference time is:7.7102195215225215
The Original model std batch inference time is:0.967732421795685
```

Nebullvm optimization:

```
Optimized model results WITHOUT data transfers:
The Optimized model mean batch inference time is:3.6612223982810974
The Optimized model std batch inference time is:0.11819371827359895
```

TensorRT by Ultralytics:

```
Optimized model by Ultralytics results WITHOUT data transfers:
The Optimized model by Ultralytics mean batch inference time is:1.5688025617599488
The Optimized model by Ultralytics std batch inference time is:0.03936980002624129
```
Check the notebook I modified here: https://colab.research.google.com/drive/1Nde0tCx28g3BTe2nxfLhTCMgIreqcvtw?usp=sharing
Hello @hdnh2006,
Regarding the different performance with respect to the Ultralytics implementation, I think this can be due to the input you are giving to the `optimize_model` function. In fact, when the `metric_drop_ths` parameter is not given, speedster by default keeps the model in full 32-bit precision. Speedster supports both `fp16` and `int8` precisions, but you have to activate them by passing the `metric_drop_ths` parameter to the `optimize_model` function.
With fp16 precision in speedster I am getting 1.187 ms of inference time. I'm waiting for the int8 result.
Ok, so what values should I pass to get fp16 precision?
I can see in the documentation that `metric_drop_ths` is a float:
https://github.com/nebuly-ai/nebullvm/blob/8aacdd7593746fd3cb71e6575847f028c9f6193d/apps/accelerate/speedster/speedster/api/functions.py#L86
```python
model_optimized = optimize_model(
    model=core_wrapper,
    input_data=input_data,
    optimization_time="unconstrained",
    metric_drop_ths=0.1
)
```
I tried on a Tesla V100 32GB and these are the results I obtained:
Nebullvm optimization:
```python
model_optimized = optimize_model(
    model=core_wrapper,
    input_data=input_data,
    optimization_time="unconstrained",
    metric_drop_ths=0.1
)
```
```
2023-01-12 19:11:22 | INFO | Running Speedster on GPU
2023-01-12 19:11:23 | WARNING | Missing Frameworks: tensorflow.
Please install them to include them in the optimization pipeline.
2023-01-12 19:11:25 | INFO | Benchmark performance of original model
2023-01-12 19:11:26 | INFO | Original model latency: 0.005212109088897705 sec/iter
2023-01-12 19:11:27 | INFO | Optimizing with PytorchBackendCompiler and q_type: None.
2023-01-12 19:11:29 | INFO | Optimized model latency: 0.0037038326263427734 sec/iter
2023-01-12 19:11:29 | INFO | Optimizing with PytorchBackendCompiler and q_type: QuantizationType.HALF.
2023-01-12 19:11:29 | WARNING | Unable to trace model with torch.fx
2023-01-12 19:11:31 | INFO | Optimized model latency: 0.0037364959716796875 sec/iter
2023-01-12 19:11:31 | INFO | Optimizing with ONNXCompiler and q_type: None.
2023-01-12 19:11:33 | INFO | Optimized model latency: 0.006529092788696289 sec/iter
2023-01-12 19:11:33 | INFO | Optimizing with ONNXCompiler and q_type: QuantizationType.DYNAMIC.
2023-01-12 19:11:40 | WARNING | The optimized model will be discarded due to poor results obtained with the given metric.
2023-01-12 19:11:40 | INFO | Optimizing with ONNXCompiler and q_type: QuantizationType.HALF.
2023-01-12 19:11:43 | INFO | Optimized model latency: 0.005489349365234375 sec/iter
2023-01-12 19:11:43 | INFO | Optimizing with ONNXCompiler and q_type: QuantizationType.STATIC.
2023-01-12 19:11:58 | WARNING | The optimized model will be discarded due to poor results obtained with the given metric.
2023-01-12 19:11:58 | INFO | Optimizing with ONNXTensorRTCompiler and q_type: None.
2023-01-12 19:12:22 | INFO | Optimized model latency: 0.00531458854675293 sec/iter
2023-01-12 19:12:22 | INFO | Optimizing with ONNXTensorRTCompiler and q_type: QuantizationType.HALF.
2023-01-12 19:13:43 | WARNING | The optimized model will be discarded due to poor results obtained with the given metric.
2023-01-12 19:13:43 | INFO | Optimizing with ONNXTensorRTCompiler and q_type: QuantizationType.STATIC.
2023-01-12 19:16:53 | WARNING | The optimized model will be discarded due to poor results obtained with the given metric.
[ Speedster results on GPU]
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓
┃ Metric ┃ Original Model ┃ Optimized Model ┃
┣━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━┫
┃ backend ┃ PYTORCH ┃ TorchScript ┃
┃ latency ┃ 0.0052 sec/batch ┃ 0.0037 sec/batch ┃
┃ throughput ┃ 191.86 data/sec ┃ 269.99 data/sec ┃
┃ model size ┃ 35.18 MB ┃ 28.38 MB ┃
┃ metric drop (compute_relative_difference) ┃ ┃ 0 ┃
┃ speedup ┃ ┃ 1.41x ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┻━━━━━━━━━━━━━━━━━━┻━━━━━━━━━━━━━━━━━━━┛
```

```
Optimized model results WITHOUT data transfers:
The Optimized model mean batch inference time is:3.505920317173004
The Optimized model std batch inference time is:0.1691654167164366
```
TensorRT optimization by Ultralytics with half precision:

```
Optimized model by Ultralytics results WITHOUT data transfers:
The Optimized model by Ultralytics mean batch inference time is:1.442447043657303
The Optimized model by Ultralytics std batch inference time is:0.05623926929897324
```
There is definitely something I am doing wrong, but I am following all the steps you provide.
Anyway, as you can see in this notebook, the same happens on a Tesla T4.