super-gradients
Iterating over predictions is very slow
Describe the bug
The code listed at https://github.com/Deci-AI/super-gradients/blob/master/documentation/source/ModelPredictions.md#access-detection-results runs very slowly.
To Reproduce
Use a pre-trained model to predict, then loop over the predictions.
predict FPS: 30.321 iterate FPS: 5.738
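A minimal sketch of how such numbers can be measured (illustrative only; the model, input video, and paths are assumptions, not the reporter's exact setup):

import time
import torch
from super_gradients.common.object_names import Models
from super_gradients.training import models

model = models.get(Models.YOLO_NAS_S, pretrained_weights="coco")
model = model.to("cuda" if torch.cuda.is_available() else "cpu")

# Time the predict() call itself.
start = time.perf_counter()
images_predictions = model.predict("my_video.mp4")  # hypothetical input video
predict_time = time.perf_counter() - start

# Time a plain loop over the returned predictions, counting frames as we go.
n_frames = 0
start = time.perf_counter()
for image_prediction in images_predictions:
    bboxes = image_prediction.prediction.bboxes_xyxy
    n_frames += 1
iterate_time = time.perf_counter() - start

print("predict FPS:", n_frames / predict_time)
print("iterate FPS:", n_frames / iterate_time)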
I have the same bug with the following code:
# Imports needed by this snippet (dxcam for screen capture; utils is the reporter's own helper module).
import time
import dxcam
import torch
import utils
from super_gradients.common.object_names import Models
from super_gradients.training import models

def capture():
    cam = dxcam.create()
    frame_counter = 0
    start_time = time.perf_counter()
    cam.start(target_fps=144)
    while True:
        image = cam.get_latest_frame()
        if image is not None:  # New frame
            prediction = model.predict(image)
            bboxes, confidence, labels, class_names = get_prediction_info(prediction)
            # draw_boxes(image, bboxes)
            frame_counter += 1
        else:
            print("No frame")
        print("FPS: ", frame_counter / (time.perf_counter() - start_time))

def load_model():
    weights_path = './checkpoints/ckpt_best.pth'
    device = 'cuda' if torch.cuda.is_available() else "cpu"
    model = models.get(Models.YOLO_NAS_S, checkpoint_path=weights_path, num_classes=2)
    model = model.to(device)
    utils.print_info("Selected device: " + device)
    return model

def get_prediction_info(predictions):
    for image_prediction in predictions:
        class_names = image_prediction.class_names
        labels = image_prediction.prediction.labels
        confidence = image_prediction.prediction.confidence
        bboxes = image_prediction.prediction.bboxes_xyxy
    return bboxes, confidence, labels, class_names

if __name__ == '__main__':
    model = load_model()
    capture()
With this code I get 2 FPS, but when I comment out the line bboxes, confidence, labels, class_names = get_prediction_info(prediction) in the capture() function, I get 60 FPS.
It seems like there's a lot of overhead in the model.predict() function. I was able to put together this function based on the predict code to speed up inference for single frames. You can probably get better speed if you write it yourself from scratch, but I was able to go from 15 FPS to 20 FPS.
NOTE: the model must be in eval() mode or this function will throw errors.
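For context, a minimal setup sketch preceding the pipeline construction below (the checkpoint path and class count are placeholders):

import torch
from super_gradients.common.object_names import Models
from super_gradients.training import models

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = models.get(Models.YOLO_NAS_S, checkpoint_path='./checkpoints/ckpt_best.pth', num_classes=2)
model = model.to(device).eval()  # eval() is required before using the pipeline below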
from super_gradients.training.pipelines.pipelines import DetectionPipeline

# make sure to set IOU and confidence in the pipeline constructor
pipeline = DetectionPipeline(
    model=model,
    image_processor=model._image_processor,
    post_prediction_callback=model.get_post_prediction_callback(iou=0.25, conf=0.30),
    class_names=model._class_names,
)
def get_prediction(image_in, pipeline, model, device):
    '''Obtains a DetectionPrediction object via the pipeline from a single input RGB image.'''
    # Preprocess
    preprocessed_image, processing_metadata = pipeline.image_processor.preprocess_image(image=image_in.copy())
    # Predict
    with torch.no_grad():
        torch_input = torch.Tensor(preprocessed_image).unsqueeze(0).to(device)
        model_output = model(torch_input)
        prediction = pipeline._decode_model_output(model_output, model_input=torch_input)
    # Postprocess
    return pipeline.image_processor.postprocess_predictions(predictions=prediction[0], metadata=processing_metadata)
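A hypothetical call site for illustration (the image path is a placeholder; the attributes mirror the ones accessed elsewhere in this thread):

import cv2

frame = cv2.imread('frame.jpg')                 # placeholder image path
frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)  # get_prediction expects an RGB image
detection = get_prediction(frame, pipeline, model, device)
print(detection.bboxes_xyxy, detection.confidence, detection.labels)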
Hi @olyashok, @Naofel-eal, @itberrios, I am not fully sure which performance you are comparing to. It seems like you are referring to Experiment 1 (see below), but as you can see, iterating doesn't take time at all; it is just going over an already computed list of objects.
Experiment 1
import time
import torch
from super_gradients.common.object_names import Models
from super_gradients.training import models

# Note that currently only YoloX, PPYoloE and YOLO-NAS are supported.
model = models.get(Models.YOLO_NAS_L, pretrained_weights="coco")
# We want to use cuda if available to speed up inference.
model = model.to("cuda" if torch.cuda.is_available() else "cpu")

start_predict = time.perf_counter()
images_predictions = model.predict(
    "../../../../documentation/source/images/examples/pose_elephant_flip.gif",
)
print("Predict: ", time.perf_counter() - start_predict)

start_iterate = time.perf_counter()
for image_prediction in images_predictions:
    class_names = image_prediction.class_names
    labels = image_prediction.prediction.labels
    confidence = image_prediction.prediction.confidence
    bboxes = image_prediction.prediction.bboxes_xyxy
    for i, (label, conf, bbox) in enumerate(zip(labels, confidence, bboxes)):
        pass
print("Iterate: ", time.perf_counter() - start_iterate)
print(f"Over: {len(images_predictions)} predictions")
Predict: 3.76509415009059
Iterate: 0.00015927606727927923
Over: 77 predictions
Experiment 2
Predicting image by image
import time
import torch
from super_gradients.common.object_names import Models
from super_gradients.training import models
from super_gradients.training.utils.media.video import load_video

# Note that currently only YoloX, PPYoloE and YOLO-NAS are supported.
model = models.get(Models.YOLO_NAS_L, pretrained_weights="coco")
# We want to use cuda if available to speed up inference.
model = model.to("cuda" if torch.cuda.is_available() else "cpu")

frames, _fps = load_video("../../../../documentation/source/images/examples/pose_elephant_flip.gif")

start = time.perf_counter()
for frame in frames:
    image_prediction = model.predict(frame)[0]  # Running on batch = 1, so taking the first prediction.
    class_names = image_prediction.class_names
    labels = image_prediction.prediction.labels
    confidence = image_prediction.prediction.confidence
    bboxes = image_prediction.prediction.bboxes_xyxy
    for i, (label, conf, bbox) in enumerate(zip(labels, confidence, bboxes)):
        pass
print("Iterate: ", time.perf_counter() - start)
print(f"Over: {len(frames)} predictions")
Iterate: 5.154466142994352
Over: 77 predictions
In case you were running something similar to this experiment, it makes sense that it is slower, since images/frames are processed one at a time (a batched alternative is sketched below).
I would really appreciate it if you could provide a minimal benchmark snippet to help me understand what you are referring to exactly, so I can improve the predict implementation accordingly :)
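In the meantime, for reference, frame-by-frame calls like in Experiment 2 can be avoided by passing the whole list of frames to a single predict() call; a sketch reusing Experiment 2's frames list (timings will vary per setup):

start = time.perf_counter()
# frames is the list of numpy arrays returned by load_video() in Experiment 2
images_predictions = model.predict(frames)
for image_prediction in images_predictions:
    bboxes = image_prediction.prediction.bboxes_xyxy
print("Batched predict + iterate: ", time.perf_counter() - start)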
In my case, with a pre-trained model (average_model.pth):
model = models.get('yolo_nas_s', num_classes=1, checkpoint_path='average_model.pth')
start_predict = time.perf_counter()
images_predictions = model.predict(path_image, iou=model_iou, conf=model_conf)
print("Predict: ", time.perf_counter() - start_predict)

def decode_image_predictions(images_predictions):
    all_preds = []
    start_iterate = time.perf_counter()
    for image_prediction in images_predictions:
        class_names = image_prediction.class_names
        labels = image_prediction.prediction.labels
        confidence = image_prediction.prediction.confidence
        bboxes = image_prediction.prediction.bboxes_xyxy
        # for i, (label, conf, bbox) in enumerate(zip(labels, confidence, bboxes)):
        #     all_preds.append(....)
    print("Iterate: ", time.perf_counter() - start_iterate)
    print(f"Over: {len(images_predictions)} predictions")
    return all_preds
Predict: 0.11403741594403982
Iterate: 7.749675393104553
TypeError: object of type 'generator' has no len()
Update: this problem seems to happen when running the code after installing the package (super-gradients 3.1.1). However, when running the code using the sources from the repository, it is OK.
Hi @albertofernandezvillan ,
You just raised a very good point that didn't cross my mind. In 3.1.1, the predictions were not properly cast to a list and were instead stored as a generator (see issue https://github.com/Deci-AI/super-gradients/issues/956).
Iterating over the predictions would actually trigger the prediction processing, which is why in that case:
- Predict is fast
- Iterating is slow
- you get TypeError: object of type 'generator' has no len() when calling len(images_predictions)
But as you mentioned, we fixed it in the repository. So if you clone it or install SG with pip install git+https://github.com/Deci-AI/super-gradients, the processing will be done when calling model.predict(), as in my example. You will also be able to properly index the predictions (i.e. do things like images_predictions[1]) or check the length with len(images_predictions).
This change will be in the next release.
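In the meantime, a minimal workaround sketch for 3.1.1 (illustrative; it simply materializes the generator once, so the heavy processing happens up front instead of on every later use):

images_predictions = model.predict(path_image, iou=0.5, conf=0.3)
# Consuming the generator here triggers the per-image processing a single time.
images_predictions = list(images_predictions)
print(len(images_predictions))      # len() and indexing now work
first_prediction = images_predictions[0]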
Great!
Hi, just an update. Now, prediction seems slow. Please check with two different PCs (CPU only, no GPU):
conda create --name test-super-grads python=3.10.11
pip install git+https://github.com/Deci-AI/super-gradients
# Imports needed by this snippet.
import os
import time
from super_gradients.training import models

def test_inference():
    path_model = "./average_model.pth"
    model = models.get('yolo_nas_s', num_classes=1, checkpoint_path=path_model)
    path_images = "./test_images"
    list_images = os.listdir(path_images)
    for img_name in list_images:
        path_image = os.path.join(path_images, img_name)
        start_predict = time.perf_counter()
        images_predictions = model.predict(path_image, iou=0.5, conf=0.3)
        print("Predict: ", time.perf_counter() - start_predict)
        # images_predictions.show()
        start_iterate = time.perf_counter()
        images_predictions = list(images_predictions)
        for image_prediction in images_predictions:
            class_names = image_prediction.class_names
            labels = image_prediction.prediction.labels
            confidence = image_prediction.prediction.confidence
            bboxes = image_prediction.prediction.bboxes_xyxy
            # for i, (label, conf, bbox) in enumerate(zip(labels, confidence, bboxes)):
            #     print(label, conf, bbox)
        print("Iterate: ", time.perf_counter() - start_iterate)
        print(f"Over: {len(images_predictions)} predictions")
Results:
PC1: 11th Gen Intel(R) Core(TM) i7-11850H @ 2.50GHz with 16GB RAM
Predict: 1.3614261000184342
Iterate: 0.0001700000138953328
Over: 1 predictions
Predict: 1.3357476000091992
Iterate: 0.0001928000128827989
...
PC2: Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz with 16GB RAM
Predict: 7.8171619940549135
Iterate: 0.0002834419719874859
Over: 1 predictions
Predict: 7.68795962119475
Iterate: 0.00028508109971880913
Over: 1 predictions
...
Are the prediction times too slow, or just OK for CPU inference?
Update:
Changing the model-loading line, for example to this one:
model = models.get("yolo_nas_s", pretrained_weights="coco", checkpoint_path="./yolo_nas_s_coco.pth")
inference is faster:
PC1: 11th Gen Intel(R) Core(TM) i7-11850H @ 2.50GHz with 16GB RAM
Predict: 0.376152400043793
Iterate: 0.00014130002819001675
Over: 1 predictions
Based on this, I tried setting torch.set_flush_denormal(True) and now inference is faster.
On PC1 the times are the same as with the COCO pretrained weights, and on PC2 they dropped to:
Predict: 1.4609976662322879
Iterate: 0.00031989580020308495
Over: 1 predictions
Predict: 1.4408147241920233
Iterate: 0.0002572271041572094
So in summary, setting torch.set_flush_denormal(True) seems to help.
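For reference, a minimal sketch of where that call goes (it is set once, before loading the model; the checkpoint and image paths are placeholders):

import torch
from super_gradients.training import models

# Flush denormal floats to zero; denormals can make CPU inference dramatically slower.
torch.set_flush_denormal(True)

model = models.get('yolo_nas_s', num_classes=1, checkpoint_path='./average_model.pth')
images_predictions = model.predict('./test_images/some_image.jpg', iou=0.5, conf=0.3)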
On my side, the prediction is still very slow. I use super-gradients cloned from GitHub with an RTX 2060. I ran a test with a model trained on a custom dataset: the first prediction took 3 seconds, but the following ones take 0.1 s. Here is my code:
if __name__ == '__main__':
    # Initialize services
    logger = Logger()
    configuration = ConfigLoader(logger).load()
    object_detection_service = ObjectDetectionService(logger=logger)
    capture_service = CaptureService(configuration['capture'], logger=logger)
    frame_renderer_service = FrameRendererService(display_fps=False, logger=logger)
    for i in range(5):
        frame = capture_service.capture()
        start_time = time.perf_counter_ns()
        prediction = object_detection_service.process_frame(frame)._images_prediction_lst[0]
        end_time = time.perf_counter_ns()
        print("Time: ", (end_time - start_time) / 1000000000)
The ObjectDetectionService methods are implemented as follows:
def __init__(self, checkpoint_path='./checkpoints/ckpt_best.pth', logger=None) -> None:
    self.logger = Logger() if logger is None else logger
    self.checkpoint_path = checkpoint_path
    self.device = 'cuda' if cuda.is_available() else "cpu"
    self.model = self.load_model()
    self.logger.info(f"Successfully loaded model.")
    self.logger.info(f"Selected device: {self.device.upper()}.")
    self.logger.info(f"Object detection service initialized.")

def load_model(self):
    """
    Load the YOLO model and return it.
    """
    set_flush_denormal(True)
    self.logger.info(f"Loading model...")
    model = models.get(Models.YOLO_NAS_S, checkpoint_path=self.checkpoint_path, num_classes=2)
    model = model.to(self.device)
    return model

def process_frame(self, frame):
    """
    Run object detection using the provided model.
    """
    result = self.model.predict(frame)
    return result
The result of this program is the following:
Time: 3.389252
Time: 0.0991155
Time: 0.1113286
Time: 0.113765
Time: 0.1074544
@Naofel-eal, regarding the 3.4 s: is it always only the first batch that takes this long?
Note: in the next release, object_detection_service.process_frame(frame)[0] will be fixed, so you won't need to do object_detection_service.process_frame(frame)._images_prediction_lst[0] (you can already use it with pip install git+https://github.com/Deci-AI/super-gradients).
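If it is only the first batch, a common way to keep that one-time cost out of the measured loop is a warm-up pass before timing; a sketch reusing the names from your script (the dummy frame shape is an assumption):

import time
import numpy as np

# Hypothetical warm-up frame at the capture resolution; adjust the shape to your setup.
dummy_frame = np.zeros((720, 1280, 3), dtype=np.uint8)
object_detection_service.process_frame(dummy_frame)  # not timed

start_time = time.perf_counter_ns()
prediction = object_detection_service.process_frame(frame)
print("Time: ", (time.perf_counter_ns() - start_time) / 1000000000)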
Yes, it's only the first inference. Thank you, I will clone the repo. I converted the .pth to .onnx and then the .onnx to .engine with TensorRT to optimize the inference time. Is there any script or doc that explains how to run inference on an image from a .engine YOLO-NAS model?
We just introduced model.export() for YOLO-NAS, which simplifies the process of exporting the model and also includes any required pre/post-processing steps in the compiled graph for ease of use. You can check out the tutorial; it should cover your needs: https://github.com/Deci-AI/super-gradients/blob/master/documentation/source/models_export.md
It will very soon be added to our official documentation: https://docs.deci.ai/super-gradients/documentation/source/welcome.html
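For illustration, a minimal sketch of the export flow described in that tutorial (the output filename is a placeholder; see the linked document for the exact options):

from super_gradients.common.object_names import Models
from super_gradients.training import models

model = models.get(Models.YOLO_NAS_S, pretrained_weights="coco")
# Export to ONNX; per the tutorial, pre/post-processing is included in the graph.
export_result = model.export("yolo_nas_s.onnx")
print(export_result)  # describes the exported graph and how to run inference on it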
Exporting the model does not address the fact that .pt inference is still very slow. I get an average of 60 ms for YOLO_NAS_S vs 23 ms for YOLO_v8_S, both tested on GPU.
# cv2, torch and DetectionPipeline imports omitted; device and args are defined elsewhere in my script.
model = models.get(
    'yolo_nas_s',
    pretrained_weights="coco"
).to(device)

pipeline = DetectionPipeline(
    model=model.eval(),
    image_processor=model._image_processor,
    post_prediction_callback=model.get_post_prediction_callback(iou=args.iou, conf=args.conf),
    class_names=model._class_names,
    fuse_model=True
)

im = cv2.imread('/path/to/img.jpg')
preprocessed_image, processing_metadata = pipeline.image_processor.preprocess_image(image=im.copy())
with torch.no_grad():
    im = torch.Tensor(preprocessed_image).unsqueeze(0).to(device)
    model_output = pipeline.model(im)
    preds = pipeline._decode_model_output(model_output, model_input=im)[0]
There is still a huge computation overhead somewhere in the pipeline @Louis-Dupont
https://github.com/mikel-brostrom/yolo_tracking/discussions/1097
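One thing worth double-checking when comparing per-frame GPU latencies: CUDA calls are asynchronous, so timings taken without synchronization can be misleading. A purely illustrative sketch around the snippet above (preprocessed_image, pipeline and device come from that snippet):

import time
import torch

with torch.no_grad():
    torch_input = torch.Tensor(preprocessed_image).unsqueeze(0).to(device)
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # make sure the host-to-GPU copy has finished
    start = time.perf_counter()
    model_output = pipeline.model(torch_input)
    preds = pipeline._decode_model_output(model_output, model_input=torch_input)[0]
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # wait for all queued GPU work before stopping the timer
    print("Forward + decode: ", time.perf_counter() - start)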