super-gradients
Iterating over predictions is very slow
Describe the bug
The code listed at https://github.com/Deci-AI/super-gradients/blob/master/documentation/source/ModelPredictions.md#access-detection-results runs very slowly.
To Reproduce
Use a pre-trained model to predict, then loop over the predictions.
predict FPS: 30.321 iterate FPS: 5.738
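A minimal sketch of how such numbers can be measured (illustrative only; the model, input video, and paths are assumptions, not the reporter's exact setup):

import time
import torch
from super_gradients.common.object_names import Models
from super_gradients.training import models

model = models.get(Models.YOLO_NAS_S, pretrained_weights="coco")
model = model.to("cuda" if torch.cuda.is_available() else "cpu")

# Time the predict() call itself.
start = time.perf_counter()
images_predictions = model.predict("my_video.mp4")  # hypothetical input video
predict_time = time.perf_counter() - start

# Time a plain loop over the returned predictions, counting frames as we go.
n_frames = 0
start = time.perf_counter()
for image_prediction in images_predictions:
    bboxes = image_prediction.prediction.bboxes_xyxy
    n_frames += 1
iterate_time = time.perf_counter() - start

print("predict FPS:", n_frames / predict_time)
print("iterate FPS:", n_frames / iterate_time)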
I have the same bug with the following code:
# Imports needed by this snippet (dxcam for screen capture; utils is the reporter's own helper module).
import time
import dxcam
import torch
import utils
from super_gradients.common.object_names import Models
from super_gradients.training import models

def capture():
    cam = dxcam.create()
    frame_counter = 0
    start_time = time.perf_counter()
    cam.start(target_fps=144)
    while True:
        image = cam.get_latest_frame()
        if image is not None:  # New frame
            prediction = model.predict(image)
            bboxes, confidence, labels, class_names = get_prediction_info(prediction)
            # draw_boxes(image, bboxes)
            frame_counter += 1
        else:
            print("No frame")
        print("FPS: ", frame_counter / (time.perf_counter() - start_time))

def load_model():
    weights_path = './checkpoints/ckpt_best.pth'
    device = 'cuda' if torch.cuda.is_available() else "cpu"
    model = models.get(Models.YOLO_NAS_S, checkpoint_path=weights_path, num_classes=2)
    model = model.to(device)
    utils.print_info("Selected device: " + device)
    return model

def get_prediction_info(predictions):
    for image_prediction in predictions:
        class_names = image_prediction.class_names
        labels = image_prediction.prediction.labels
        confidence = image_prediction.prediction.confidence
        bboxes = image_prediction.prediction.bboxes_xyxy
    return bboxes, confidence, labels, class_names

if __name__ == '__main__':
    model = load_model()
    capture()
With this code I get 2 FPS, but when I comment out the line bboxes, confidence, labels, class_names = get_prediction_info(prediction) in the capture() function, I get 60 FPS.
It seems like there's a lot of overhead in the model.predict() function. I was able to put together this function based on the predict code to speed up inference for single frames. You can probably get better speed if you write it yourself from scratch, but I was able to go from 15 FPS to 20 FPS.
NOTE: the model must be in eval() mode or this function will throw errors.
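For context, a minimal setup sketch preceding the pipeline construction below (the checkpoint path and class count are placeholders):

import torch
from super_gradients.common.object_names import Models
from super_gradients.training import models

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = models.get(Models.YOLO_NAS_S, checkpoint_path='./checkpoints/ckpt_best.pth', num_classes=2)
model = model.to(device).eval()  # eval() is required before using the pipeline below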
from super_gradients.training.pipelines.pipelines import DetectionPipeline

# make sure to set IOU and confidence in the pipeline constructor
pipeline = DetectionPipeline(
    model=model,
    image_processor=model._image_processor,
    post_prediction_callback=model.get_post_prediction_callback(iou=0.25, conf=0.30),
    class_names=model._class_names,
)
def get_prediction(image_in, pipeline, model, device):
    '''Obtains a DetectionPrediction object via the pipeline from a single input RGB image.'''
    # Preprocess
    preprocessed_image, processing_metadata = pipeline.image_processor.preprocess_image(image=image_in.copy())
    # Predict
    with torch.no_grad():
        torch_input = torch.Tensor(preprocessed_image).unsqueeze(0).to(device)
        model_output = model(torch_input)
        prediction = pipeline._decode_model_output(model_output, model_input=torch_input)
    # Postprocess
    return pipeline.image_processor.postprocess_predictions(predictions=prediction[0], metadata=processing_metadata)
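A hypothetical call site for illustration (the image path is a placeholder; the attributes mirror the ones accessed elsewhere in this thread):

import cv2

frame = cv2.imread('frame.jpg')                 # placeholder image path
frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)  # get_prediction expects an RGB image
detection = get_prediction(frame, pipeline, model, device)
print(detection.bboxes_xyxy, detection.confidence, detection.labels)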
Hi @olyashok, @Naofel-eal, @itberrios, I am not fully sure which performance you are comparing to. It seems like you are referring to Experiment 1 (see below), but as you can see, iterating doesn't take time at all; it is just going over an already computed list of objects.
Experiment 1
import time
import torch
from super_gradients.common.object_names import Models
from super_gradients.training import models

# Note that currently only YoloX, PPYoloE and YOLO-NAS are supported.
model = models.get(Models.YOLO_NAS_L, pretrained_weights="coco")
# We want to use cuda if available to speed up inference.
model = model.to("cuda" if torch.cuda.is_available() else "cpu")

start_predict = time.perf_counter()
images_predictions = model.predict(
    "../../../../documentation/source/images/examples/pose_elephant_flip.gif",
)
print("Predict: ", time.perf_counter() - start_predict)

start_iterate = time.perf_counter()
for image_prediction in images_predictions:
    class_names = image_prediction.class_names
    labels = image_prediction.prediction.labels
    confidence = image_prediction.prediction.confidence
    bboxes = image_prediction.prediction.bboxes_xyxy
    for i, (label, conf, bbox) in enumerate(zip(labels, confidence, bboxes)):
        pass
print("Iterate: ", time.perf_counter() - start_iterate)
print(f"Over: {len(images_predictions)} predictions")
Predict: 3.76509415009059
Iterate: 0.00015927606727927923
Over: 77 predictions
Experiment 2
Predicting image by image
import time
import torch
from super_gradients.common.object_names import Models
from super_gradients.training import models
from super_gradients.training.utils.media.video import load_video

# Note that currently only YoloX, PPYoloE and YOLO-NAS are supported.
model = models.get(Models.YOLO_NAS_L, pretrained_weights="coco")
# We want to use cuda if available to speed up inference.
model = model.to("cuda" if torch.cuda.is_available() else "cpu")

frames, _fps = load_video("../../../../documentation/source/images/examples/pose_elephant_flip.gif")

start = time.perf_counter()
for frame in frames:
    image_prediction = model.predict(frame)[0]  # Running on batch = 1, so taking the first prediction.
    class_names = image_prediction.class_names
    labels = image_prediction.prediction.labels
    confidence = image_prediction.prediction.confidence
    bboxes = image_prediction.prediction.bboxes_xyxy
    for i, (label, conf, bbox) in enumerate(zip(labels, confidence, bboxes)):
        pass
print("Iterate: ", time.perf_counter() - start)
print(f"Over: {len(frames)} predictions")
Iterate: 5.154466142994352
Over: 77 predictions
In case you were running something similar to this experiment, it makes sense that it is slower, since images/frames are processed one at a time (a batched alternative is sketched below).
I would really appreciate it if you could provide a minimal benchmark snippet to help me understand what you are referring to exactly, so I can improve the predict implementation accordingly :)
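In the meantime, for reference, frame-by-frame calls like in Experiment 2 can be avoided by passing the whole list of frames to a single predict() call; a sketch reusing Experiment 2's frames list (timings will vary per setup):

start = time.perf_counter()
# frames is the list of numpy arrays returned by load_video() in Experiment 2
images_predictions = model.predict(frames)
for image_prediction in images_predictions:
    bboxes = image_prediction.prediction.bboxes_xyxy
print("Batched predict + iterate: ", time.perf_counter() - start)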
In my case, with a pre-trained model (average_model.pth):
model = models.get('yolo_nas_s', num_classes=1, checkpoint_path='average_model.pth')
start_predict = time.perf_counter()
images_predictions = model.predict(path_image, iou=model_iou, conf=model_conf)
print("Predict: ", time.perf_counter() - start_predict)

def decode_image_predictions(images_predictions):
    all_preds = []
    start_iterate = time.perf_counter()
    for image_prediction in images_predictions:
        class_names = image_prediction.class_names
        labels = image_prediction.prediction.labels
        confidence = image_prediction.prediction.confidence
        bboxes = image_prediction.prediction.bboxes_xyxy
        # for i, (label, conf, bbox) in enumerate(zip(labels, confidence, bboxes)):
        #     all_preds.append(....)
    print("Iterate: ", time.perf_counter() - start_iterate)
    print(f"Over: {len(images_predictions)} predictions")
    return all_preds
Predict: 0.11403741594403982
Iterate: 7.749675393104553
TypeError: object of type 'generator' has no len()
Update: this problem seems to happen when running the code after installing the package (super-gradients 3.1.1). However, when running the code using the sources from the repository, it is OK.
Hi @albertofernandezvillan ,
You just raised a very good point that didn't cross my mind. In 3.1.1, the predictions were not properly cast to a list and were instead stored as a generator (see issue https://github.com/Deci-AI/super-gradients/issues/956).
Iterating over the predictions would actually trigger the prediction processing, which is why in that case:
- Predict is fast
- Iterating is slow
- you get TypeError: object of type 'generator' has no len() when calling len(images_predictions)
But as you mentioned, we fixed it in the repository. So if you clone it or install SG with pip install git+https://github.com/Deci-AI/super-gradients, the processing will be done when calling model.predict(), as in my example. You will also be able to properly index the predictions (i.e. do things like images_predictions[1]) or check the length with len(images_predictions).
This change will be in the next release.
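In the meantime, a minimal workaround sketch for 3.1.1 (illustrative; it simply materializes the generator once, so the heavy processing happens up front instead of on every later use):

images_predictions = model.predict(path_image, iou=0.5, conf=0.3)
# Consuming the generator here triggers the per-image processing a single time.
images_predictions = list(images_predictions)
print(len(images_predictions))      # len() and indexing now work
first_prediction = images_predictions[0]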
Great!
Hi, just an update. Now, prediction seems slow. Please check with two different PCs (CPU only, no GPU):
conda create --name test-super-grads python=3.10.11
pip install git+https://github.com/Deci-AI/super-gradients
# Imports needed by this snippet.
import os
import time
from super_gradients.training import models

def test_inference():
    path_model = "./average_model.pth"
    model = models.get('yolo_nas_s', num_classes=1, checkpoint_path=path_model)
    path_images = "./test_images"
    list_images = os.listdir(path_images)
    for img_name in list_images:
        path_image = os.path.join(path_images, img_name)
        start_predict = time.perf_counter()
        images_predictions = model.predict(path_image, iou=0.5, conf=0.3)
        print("Predict: ", time.perf_counter() - start_predict)
        # images_predictions.show()
        start_iterate = time.perf_counter()
        images_predictions = list(images_predictions)
        for image_prediction in images_predictions:
            class_names = image_prediction.class_names
            labels = image_prediction.prediction.labels
            confidence = image_prediction.prediction.confidence
            bboxes = image_prediction.prediction.bboxes_xyxy
            # for i, (label, conf, bbox) in enumerate(zip(labels, confidence, bboxes)):
            #     print(label, conf, bbox)
        print("Iterate: ", time.perf_counter() - start_iterate)
        print(f"Over: {len(images_predictions)} predictions")
Results:
PC1: 11th Gen Intel(R) Core(TM) i7-11850H @ 2.50GHz with 16GB RAM
Predict: 1.3614261000184342
Iterate: 0.0001700000138953328
Over: 1 predictions
Predict: 1.3357476000091992
Iterate: 0.0001928000128827989
...
PC2: Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz with 16GB RAM
Predict: 7.8171619940549135
Iterate: 0.0002834419719874859
Over: 1 predictions
Predict: 7.68795962119475
Iterate: 0.00028508109971880913
Over: 1 predictions
...
Are the prediction times too slow, or just OK for CPU inference?
Update:
Changing the model-loading line, for example to this one:
model = models.get("yolo_nas_s", pretrained_weights="coco", checkpoint_path="./yolo_nas_s_coco.pth")
inference is faster:
PC1: 11th Gen Intel(R) Core(TM) i7-11850H @ 2.50GHz with 16GB RAM
Predict: 0.376152400043793
Iterate: 0.00014130002819001675
Over: 1 predictions
Based on this, I tried setting torch.set_flush_denormal(True) and now inference is faster.
On PC1 the times are the same as with the COCO pretrained weights, and on PC2 they dropped to:
Predict: 1.4609976662322879
Iterate: 0.00031989580020308495
Over: 1 predictions
Predict: 1.4408147241920233
Iterate: 0.0002572271041572094
So in summary, setting torch.set_flush_denormal(True) seems to help.
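For reference, a minimal sketch of where that call goes (it is set once, before loading the model; the checkpoint and image paths are placeholders):

import torch
from super_gradients.training import models

# Flush denormal floats to zero; denormals can make CPU inference dramatically slower.
torch.set_flush_denormal(True)

model = models.get('yolo_nas_s', num_classes=1, checkpoint_path='./average_model.pth')
images_predictions = model.predict('./test_images/some_image.jpg', iou=0.5, conf=0.3)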
On my side, the prediction is still very slow. I use super-gradients cloned from GitHub with an RTX 2060. I ran a test with a model trained on a custom dataset: the first prediction took 3 seconds, but the following ones take 0.1 s. Here is my code:
if __name__ == '__main__':
    # Initialize services
    logger = Logger()
    configuration = ConfigLoader(logger).load()
    object_detection_service = ObjectDetectionService(logger=logger)
    capture_service = CaptureService(configuration['capture'], logger=logger)
    frame_renderer_service = FrameRendererService(display_fps=False, logger=logger)
    for i in range(5):
        frame = capture_service.capture()
        start_time = time.perf_counter_ns()
        prediction = object_detection_service.process_frame(frame)._images_prediction_lst[0]
        end_time = time.perf_counter_ns()
        print("Time: ", (end_time - start_time) / 1000000000)
The ObjectDetectionService methods are implemented as follows:
def __init__(self, checkpoint_path='./checkpoints/ckpt_best.pth', logger=None) -> None:
    self.logger = Logger() if logger is None else logger
    self.checkpoint_path = checkpoint_path
    self.device = 'cuda' if cuda.is_available() else "cpu"
    self.model = self.load_model()
    self.logger.info(f"Successfully loaded model.")
    self.logger.info(f"Selected device: {self.device.upper()}.")
    self.logger.info(f"Object detection service initialized.")

def load_model(self):
    """
    Load the YOLO model and return it.
    """
    set_flush_denormal(True)
    self.logger.info(f"Loading model...")
    model = models.get(Models.YOLO_NAS_S, checkpoint_path=self.checkpoint_path, num_classes=2)
    model = model.to(self.device)
    return model

def process_frame(self, frame):
    """
    Run object detection using the provided model.
    """
    result = self.model.predict(frame)
    return result
The result of this program is the following:
Time: 3.389252
Time: 0.0991155
Time: 0.1113286
Time: 0.113765
Time: 0.1074544
@Naofel-eal, regarding the 3.4 s: is it always only the first batch that takes this long?
Note: in the next release, object_detection_service.process_frame(frame)[0] will be fixed, so you won't need to do object_detection_service.process_frame(frame)._images_prediction_lst[0] (you can already use it with pip install git+https://github.com/Deci-AI/super-gradients).
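If it is only the first batch, a common way to keep that one-time cost out of the measured loop is a warm-up pass before timing; a sketch reusing the names from your script (the dummy frame shape is an assumption):

import time
import numpy as np

# Hypothetical warm-up frame at the capture resolution; adjust the shape to your setup.
dummy_frame = np.zeros((720, 1280, 3), dtype=np.uint8)
object_detection_service.process_frame(dummy_frame)  # not timed

start_time = time.perf_counter_ns()
prediction = object_detection_service.process_frame(frame)
print("Time: ", (time.perf_counter_ns() - start_time) / 1000000000)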
Yes, it's only the first inference. Thank you, I will clone the repo. I converted the .pth to .onnx and then the .onnx to .engine with TensorRT to optimize the inference time. Is there any script or doc that explains how to run inference on an image from a .engine YOLO-NAS model?
We just introduced model.export() for YOLO-NAS, which simplifies the process of exporting the model and also includes any required pre/post-processing steps in the compiled graph for ease of use. You can check out the tutorial; it should cover your needs: https://github.com/Deci-AI/super-gradients/blob/master/documentation/source/models_export.md
It will very soon be added to our official documentation: https://docs.deci.ai/super-gradients/documentation/source/welcome.html
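For illustration, a minimal sketch of the export flow described in that tutorial (the output filename is a placeholder; see the linked document for the exact options):

from super_gradients.common.object_names import Models
from super_gradients.training import models

model = models.get(Models.YOLO_NAS_S, pretrained_weights="coco")
# Export to ONNX; per the tutorial, pre/post-processing is included in the graph.
export_result = model.export("yolo_nas_s.onnx")
print(export_result)  # describes the exported graph and how to run inference on it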
Exporting the model does not address the fact that .pt inference is still very slow. I get an average of 60 ms for YOLO_NAS_S vs 23 ms for YOLO_v8_S, both tested on GPU.
# cv2, torch and DetectionPipeline imports omitted; device and args are defined elsewhere in my script.
model = models.get(
    'yolo_nas_s',
    pretrained_weights="coco"
).to(device)

pipeline = DetectionPipeline(
    model=model.eval(),
    image_processor=model._image_processor,
    post_prediction_callback=model.get_post_prediction_callback(iou=args.iou, conf=args.conf),
    class_names=model._class_names,
    fuse_model=True
)

im = cv2.imread('/path/to/img.jpg')
preprocessed_image, processing_metadata = pipeline.image_processor.preprocess_image(image=im.copy())
with torch.no_grad():
    im = torch.Tensor(preprocessed_image).unsqueeze(0).to(device)
    model_output = pipeline.model(im)
    preds = pipeline._decode_model_output(model_output, model_input=im)[0]
There is still a huge computation overhead somewhere in the pipeline @Louis-Dupont
https://github.com/mikel-brostrom/yolo_tracking/discussions/1097
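One thing worth double-checking when comparing per-frame GPU latencies: CUDA calls are asynchronous, so timings taken without synchronization can be misleading. A purely illustrative sketch around the snippet above (preprocessed_image, pipeline and device come from that snippet):

import time
import torch

with torch.no_grad():
    torch_input = torch.Tensor(preprocessed_image).unsqueeze(0).to(device)
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # make sure the host-to-GPU copy has finished
    start = time.perf_counter()
    model_output = pipeline.model(torch_input)
    preds = pipeline._decode_model_output(model_output, model_input=torch_input)[0]
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # wait for all queued GPU work before stopping the timer
    print("Forward + decode: ", time.perf_counter() - start)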