
Saving Detections - Possible Issue?

n0012 opened this issue on May 14, 2024 • 6 comments

Search before asking

  • [X] I have searched the Supervision issues and found no similar feature requests.

Question

My supervision pipeline saves detection metadata, following these instructions. However, I'm seeing possible misalignment between the tracker_id values on the detections object and the custom_data fields written to each JSON entry.

json_sink.append(detections, entry)

See below for the tracker_id, which comes from the detections object, versus the custom data (the entry dictionary).

I duplicated the tracker id as tracker_id_2 within entry to verify that the custom values align with the correct detection, and I see cases where they do not.

Attached is my script, which is based on the Supervision Speed Estimation example. Please provide guidance if you see any issues with how I'm handling the JSON export.

The video I'm processing detects many vehicles simultaneously, so it's critical for the detection metadata to align correctly with the respective vehicle detections within the JSON output.

[ { "x_min": 708.8834228515625, "y_min": 360.1021728515625, "x_max": 782.1868896484375, "y_max": 417.46990966796875, "class_id": 2, "confidence": 0.8593592047691345, "tracker_id": 3, "class_name": "car", "video_file": "video_segment.mp4", "tracker_id_2": "2", "coordinate_start": "79", "coordinate_end": "113", "distance_in_feet": "34", "time": "0.5", "detection_time": "2024-05-11T15:39:57.833333", "speed_mph": "46.363624" }, { "x_min": 683.052490234375, "y_min": 223.6722412109375, "x_max": 727.069091796875, "y_max": 251.96783447265625, "class_id": 2, "confidence": 0.8138282299041748, "tracker_id": 2, "class_name": "car", "video_file": "video_segment.mp4", "tracker_id_2": "2", "coordinate_start": "79", "coordinate_end": "113", "distance_in_feet": "34", "time": "0.5", "detection_time": "2024-05-11T15:39:57.833333", "speed_mph": "46.363624" } ]

import argparse
from collections import defaultdict, deque
from distutils.util import strtobool
import os 
import datetime

import cv2
import numpy as np
from ultralytics import YOLO

import supervision as sv

SOURCE = np.array([[722,201], [964,232], [876,524], [183,411]])

TARGET_WIDTH = 72
TARGET_HEIGHT = 200

TARGET = np.array(
    [
        [0, 0],
        [TARGET_WIDTH - 1, 0],
        [TARGET_WIDTH - 1, TARGET_HEIGHT - 1],
        [0, TARGET_HEIGHT - 1],
    ]
)


class ViewTransformer:
    def __init__(self, source: np.ndarray, target: np.ndarray) -> None:
        source = source.astype(np.float32)
        target = target.astype(np.float32)
        self.m = cv2.getPerspectiveTransform(source, target)

    def transform_points(self, points: np.ndarray) -> np.ndarray:
        if points.size == 0:
            return points

        reshaped_points = points.reshape(-1, 1, 2).astype(np.float32)
        transformed_points = cv2.perspectiveTransform(reshaped_points, self.m)
        return transformed_points.reshape(-1, 2)


def parse_arguments() -> argparse.Namespace:
    parser = argparse.ArgumentParser(
        description="Vehicle Speed Estimation using Ultralytics and Supervision"
    )
    parser.add_argument(
        "--source_video_path",
        required=True,
        help="Path to the source video file",
        type=str,
    )
    parser.add_argument(
        "--target_video_path",
        required=True,
        help="Path to the target video file (output)",
        type=str,
    )
    parser.add_argument(
        "--confidence_threshold",
        default=0.3,
        help="Confidence threshold for the model",
        type=float,
    )
    parser.add_argument(
        "--iou_threshold", default=0.7, help="IOU threshold for the model", type=float
    )
    parser.add_argument(
        "--inference_device",
        required=False,
        default="cpu",
        help="Inference Device, cpu Default, mps on apple silicon mac for example",
        type=str,
    )
    parser.add_argument(
        "--verbose_logging",
        type=lambda x: bool(strtobool(x)),
        default=False,
        help="Show verbose inference output",
    )

    return parser.parse_args()

def parse_video_timestamp(filename):
    timestamp_str = filename.split("_")[-1].split(".")[0]  # This assumes the format is always as expected
    return datetime.datetime.strptime(timestamp_str, "%Y%m%d%H%M%S")

def calculate_detection_timestamp(video_end_time, video_duration, frame_number, fps):
    # Calculate how many seconds before the end time this frame was taken
    seconds_from_end = video_duration - frame_number / fps
    detection_time = video_end_time - datetime.timedelta(seconds=seconds_from_end)
    return detection_time

if __name__ == "__main__":
    args = parse_arguments()

    video_info = sv.VideoInfo.from_video_path(video_path=args.source_video_path)
    video_file_name = os.path.basename(args.source_video_path)
    video_file_name_without_ext, _ = os.path.splitext(video_file_name)
    video_end_time = parse_video_timestamp(video_file_name_without_ext)
    
    model = YOLO("yolov8x.pt")

    byte_track = sv.ByteTrack(
        frame_rate=video_info.fps, track_activation_threshold=args.confidence_threshold
    )

    thickness = sv.calculate_optimal_line_thickness(
        resolution_wh=video_info.resolution_wh
    )
    text_scale = sv.calculate_optimal_text_scale(resolution_wh=video_info.resolution_wh)
    bounding_box_annotator = sv.BoundingBoxAnnotator(thickness=thickness)
    label_annotator = sv.LabelAnnotator(
        text_scale=text_scale,
        text_thickness=thickness,
        text_position=sv.Position.BOTTOM_CENTER,
    )
    trace_annotator = sv.TraceAnnotator(
        thickness=thickness,
        trace_length=video_info.fps * 2,
        position=sv.Position.BOTTOM_CENTER,
    )

    frame_generator = sv.get_video_frames_generator(source_path=args.source_video_path)

    fps = video_info.fps
    total_frames = video_info.total_frames
    video_duration = total_frames / fps

    polygon_zone = sv.PolygonZone(polygon=SOURCE)
    view_transformer = ViewTransformer(source=SOURCE, target=TARGET)

    coordinates = defaultdict(lambda: deque(maxlen=video_info.fps))

    frame_number = 0

    with sv.VideoSink(args.target_video_path, video_info) as sink:
        with sv.JSONSink("output/test.json") as json_sink:
            for frame in frame_generator:
                frame_number += 1

                detection_time = calculate_detection_timestamp(video_end_time, video_duration, frame_number, fps)
                detection_time_str = detection_time.strftime("%Y-%m-%d %H:%M:%S")

                #result = model(frame)[0]
                result = model(frame, device=args.inference_device, verbose=args.verbose_logging)[0]
                detections = sv.Detections.from_ultralytics(result)
                detections = detections[detections.confidence > args.confidence_threshold]
                detections = detections[polygon_zone.trigger(detections)]
                detections = detections.with_nms(threshold=args.iou_threshold)
                detections = byte_track.update_with_detections(detections=detections)

                points = detections.get_anchors_coordinates(
                    anchor=sv.Position.BOTTOM_CENTER
                )
                points = view_transformer.transform_points(points=points).astype(int)

                for tracker_id, [_, y] in zip(detections.tracker_id, points):
                    coordinates[tracker_id].append(y)

                labels = []
                for tracker_id in detections.tracker_id:

                    time = len(coordinates[tracker_id]) / video_info.fps
                    coordinate_start = coordinates[tracker_id][-1]
                    coordinate_end = coordinates[tracker_id][0]

                    distance_in_feet = abs(coordinate_start - coordinate_end)

                    if len(coordinates[tracker_id]) < video_info.fps / 2:
                        labels.append(f"#{tracker_id}")
                    else:
                        # Speed in feet per second, converted to mph
                        # (1 ft/s = 0.681818 mph).
                        speed_fps = distance_in_feet / time
                        speed = speed_fps * 0.681818
                        labels.append(f"#{tracker_id} {int(speed)} mph")
                        entry = {
                                "video_file": video_file_name,
                                "tracker_id_2": str(tracker_id),
                                "coordinate_start": str(coordinate_start),
                                "coordinate_end": str(coordinate_end),
                                "distance_in_feet": str(distance_in_feet),
                                "time": str(time),
                                "detection_time": detection_time.isoformat(),  # Store as ISO 8601 string
                                "speed_mph": str(speed)
                            }
                        json_sink.append(detections, entry)

                annotated_frame = frame.copy()
                annotated_frame = trace_annotator.annotate(
                    scene=annotated_frame, detections=detections
                )
                annotated_frame = bounding_box_annotator.annotate(
                    scene=annotated_frame, detections=detections
                )
                annotated_frame = label_annotator.annotate(
                    scene=annotated_frame, detections=detections, labels=labels
                )

                sink.write_frame(annotated_frame)
                #cv2.imshow("frame", annotated_frame)
                #if cv2.waitKey(1) & 0xFF == ord("q"):
                #    break
        cv2.destroyAllWindows()

Additional

No response

n0012 avatar May 14 '24 14:05 n0012

Hi @n0012 👋🏻 json_sink.append should be called only once for each detections object. The problem is that you call json_sink.append while looping over detections.tracker_id.

...
for tracker_id in detections.tracker_id:
   ...
   json_sink.append(detections, entry)
   ...

If you do it like this, you effectively save all your detections multiple times into the same JSON file. Try this:

...
for tracker_id in detections.tracker_id:
   ...
   json_sink.append(detections[detections.tracker_id == tracker_id], entry)
   ...
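For context, sv.Detections supports NumPy-style boolean-mask indexing, so the filter above yields a detections object containing only the row for the current tracker, and the custom entry appended alongside it can only describe that row. A minimal sketch, reusing the variable names from the script above:

# detections.tracker_id is a NumPy array, so comparing it to a scalar
# produces a boolean mask; indexing Detections with that mask keeps
# only the matching row.
mask = detections.tracker_id == tracker_id
single_detection = detections[mask]

# Each append now carries exactly one detection, so the custom fields
# in `entry` line up with the right tracker.
json_sink.append(single_detection, entry)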

SkalskiP avatar May 14 '24 20:05 SkalskiP

Thank you, that helps explain it! I'll give this a try and appreciate the support.

n0012 avatar May 14 '24 22:05 n0012

No worries ;) I'm happy to help. I'm closing the issue, but if you have more questions, don't hesitate to reach out.

SkalskiP avatar May 15 '24 13:05 SkalskiP

The updated script works on 22 input video files but fails on 65 mid-processing. Would you mind taking a look? I'm using FFmpeg to capture the videos from a live stream; I can share that info too.

python "./utils/helpers.py"
--source_video_path "./input/video_20240511154159.mp4"
--inference_device cuda
--confidence_threshold 0.3
--iou_threshold 0.5
--verbose_logging False
--source_points 722,201,964,232,876,524,183,411
--target_width_ft 72
--target_height_ft 200

Error:

Traceback (most recent call last):
  File "/home/dev/computer-vision/flowvision/./utils/helpers.py", line 269, in <module>
    annotated_frame = trace_annotator.annotate(
                      ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dev/.venv/lib/python3.11/site-packages/supervision/utils/conversion.py", line 21, in wrapper
    return annotate_func(self, scene, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/dev/.venv/lib/python3.11/site-packages/supervision/annotators/core.py", line 1297, in annotate
    self.trace.put(detections)
  File "/home/dev/.venv/lib/python3.11/site-packages/supervision/annotators/utils.py", line 116, in put
    self.tracker_id = self.tracker_id[filtering_mask]
                      ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^
IndexError: boolean index did not match indexed array along dimension 0; dimension is 158 but corresponding boolean dimension is 159

n0012 avatar May 15 '24 23:05 n0012

Hey @n0012 👋,

I'll be helping @SkalskiP with this one.

I can't reproduce it on supervision==0.20.0 (latest release) or supervision==0.21.0.rc4.

Which version are you using? You can find out by running pip freeze | grep supervision.
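
Alternatively, from Python — the package exposes its installed version string:

import supervision as sv

# Print the installed supervision version, e.g. "0.20.0".
print(sv.__version__)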

If it's below 0.20.0, I recommend first trying pip install --upgrade supervision. If that doesn't help, try pip install supervision==0.21.0.rc4. We've had a handful of tracker issues lately, and the pre-release version fixed those.

LinasKo avatar May 16 '24 07:05 LinasKo

Thanks @LinasKo. I'm running supervision==0.20.0. Let me try upgrading to supervision==0.21.0.rc4 and give it another try. Will report back.

n0012 avatar May 16 '24 13:05 n0012

I can confirm that 0.21.0.rc4 solved my issue. All videos are now processing without any tracker errors. Thanks!

n0012 avatar May 16 '24 14:05 n0012

Glad it helped!

Do bear in mind that this is a release-candidate version. If you wish to keep up with future updates, feel free to change the pip install back to plain supervision around mid-June; we'll certainly have these changes released by then.

LinasKo avatar May 16 '24 19:05 LinasKo