Saving Detections - Possible Issue?
Search before asking
- [X] I have searched the Supervision issues and found no similar feature requests.
Question
My supervision pipeline saves detection metadata, per these instructions. However, I'm seeing possible misalignment between the tracker_id field of the detections object and the custom_data attributes written to each JSON entry:
json_sink.append(detections, entry)
See below for the tracker_id, which comes from the detections object, versus the custom data (the entry dictionary). I re-added the tracker id as tracker_id_2 within entry to verify that the custom values align with the correct detection, and I see cases where they do not: in the sample below, the detection with tracker_id 3 carries tracker_id_2 "2".
Attached is my script, which is based on the Supervision Speed Estimation example. Please provide guidance if you see any issues with how I'm handling the JSON export.
The video I'm processing contains many vehicles at once, so it's critical that the detection metadata in the JSON output aligns with the respective vehicle detections.
[ { "x_min": 708.8834228515625, "y_min": 360.1021728515625, "x_max": 782.1868896484375, "y_max": 417.46990966796875, "class_id": 2, "confidence": 0.8593592047691345, "tracker_id": 3, "class_name": "car", "video_file": "video_segment.mp4", "tracker_id_2": "2", "coordinate_start": "79", "coordinate_end": "113", "distance_in_feet": "34", "time": "0.5", "detection_time": "2024-05-11T15:39:57.833333", "speed_mph": "46.363624" }, { "x_min": 683.052490234375, "y_min": 223.6722412109375, "x_max": 727.069091796875, "y_max": 251.96783447265625, "class_id": 2, "confidence": 0.8138282299041748, "tracker_id": 2, "class_name": "car", "video_file": "video_segment.mp4", "tracker_id_2": "2", "coordinate_start": "79", "coordinate_end": "113", "distance_in_feet": "34", "time": "0.5", "detection_time": "2024-05-11T15:39:57.833333", "speed_mph": "46.363624" } ]
import argparse
from collections import defaultdict, deque
from distutils.util import strtobool
import os
import datetime
import cv2
import numpy as np
from ultralytics import YOLO
import supervision as sv
SOURCE = np.array([[722, 201], [964, 232], [876, 524], [183, 411]])

TARGET_WIDTH = 72
TARGET_HEIGHT = 200

TARGET = np.array(
    [
        [0, 0],
        [TARGET_WIDTH - 1, 0],
        [TARGET_WIDTH - 1, TARGET_HEIGHT - 1],
        [0, TARGET_HEIGHT - 1],
    ]
)
class ViewTransformer:
    def __init__(self, source: np.ndarray, target: np.ndarray) -> None:
        source = source.astype(np.float32)
        target = target.astype(np.float32)
        self.m = cv2.getPerspectiveTransform(source, target)

    def transform_points(self, points: np.ndarray) -> np.ndarray:
        if points.size == 0:
            return points
        reshaped_points = points.reshape(-1, 1, 2).astype(np.float32)
        transformed_points = cv2.perspectiveTransform(reshaped_points, self.m)
        return transformed_points.reshape(-1, 2)
def parse_arguments() -> argparse.Namespace:
    parser = argparse.ArgumentParser(
        description="Vehicle Speed Estimation using Ultralytics and Supervision"
    )
    parser.add_argument(
        "--source_video_path",
        required=True,
        help="Path to the source video file",
        type=str,
    )
    parser.add_argument(
        "--target_video_path",
        required=True,
        help="Path to the target video file (output)",
        type=str,
    )
    parser.add_argument(
        "--confidence_threshold",
        default=0.3,
        help="Confidence threshold for the model",
        type=float,
    )
    parser.add_argument(
        "--iou_threshold", default=0.7, help="IOU threshold for the model", type=float
    )
    parser.add_argument(
        "--inference_device",
        required=False,
        default="cpu",
        help="Inference device: cpu by default, e.g. mps on an Apple Silicon Mac",
        type=str,
    )
    parser.add_argument(
        "--verbose_logging",
        type=lambda x: bool(strtobool(x)),
        default=False,
        help="Show verbose inference output",
    )
    return parser.parse_args()
def parse_video_timestamp(filename):
    # Assumes the filename always ends with _YYYYMMDDHHMMSS
    timestamp_str = filename.split("_")[-1].split(".")[0]
    return datetime.datetime.strptime(timestamp_str, "%Y%m%d%H%M%S")

def calculate_detection_timestamp(video_end_time, video_duration, frame_number, fps):
    # Calculate how many seconds before the end time this frame was taken
    seconds_from_end = video_duration - frame_number / fps
    detection_time = video_end_time - datetime.timedelta(seconds=seconds_from_end)
    return detection_time
if __name__ == "__main__":
    args = parse_arguments()
    video_info = sv.VideoInfo.from_video_path(video_path=args.source_video_path)
    video_file_name = os.path.basename(args.source_video_path)
    video_file_name_without_ext, _ = os.path.splitext(video_file_name)
    video_end_time = parse_video_timestamp(video_file_name_without_ext)

    model = YOLO("yolov8x.pt")

    byte_track = sv.ByteTrack(
        frame_rate=video_info.fps, track_activation_threshold=args.confidence_threshold
    )

    thickness = sv.calculate_optimal_line_thickness(
        resolution_wh=video_info.resolution_wh
    )
    text_scale = sv.calculate_optimal_text_scale(resolution_wh=video_info.resolution_wh)
    bounding_box_annotator = sv.BoundingBoxAnnotator(thickness=thickness)
    label_annotator = sv.LabelAnnotator(
        text_scale=text_scale,
        text_thickness=thickness,
        text_position=sv.Position.BOTTOM_CENTER,
    )
    trace_annotator = sv.TraceAnnotator(
        thickness=thickness,
        trace_length=video_info.fps * 2,
        position=sv.Position.BOTTOM_CENTER,
    )

    frame_generator = sv.get_video_frames_generator(source_path=args.source_video_path)

    fps = video_info.fps
    total_frames = video_info.total_frames
    video_duration = total_frames / fps

    polygon_zone = sv.PolygonZone(polygon=SOURCE)
    view_transformer = ViewTransformer(source=SOURCE, target=TARGET)

    coordinates = defaultdict(lambda: deque(maxlen=video_info.fps))

    frame_number = 0
    with sv.VideoSink(args.target_video_path, video_info) as sink:
        with sv.JSONSink("output/test.json") as json_sink:
            for frame in frame_generator:
                frame_number += 1
                detection_time = calculate_detection_timestamp(
                    video_end_time, video_duration, frame_number, fps
                )
                detection_time_str = detection_time.strftime("%Y-%m-%d %H:%M:%S")

                # result = model(frame)[0]
                result = model(
                    frame, device=args.inference_device, verbose=args.verbose_logging
                )[0]
                detections = sv.Detections.from_ultralytics(result)
                detections = detections[detections.confidence > args.confidence_threshold]
                detections = detections[polygon_zone.trigger(detections)]
                detections = detections.with_nms(threshold=args.iou_threshold)
                detections = byte_track.update_with_detections(detections=detections)

                points = detections.get_anchors_coordinates(
                    anchor=sv.Position.BOTTOM_CENTER
                )
                points = view_transformer.transform_points(points=points).astype(int)

                for tracker_id, [_, y] in zip(detections.tracker_id, points):
                    coordinates[tracker_id].append(y)

                labels = []
                for tracker_id in detections.tracker_id:
                    time = len(coordinates[tracker_id]) / video_info.fps
                    coordinate_start = coordinates[tracker_id][-1]
                    coordinate_end = coordinates[tracker_id][0]
                    distance_in_feet = abs(coordinate_start - coordinate_end)
                    if len(coordinates[tracker_id]) < video_info.fps / 2:
                        labels.append(f"#{tracker_id}")
                    else:
                        coordinate_start = coordinates[tracker_id][-1]
                        coordinate_end = coordinates[tracker_id][0]
                        distance = abs(coordinate_start - coordinate_end)
                        time = len(coordinates[tracker_id]) / video_info.fps
                        # speed = distance / time * 3.6
                        # labels.append(f"#{tracker_id} {int(speed)} km/h")
                        # Calculate speed in feet per second
                        speed_fps = distance_in_feet / time
                        # Convert speed to mph (1 ft/s = 0.681818 mph)
                        speed = speed_fps * 0.681818
                        labels.append(f"#{tracker_id} {int(speed)} mph")
                        entry = {
                            "video_file": video_file_name,
                            "tracker_id_2": str(tracker_id),
                            "coordinate_start": str(coordinate_start),
                            "coordinate_end": str(coordinate_end),
                            "distance_in_feet": str(distance_in_feet),
                            "time": str(time),
                            "detection_time": detection_time.isoformat(),  # Store as ISO 8601 string
                            "speed_mph": str(speed),
                        }
                        json_sink.append(detections, entry)

                annotated_frame = frame.copy()
                annotated_frame = trace_annotator.annotate(
                    scene=annotated_frame, detections=detections
                )
                annotated_frame = bounding_box_annotator.annotate(
                    scene=annotated_frame, detections=detections
                )
                annotated_frame = label_annotator.annotate(
                    scene=annotated_frame, detections=detections, labels=labels
                )
                sink.write_frame(annotated_frame)

                # cv2.imshow("frame", annotated_frame)
                # if cv2.waitKey(1) & 0xFF == ord("q"):
                #     break

    cv2.destroyAllWindows()
Additional
No response
Hi @n0012 👋🏻 json_sink.append should be called only once for each detections object. The problem is that you call json_sink.append while looping over detections.tracker_id.
...
for tracker_id in detections.tracker_id:
    ...
    json_sink.append(detections, entry)
...
If you do it like this, you effectively save all your detections multiple times into the same JSON file. Try this:
...
for tracker_id in detections.tracker_id:
    ...
    json_sink.append(detections[detections.tracker_id == tracker_id], entry)
...
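Applied to your script, the relevant part of the export loop would look something like this (a sketch reusing the variable names from your script; the filtered append is the only functional change):

for tracker_id in detections.tracker_id:
    # ... per-tracker speed computation as above ...
    entry = {
        "video_file": video_file_name,
        "tracker_id_2": str(tracker_id),
        "speed_mph": str(speed),
        # ... remaining per-tracker fields unchanged ...
    }
    # Keep only the single detection matching this tracker_id, so each
    # appended row carries the custom_data computed for that tracker.
    json_sink.append(detections[detections.tracker_id == tracker_id], entry)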
Thank you, that helps explain it! I'll give this a try and appreciate the support.
No worries ;) I'm happy to help. I'm closing the issue, but if you have more questions, don't hesitate to reach out.
The updated script works on 22 input video files but fails mid-processing on 65. Would you mind taking a look? I'm using FFmpeg to capture the videos from a live stream. I can share that setup too.
python "./utils/helpers.py"
--source_video_path "./input/video_20240511154159.mp4"
--inference_device cuda
--confidence_threshold 0.3
--iou_threshold 0.5
--verbose_logging False
--source_points 722,201,964,232,876,524,183,411
--target_width_ft 72
--target_height_ft 200
Error:
Traceback (most recent call last):
  File "/home/dev/computer-vision/flowvision/./utils/helpers.py", line 269, in
Hey @n0012 :wave:,
I'll be helping @SkalskiP with this one.
I can't reproduce it on supervision==0.20.0 (latest release) or supervision==0.21.0.rc4.
Which version are you using? You can find out by running pip freeze | grep supervision.
If it's below 0.20.0, I recommend first trying pip install -U supervision.
If that doesn't help, try pip install supervision==0.21.0.rc4. We've had a handful of tracker issues lately, and the pre-release version fixed those.
Thanks @LinasKo. I'm running supervision==0.20.0. Let me upgrade to supervision==0.21.0.rc4 and give it another try. Will report back.
I can confirm that 0.21.0.rc4 solved my issue. All videos are now processing without any tracker errors. Thanks!
Glad it helped!
Do bear in mind that this is a release candidate version. If you wish to keep up with future updates, feel free to switch the pip install back to plain supervision at around mid-June. We'll certainly have these changes deployed by then.
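For example, once the fixes land in a stable release:

pip install --upgrade supervision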