
Improving PoseLandmarker multipose

Open · nickph7 opened this issue 1 year ago · 1 comment

Have I written custom code (as opposed to using a stock example script provided in MediaPipe)

No

OS Platform and Distribution

Linux Ubuntu 20.04

MediaPipe Tasks SDK version

0.10.3

Task name (e.g. Image classification, Gesture recognition etc.)

PoseLandmarker

Programming Language and version (e.g. C++, Python, Java)

Python 3.8

Describe the actual behavior

When two detected individuals approach each other closely, the detection of one person is canceled. Typically, this occurs when the second person is about 75 cm away from the first person, at a distance of 3.5 meters from the camera. Our most successful attempt at that distance was a 70 cm gap, using a detection threshold of 0.9 while keeping the presence and tracking thresholds at 0.5. As anticipated from the model card, we can achieve an even smaller distance of 50 cm by setting the detection, presence, and tracking thresholds to 0.5, 0.9, and 0.5 respectively. We explored various scenarios and variables, such as the order of detections over time and the horizontal and depth positioning within the image. Our hypothesis is that the individual whom the model identifies with greater confidence is retained, while the other one is discarded.
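
For reference, the two combinations above map onto the task options as follows (a minimal sketch reusing the option names from the full script further down; running mode and callback are omitted):

```python
from mediapipe.tasks import python
from mediapipe.tasks.python import vision

base_options = python.BaseOptions(model_asset_path="pose_landmarker_full.task")

# Combination 1: ~70 cm gap at 3.5 m, using a high detection threshold.
options_high_detection = vision.PoseLandmarkerOptions(
    base_options=base_options,
    num_poses=4,
    min_pose_detection_confidence=0.9,
    min_pose_presence_confidence=0.5,
    min_tracking_confidence=0.5)

# Combination 2: ~50 cm gap, using a high presence threshold instead.
options_high_presence = vision.PoseLandmarkerOptions(
    base_options=base_options,
    num_poses=4,
    min_pose_detection_confidence=0.5,
    min_pose_presence_confidence=0.9,
    min_tracking_confidence=0.5)
```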

Describe the expected behaviour

Other models, such as PoseNet or MoveNet, do not suffer from the same level of detection cancellation. In the case of MoveNet, two individuals can stand shoulder to shoulder with their detections remaining intact. Likewise, we utilize PoseNet in several of our projects, and we encounter this issue less frequently.

Our goal is to enhance the performance of the PoseLandmarker, aiming for parity with, or ideally superiority over, PoseNet or MoveNet. One aspect we would like to understand better is the presence threshold parameter: based on the current description in the documentation, it is not clear to us how this parameter affects the detections when paired with the detection and tracking thresholds. Additionally, we are open to experimenting with the model and the graph in the lower-level C++ code, as we have some experience tweaking calculators and graphs. Any guidance on areas to investigate and suggested approaches for experimentation would be greatly appreciated.
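
One way we plan to probe the presence threshold is to log the per-landmark confidences for each detection; a rough sketch along these lines, assuming the optional presence and visibility fields of the returned NormalizedLandmark objects are populated by the task:

```python
def log_confidences(detection_result):
    # Print the weakest per-landmark presence/visibility for each detected person,
    # to see which detection survives when two people stand close together.
    # Assumes the optional `presence`/`visibility` fields are filled in.
    for person_idx, pose_landmarks in enumerate(detection_result.pose_landmarks):
        presences = [lm.presence for lm in pose_landmarks if lm.presence is not None]
        visibilities = [lm.visibility for lm in pose_landmarks if lm.visibility is not None]
        if presences and visibilities:
            print(f"person {person_idx}: min presence={min(presences):.2f}, "
                  f"min visibility={min(visibilities):.2f}")
```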

We have found the PoseLandmarker to be remarkably more stable and accurate than what we currently achieve with PoseNet. Moreover, we value the capability to get the segmentation mask for each detection, which is why we would like to improve the PoseLandmarker task for multiple users.
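
When we do enable the masks, we read them roughly like this (a sketch, assuming output_segmentation_masks=True is set in the options and that segmentation_masks is a list of mp.Image aligned with pose_landmarks):

```python
def masks_from_result(detection_result):
    # One single-channel float mask per detected pose, values roughly in [0, 1].
    # Requires output_segmentation_masks=True in PoseLandmarkerOptions.
    if detection_result.segmentation_masks is None:
        return []
    return [mask.numpy_view() for mask in detection_result.segmentation_masks]
```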

Standalone code/steps you may have used to try to get what you need

Code snippets taken from guide and examples:


```python
import cv2
import numpy as np
import mediapipe as mp
from mediapipe.tasks import python
from mediapipe.tasks.python import vision
from mediapipe.framework.formats import landmark_pb2

# Capture resolutions: 640x480, 1280x720, 1920x1080
model_path = "pose_landmarker_full.task"
window_name = "MediaPipe Pose Landmark"
device_id = 4
width = 1280
height = 720
fps = 30
num_poses = 4
min_pose_detection_confidence = 0.5
min_pose_presence_confidence = 0.5
min_tracking_confidence = 0.5


def draw_landmarks_on_image(rgb_image, detection_result):
    pose_landmarks_list = detection_result.pose_landmarks
    annotated_image = np.copy(rgb_image)

    # Loop through the detected poses to visualize.
    for idx in range(len(pose_landmarks_list)):
        pose_landmarks = pose_landmarks_list[idx]

        pose_landmarks_proto = landmark_pb2.NormalizedLandmarkList()
        pose_landmarks_proto.landmark.extend([
            landmark_pb2.NormalizedLandmark(
                x=landmark.x,
                y=landmark.y,
                z=landmark.z) for landmark in pose_landmarks
        ])
        mp.solutions.drawing_utils.draw_landmarks(
            annotated_image,
            pose_landmarks_proto,
            mp.solutions.pose.POSE_CONNECTIONS,
            mp.solutions.drawing_styles.get_default_pose_landmarks_style())
    return annotated_image


to_window = None
last_timestamp_ms = 0


def print_result(detection_result: vision.PoseLandmarkerResult, output_image: mp.Image,
                 timestamp_ms: int):
    global to_window
    global last_timestamp_ms
    if timestamp_ms < last_timestamp_ms:
        return
    last_timestamp_ms = timestamp_ms
    # print("pose landmarker result: {}".format(detection_result))
    to_window = cv2.cvtColor(
        draw_landmarks_on_image(output_image.numpy_view(), detection_result), cv2.COLOR_RGB2BGR)


base_options = python.BaseOptions(model_asset_path=model_path)
options = vision.PoseLandmarkerOptions(
    base_options=base_options,
    running_mode=vision.RunningMode.LIVE_STREAM,
    num_poses=num_poses,
    min_pose_detection_confidence=min_pose_detection_confidence,
    min_pose_presence_confidence=min_pose_presence_confidence,
    min_tracking_confidence=min_tracking_confidence,
    output_segmentation_masks=False,
    result_callback=print_result
)

with vision.PoseLandmarker.create_from_options(options) as landmarker:
    # Use OpenCV’s VideoCapture to start capturing from the webcam.
    cap = cv2.VideoCapture(device_id, cv2.CAP_V4L2)
    cap.set(cv2.CAP_PROP_FRAME_WIDTH, width)
    cap.set(cv2.CAP_PROP_FRAME_HEIGHT, height)
    cap.set(cv2.CAP_PROP_FPS, fps)

    # Create a loop to read the latest frame from the camera using VideoCapture#read()
    while cap.isOpened():
        success, image = cap.read()
        if not success:
            print("Image capture failed.")
            break

        # Convert the frame received from OpenCV to a MediaPipe’s Image object.
        mp_image = mp.Image(
            image_format=mp.ImageFormat.SRGB,
            data=cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
        timestamp_ms = int(cv2.getTickCount() / cv2.getTickFrequency() * 1000)
        landmarker.detect_async(mp_image, timestamp_ms)

        if to_window is not None:
            cv2.imshow(window_name, to_window)

        if cv2.waitKey(1) & 0xFF == ord('q'):
            break

    cap.release()
    cv2.destroyAllWindows()
```


### Other info / Complete Logs

_No response_

nickph7 · Aug 08 '23 20:08

@nickph7,

The Pose model can only detect a single person from an image or video frame cropped to contain only one person. If you check the Limitations section on page 2 of the Pose Model Card, you will find the following information:

> Tracks only one person on scene if multiple present

However, we are marking this as a feature request and sharing it internally; the team will prioritise the work based on the discussion.

Thank you!

kuaashish · Aug 09 '23 09:08