
Gaze Detection Utilities

Open capjamesg opened this issue 2 years ago • 6 comments

Search before asking

  • [X] I have searched the Supervision issues and found no similar feature requests.

Description

The Roboflow Inference Server now supports gaze detection, but there is a lot of scaffolding code you need to write to annotate the predictions. I would love to discuss adding a gaze detection annotator to supervision. I am unsure what form the API should take.

Inference returns predictions in this form:

[{'predictions': [{'face': {'x': 940.0, 'y': 600.0, 'width': 396.0, 'height': 396.0, 'confidence': 0.8872392177581787, 'class': 'face', 'class_confidence': None, 'class_id': 0, 'tracker_id': None, 'landmarks': [{'x': 881.0, 'y': 491.0}, {'x': 1054.0, 'y': 526.0}, {'x': 966.0, 'y': 595.0}, {'x': 944.0, 'y': 680.0}, {'x': 752.0, 'y': 516.0}, {'x': 1119.0, 'y': 582.0}]}, 'yaw': 0.12384924292564392, 'pitch': -0.16699858009815216}], 'time': 0.14199933400004738, 'time_face_det': None, 'time_gaze_det': None}]

Predictions consist of an arbitrary number of face bounding boxes in the Roboflow format, a list of facial landmarks, and pitch and yaw values for use in inferring gaze.

Should we have a Gaze dataclass to ingest this? Or add a gaze value to the Detections class? I lean toward the latter, since the prediction filtering available in supervision is still applicable in gaze detection applications.
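For illustration, here is a rough sketch of what the dataclass option could look like. The class name and fields are hypothetical, not an existing supervision API; it simply parses the inference response shown above.

from dataclasses import dataclass

import numpy as np


@dataclass
class Gaze:
    # hypothetical container, not part of supervision today
    xyxy: np.ndarray        # (N, 4) face boxes in xyxy format
    confidence: np.ndarray  # (N,) face confidences
    landmarks: np.ndarray   # (N, 6, 2) facial landmarks
    yaw: np.ndarray         # (N,) radians
    pitch: np.ndarray       # (N,) radians

    @classmethod
    def from_inference(cls, response: dict) -> "Gaze":
        boxes, conf, lms, yaw, pitch = [], [], [], [], []
        for p in response["predictions"]:
            f = p["face"]
            x, y, w, h = f["x"], f["y"], f["width"], f["height"]
            # Roboflow format is center x/y plus width/height
            boxes.append([x - w / 2, y - h / 2, x + w / 2, y + h / 2])
            conf.append(f["confidence"])
            lms.append([[lm["x"], lm["y"]] for lm in f["landmarks"]])
            yaw.append(p["yaw"])
            pitch.append(p["pitch"])
        return cls(
            xyxy=np.array(boxes),
            confidence=np.array(conf),
            landmarks=np.array(lms),
            yaw=np.array(yaw),
            pitch=np.array(pitch),
        )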

Use case

The use case is to make it easy for people to plot gaze detections.

Additional

No response

Are you willing to submit a PR?

  • [X] Yes I'd like to help by submitting a PR!

capjamesg avatar Sep 19 '23 08:09 capjamesg

@capjamesg I would prefer not to put this directly into the Detections class file; instead, we should create something new for "Faces" (that file is already too long).

Here is a nicer format of what you posted from the inference server (easier for us to read :)):


[
    {
        "predictions": [
            {
                "face": {
                    "x": 940.0,
                    "y": 600.0,
                    "width": 396.0,
                    "height": 396.0,
                    "confidence": 0.8872392177581787,
                    "class": "face",
                    "class_confidence": None,
                    "class_id": 0,
                    "tracker_id": None,
                    "landmarks": [
                        {"x": 881.0, "y": 491.0},
                        {"x": 1054.0, "y": 526.0},
                        {"x": 966.0, "y": 595.0},
                        {"x": 944.0, "y": 680.0},
                        {"x": 752.0, "y": 516.0},
                        {"x": 1119.0, "y": 582.0},
                    ],
                },
                "yaw": 0.12384924292564392,
                "pitch": -0.16699858009815216,
            }
        ],
        "time": 0.14199933400004738,
        "time_face_det": None,
        "time_gaze_det": None,
    }
]



For reference to everyone else:

https://github.com/Ahmednull/L2CS-Net https://github.com/roboflow/inference/pull/20

I also tested the original repo to see the values:

GazeResultContainer(pitch=array([0.38433596, 0.3832036 ], dtype=float32), yaw=array([-0.2506285,  0.2759738], dtype=float32), bboxes=array([[243.68126  ,   8.761246 , 422.3376   , 247.67935  ],
       [ -4.6252966,  41.277515 , 113.18152  , 188.35295  ]],
      dtype=float32), landmarks=array([[[262.18356 ,  98.60664 ],
        [332.88895 ,  83.04035 ],
        [278.9314  , 135.38498 ],
        [289.23178 , 189.30411 ],
        [347.59564 , 174.8707  ]],

       [[ 31.351639,  93.41662 ],
        [ 79.39963 ,  90.40505 ],
        [ 55.32864 , 111.44813 ],
        [ 27.736622, 131.99054 ],
        [ 74.23118 , 129.81989 ]]], dtype=float32), scores=array([0.9995321, 0.7445839], dtype=float32))
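If we went the Detections route, the container above could probably be mapped like this. This is a rough sketch assuming the fields shown above; `results` is the GazeResultContainer produced by gaze_pipeline.step(frame) in the example below, and supervision does not support gaze today.

import supervision as sv

# map face boxes and scores into a standard Detections object
detections = sv.Detections(
    xyxy=results.bboxes,        # (N, 4) face boxes in xyxy format
    confidence=results.scores,  # (N,) face detection scores
)

# pitch, yaw (and landmarks) would still need a home somewhere in the API
pitch, yaw = results.pitch, results.yaw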

A more complete basic example (the code in their README is missing parts):

from l2cs import Pipeline, render
import cv2
import pathlib
import torch

# build the gaze pipeline from the pretrained L2CS-Net weights
gaze_pipeline = Pipeline(
    weights=pathlib.Path("L2CSNet_gaze360.pkl"),
    arch='ResNet50',
    device=torch.device("cpu")  # or torch.device("cuda") to run on a GPU
)

# grab a single frame from the webcam (index may be 0 on your machine)
cap = cv2.VideoCapture(1)
_, frame = cap.read()

# process the frame and visualize the estimated gaze
results = gaze_pipeline.step(frame)
print(results)
frame = render(frame, results)
cv2.imshow("frame", frame)
cv2.waitKey(0)
cap.release()
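For the annotator itself, the core drawing step is projecting yaw/pitch into an image-plane arrow. Below is a minimal sketch; the helper name and the sign/axis convention are illustrative and would need to match the model's output convention (l2cs ships its own render for this).

import cv2
import numpy as np

def draw_gaze_arrow(frame, bbox, yaw, pitch, length=150, color=(0, 0, 255)):
    # start the arrow at the centre of the face box (x1, y1, x2, y2)
    x1, y1, x2, y2 = bbox
    cx, cy = int((x1 + x2) / 2), int((y1 + y2) / 2)
    # project the gaze direction (yaw, pitch in radians) onto the image plane
    dx = -length * np.sin(yaw) * np.cos(pitch)
    dy = -length * np.sin(pitch)
    cv2.arrowedLine(frame, (cx, cy), (int(cx + dx), int(cy + dy)), color, 2)
    return frame

# e.g. for the first detected face:
# frame = draw_gaze_arrow(frame, results.bboxes[0], results.yaw[0], results.pitch[0])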

onuralpszr avatar Sep 19 '23 09:09 onuralpszr

Thank you for your research on this! We definitely need to think about the API. I can see supervision supporting landmarks, in which case Face is more appropriate since it is more generic. Gazes could be part of that API.

capjamesg avatar Sep 19 '23 09:09 capjamesg

> Thank you for your research on this! We definitely need to think about the API. I can see supervision supporting landmarks, in which case Face is more appropriate since it is more generic. Gazes could be part of that API.

Something like "FaceDetection" or "Face" is better so I wanted to add mediapipe and others I know) but I was able to make "false positive" too easy. There was a object in my house on the wall was a circular object and easily deceived by that. (object no face or something it just completely different) I am curious is there better trained one or you guys using default one in the repo from google drive ?

onuralpszr avatar Sep 19 '23 09:09 onuralpszr

My go-to is a fine-tuned object detection model for face detection (e.g. https://universe.roboflow.com/mohamed-traore-2ekkp/face-detection-mik1i).

capjamesg avatar Sep 19 '23 09:09 capjamesg

> My go-to is a fine-tuned object detection model for face detection (e.g. https://universe.roboflow.com/mohamed-traore-2ekkp/face-detection-mik1i).

I meant for "gaze detection" If I was misunderstood

onuralpszr avatar Sep 19 '23 09:09 onuralpszr

Oh my bad! We have had good results with L2CSNet_gaze360_resnet50_90bins in https://github.com/Ahmednull/L2CS-Net, the repo you looked at. I have only tested on images where the primary subject is a face.

capjamesg avatar Sep 19 '23 09:09 capjamesg

Hi everyone 👋🏻 I agree with @onuralpszr. We need to look for a more general solution, probably as part of our upcoming Key Point API.

SkalskiP avatar Jan 26 '24 09:01 SkalskiP