
How to predict the label using a webcam

junmin98 opened this issue on Sep 25, 2020 · 5 comments

I want to predict the class label using a webcam, so I grabbed a frame from the webcam and passed it to the model.


frame = torch.tensor(frame)
output = model(frame)

but I got this error:

RuntimeError: Expected 5-dimensional input for 5-dimensional weight [64, 3, 7, 7, 7], but got 3-dimensional input of size [480, 640, 3] instead

I think I should apply the spatial and temporal transforms. However, if I do the spatial transform

frame = spatial_transform(frame)

Traceback (most recent call last):
  File "test_realtime.py", line 134, in <module>
    frame = spatial_transform(frame)
  File "/home/junmin/Desktop/3D-ResNets-PyTorch/spatial_transforms.py", line 30, in __call__
    img = t(img)
  File "/home/junmin/Desktop/3D-ResNets-PyTorch/spatial_transforms.py", line 151, in __call__
    w, h = img.size
TypeError: 'builtin_function_or_method' object is not iterable

Can you tell me how to use the model to predict the class label from webcam frames?

junmin98 · Sep 25, 2020

Hello @junmin98. I did it this way.
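The error you got happens because the 3D convolutions expect a 5-dimensional input of shape (batch, channels, frames, height, width), while a single webcam frame is only (height, width, channels). A minimal shape sketch, assuming the repo defaults of 16-frame clips and 112x112 crops:

import torch

# A single webcam frame is (H, W, C) = (480, 640, 3): only 3 dimensions.
# The 3D ResNet expects (batch, channels, frames, height, width), for example:
dummy_clip = torch.zeros(1, 3, 16, 112, 112)  # assumed defaults: 16 frames, 112x112 crops
# output = model(dummy_clip)  # this matches the 5-D conv weights [64, 3, 7, 7, 7]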

First you have to buffer at least 16 frames; that's the default temporal window.

import cv2

# Create the video capture object for the default webcam
cap = cv2.VideoCapture(0)
# Frame buffer for HAR (human action recognition)
full_clip = []

# ... your code ...

while True:
    ret, frame = cap.read()

    # Keep a sliding window of the most recent 16 frames
    full_clip.append(frame)
    if len(full_clip) > 16:
        del full_clip[0]

    # ... your code ...

    # Show the current frame
    cv2.imshow("Web cam input", frame)
    if cv2.waitKey(25) & 0xFF == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()
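The append/del pair above keeps a sliding window of the most recent 16 frames, so once the buffer is full every new frame gives you an up-to-date clip without having to wait for a whole fresh batch.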

Then, when you have at least 16 frames in your list, you apply the spatial transform (a temporal transform isn't needed unless you have to reproduce the time synchronization used in training, more on that later), run the model, and take the best class output.

import torch
import torch.nn.functional as F
from PIL import Image
import numpy as np

# These transform classes come from the repo's spatial_transforms.py
from spatial_transforms import (Compose, Normalize, Resize, CenterCrop,
                                ToTensor, ScaleValue)

def get_normalize_method(mean, std, no_mean_norm, no_std_norm):
    if no_mean_norm:
        if no_std_norm:
            return Normalize([0, 0, 0], [1, 1, 1])
        else:
            return Normalize([0, 0, 0], std)
    else:
        if no_std_norm:
            return Normalize(mean, [1, 1, 1])
        else:
            return Normalize(mean, std)

def get_spatial_transform(opt):
    normalize = get_normalize_method(opt.mean, opt.std, opt.no_mean_norm,
                                     opt.no_std_norm)
    spatial_transform = [Resize(opt.sample_size)]
    if opt.inference_crop == 'center':
        spatial_transform.append(CenterCrop(opt.sample_size))
    spatial_transform.append(ToTensor())
    spatial_transform.extend([ScaleValue(opt.value_scale), normalize])
    spatial_transform = Compose(spatial_transform)
    return spatial_transform

def preprocessing(clip, spatial_transform):
    # Apply the spatial transformations
    if spatial_transform is not None:
        spatial_transform.randomize_parameters()
        # Each frame must be converted to a PIL Image before the spatial
        # transform (not the most efficient way, but it works)
        clip = [spatial_transform(Image.fromarray(np.uint8(img)).convert('RGB'))
                for img in clip]
    # Rearrange to (channels, frames, height, width), then add the batch dimension
    clip = torch.stack(clip, 0).permute(1, 0, 2, 3)
    clip = torch.stack((clip,), 0)
    return clip

def predict(clip, model, spatial_transform, classes):
    # Set the model to eval mode
    model.eval()
    # Preprocess the buffered frames into a (1, C, T, H, W) tensor
    clip = preprocessing(clip, spatial_transform)
    # Move the input to the same device as the model (CPU or GPU)
    clip = clip.to(next(model.parameters()).device)
    # Don't calculate gradients during inference
    with torch.no_grad():
        # Apply the model to the input
        outputs = model(clip)
        # Apply softmax and move the result back to the CPU
        outputs = F.softmax(outputs, dim=1).cpu()
        # Get the best class
        score, class_prediction = torch.max(outputs, 1)
    # The model outputs a class index; if you have the class name list you can
    # map it back, e.g. classes = ['jump', 'talk', 'walk', ...]
    if classes is not None:
        return score[0], classes[class_prediction[0]]
    return score[0], class_prediction[0]
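Putting the pieces together inside the capture loop could look roughly like the sketch below. It is only glue code under assumptions: opt, model and class_names are not defined in this thread and would come from your own setup (the repo's opts/model builder and the class list saved from training).

# Hypothetical glue code: `opt`, `model` and `class_names` are assumed to come
# from your own setup (opts.py defaults, a loaded checkpoint, the training class list).
spatial_transform = get_spatial_transform(opt)

while True:
    ret, frame = cap.read()
    if not ret:
        break

    full_clip.append(frame)
    if len(full_clip) > 16:
        del full_clip[0]

    # Run inference once the 16-frame buffer is full
    if len(full_clip) == 16:
        score, label = predict(full_clip, model, spatial_transform, class_names)
        cv2.putText(frame, f"{label}: {float(score):.2f}", (10, 30),
                    cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), 2)

    cv2.imshow("Web cam input", frame)
    if cv2.waitKey(25) & 0xFF == ord("q"):
        break

cap.release()
cv2.destroyAllWindows()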

About the temporal window

When you train a model you can decide whether to use 16 consecutive frames or to skip some, e.g. take 1 frame, skip 3, take another one, and so on. But there is something that is not accounted for: the processing time your machine needs to run all of this code plus the cv2 code that grabs each frame. In real-time human action detection you may only get around 10 fps, whereas at training time we worked with 30 fps videos, so if you reason purely in frame counts you will not be matching the temporal window the model was trained with.
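One rough way to compensate when your capture rate is higher than the training rate (a sketch with assumed numbers, not part of the original code) is to keep only every n-th frame so the 16-frame buffer spans about the same amount of time as a training clip:

# Sketch only: `training_fps` is an assumption about the training videos.
training_fps = 30
capture_fps = cap.get(cv2.CAP_PROP_FPS) or training_fps  # webcams often report 0
stride = max(1, round(capture_fps / training_fps))        # keep every stride-th frame

frame_idx = 0
while True:
    ret, frame = cap.read()
    if not ret:
        break
    if frame_idx % stride == 0:
        full_clip.append(frame)
        if len(full_clip) > 16:
            del full_clip[0]
    frame_idx += 1

If the effective rate is instead lower than the training rate (e.g. 10 fps), skipping cannot help; the 16-frame clip will simply cover a longer action than the model saw during training.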

guilhermesurek · Sep 26, 2020

@guilhermesurek @kenshohara does this pre-processing step still hold true if I am using the resnet50 model that was fine-tuned on the UCF101 dataset? For training and inference we use the jpg images generated for each video, and I thought something similar would have to be done for any video used for testing.

This looks like a decent way to test the videos. We only have to create an array or file with class names for mapping. The class order should be the same as the order used while training the model.

I am not sure whether the model uses only spatial transforms or temporal transforms as well during training, and am hence confused.

I am trying to predict the output on a video outside of the dataset. Can you tell me the steps to perform that?

Please advise

Let me know if I need to provide more information like training parameters

Purav-Zumkhawala · Mar 2, 2021

@guilhermesurek Hi, thank you for sharing, but I had another problem: do you know how to get model()? I get an error every time at frame = model(). Can you share the full webcam code? Thank you in advance!

TonyLi-Shu · Mar 22, 2021

@Purav-Zumkhawala, I will try to explain; let me know if I wasn't clear.

does this pre-processing step still hold true if I am using the resnet50 model that was fine-tuned on the UCF101 dataset? For training and inference we use the jpg images generated for each video, and I thought something similar would have to be done for any video used for testing.

Yes, but not all of it. You need to normalize the input the same way you normalized it in the training phase, and then convert it to a tensor. The other spatial transformations used in training are not necessary. And yes, you need to do something similar to test on other videos, but the original code doesn't have this functionality.

This looks like a decent way to test the videos. We only have to create an array or file with class names for mapping. The class order should be the same as the order used while training the model.

This is the way when testing on video files.

I am not sure whether the model uses only spatial transforms or temporal transforms as well during training, and am hence confused.

The model used both. I'm using the standard testing settings from the main code.

I am trying to predict the output on a video outside of the dataset. Can you tell me the steps to perform that?

You just need to check your temporal window, in other words the fps of your videos. If you use a 120 fps video for testing, it will have 4 times as many frames as the original 30 fps videos and you will probably get wrong labels.
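A quick back-of-the-envelope check, assuming 16-frame clips and 30 fps training videos:

# Assumed numbers: 16-frame clips, 30 fps training videos.
clip_len = 16
for fps in (30, 120):
    print(f"{fps} fps -> a 16-frame clip spans {clip_len / fps:.2f} s")
# 30 fps  -> a 16-frame clip spans 0.53 s  (what the model saw in training)
# 120 fps -> a 16-frame clip spans 0.13 s  (a 4x shorter action window)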

guilhermesurek · Mar 23, 2021

@TonyLi-Shu try to use this: https://github.com/guilhermesurek/computer-vision-framework and please, all credit goes to Kenshohara and his team.

guilhermesurek · Mar 23, 2021