Action Recognition

Open jamessmith90 opened this issue 4 years ago • 11 comments

I'm looking to recognize whether a person is walking or running. Can this be done using darknet? If yes, can you tell me what changes I need to make in the build, and what the format of the training dataset would be?

jamessmith90 avatar Dec 06 '19 11:12 jamessmith90

maybe you can use some data from here https://research.google.com/ava/

LukeAI avatar Dec 06 '19 12:12 LukeAI

@LukeAI I already have the dataset prepared. I just need the changes in the repo and training format.

jamessmith90 avatar Dec 06 '19 12:12 jamessmith90

Show some examples of your data. Are you classifying whole images, or detecting the locations of running and walking persons (maybe more than one per image)?

LukeAI avatar Dec 06 '19 12:12 LukeAI

I have coordinates and frame numbers tagged with a category: walking, running, or standing.

jamessmith90 avatar Dec 06 '19 13:12 jamessmith90

@jamessmith90

You can try using LSTM models, for example: https://github.com/AlexeyAB/darknet/files/3199770/yolo_v3_tiny_lstm.cfg.txt

Since yolo_v3_tiny_lstm.cfg.txt uses time_steps=16:

  1. train.txt should list the training images in groups of 16 consecutive images (frames) from a video that show the action (a person is walking),

    • or negative samples: 16 consecutive images (frames) from a video without the action (for example, a person sitting, or a street with no person).
    • So the number of images in train.txt should be a multiple of 16.
  2. Change this line to start_time_indexes[i] = ((random_gen() % m) / 16) * 16; (this rounds the random start index down to a multiple of 16) and recompile: https://github.com/AlexeyAB/darknet/blob/5d13aad8879e1630145bb90208db518037d707a3/src/data.c#L55

  3. Maybe you should leave the first 8 images unlabeled and label only the last 8. This is only necessary if you want to distinguish whether a person is moving or has just frozen in a pose that resembles movement. In theory, this should be clear after 8 frames.
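To make the arithmetic in step 2 concrete, here is a minimal Python sketch of the same rounding, plus a check that a train.txt length is a whole number of sequences. It is only an illustration, not darknet code: the function names `aligned_start` and `check_list_length` and the value `m = 64` are hypothetical, and `random.randrange` merely stands in for `random_gen()`.

```python
import random

SEQ_LEN = 16  # matches time_steps=16 in the cfg discussed above


def aligned_start(raw_index: int, num_images: int, seq_len: int = SEQ_LEN) -> int:
    """Mirror ((random_gen() % m) / 16) * 16 with integer (floor) division.

    raw_index plays the role of random_gen(), num_images the role of m.
    """
    return ((raw_index % num_images) // seq_len) * seq_len


def check_list_length(num_images: int, seq_len: int = SEQ_LEN) -> bool:
    """Return True when the image list holds a whole number of sequences."""
    return num_images > 0 and num_images % seq_len == 0


if __name__ == "__main__":
    m = 64  # hypothetical number of lines in train.txt
    # Every printed start index is a multiple of 16, i.e. a sequence boundary.
    starts = [aligned_start(random.randrange(10**6), m) for _ in range(5)]
    print(starts)
    print(check_list_length(m))
```

With a list length that is a multiple of 16, every aligned start leaves a full 16 frames after it, which is why the two requirements go together.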

More: https://github.com/AlexeyAB/darknet/issues/3114#issuecomment-494154586

For training: download https://pjreddie.com/media/files/yolov3-tiny.weights, then run ./darknet partial cfg/yolov3-tiny.cfg yolov3-tiny.weights yolov3-tiny.conv.14 14 to get the pre-trained file yolov3-tiny.conv.14.

Then train as usual: ./darknet detector train data/obj.data yolo_v3_tiny_lstm.cfg.txt yolov3-tiny.conv.14 -map

Then train both models on the same train/valid dataset and compare mAP:

  • LSTM-model: https://github.com/AlexeyAB/darknet/files/3199770/yolo_v3_tiny_lstm.cfg.txt
  • and yolov3-tiny_3l model: https://github.com/AlexeyAB/darknet/files/3199607/yolov3-tiny_3l.cfg.txt

Is mAP higher with the LSTM model in your case?

If the LSTM model turns out to be better, I will add a train_seq_frames_num=16 parameter to the cfg file, so you will not need to change the source code.

AlexeyAB avatar Dec 06 '19 19:12 AlexeyAB

What is the format of train.txt for 16 consecutive frames, and what would be the format of the .txt file for each image?

jamessmith90 avatar Dec 09 '19 10:12 jamessmith90

Everything is the same as usual.

train.txt

video_1_frame_1.jpg
video_1_frame_2.jpg
video_1_frame_3.jpg
....
video_1_frame_16.jpg
video_2_frame_1.jpg
video_2_frame_2.jpg
...
video_2_frame_16.jpg
....

Label files such as video_1_frame_1.txt are as usual: https://github.com/AlexeyAB/darknet#how-to-train-to-detect-your-custom-objects

0 0.5 0.5 0.2 0.2
1 0.3 0.3 0.1 0.1
....
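As a rough illustration of the train.txt layout above, the following Python sketch groups frame paths per video into whole 16-frame sequences. It is an assumption-laden helper, not part of darknet: `build_train_list` is a hypothetical name, the frame lists are expected to already be in temporal order, and dropping leftover frames at the end of a video is a choice made here so that no sequence mixes two videos.

```python
from typing import Dict, List


def build_train_list(videos: Dict[str, List[str]], seq_len: int = 16) -> List[str]:
    """Return frame paths grouped into whole sequences of seq_len frames.

    videos maps a video name to its frame paths in temporal order.
    Leftover frames at the end of each video are dropped (an assumption of
    this sketch), so the result length is always a multiple of seq_len.
    """
    lines: List[str] = []
    for name in sorted(videos):
        frames = videos[name]
        usable = (len(frames) // seq_len) * seq_len
        lines.extend(frames[:usable])
    return lines


if __name__ == "__main__":
    # Hypothetical file names following the pattern shown above.
    videos = {
        "video_1": [f"video_1_frame_{i}.jpg" for i in range(1, 21)],  # 20 frames
        "video_2": [f"video_2_frame_{i}.jpg" for i in range(1, 17)],  # 16 frames
    }
    train_lines = build_train_list(videos)
    with open("train.txt", "w") as f:
        f.write("\n".join(train_lines) + "\n")
```

Each listed image still needs its own label .txt file in the usual format, one object per line, as in the example above.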

AlexeyAB avatar Dec 09 '19 11:12 AlexeyAB

I have added the model for training.

I have another question. UCF-101 (https://www.crcv.ucf.edu/data/UCF101.php) has 101 categories for action recognition. Can the same logic be used for video classification?

uday60 avatar Dec 12 '19 08:12 uday60

@AlexeyAB Is it necessary to include 16 frames? Can I use 12?

jamessmith90 avatar Dec 17 '19 12:12 jamessmith90

@jamessmith90 Yes, you can. The more frames, the better.

AlexeyAB avatar Dec 17 '19 13:12 AlexeyAB

@AlexeyAB @jamessmith90 How do I run inference with a YOLO LSTM model on a video? Is it normal frame-by-frame, or should I feed a total of 16 frames at a time?

snehashis1997 avatar Dec 25 '23 18:12 snehashis1997