RNN-for-Human-Activity-Recognition-using-2D-Pose-Input
About classifying multiple people
Hi @stuarteiffert, I'm very glad to have found your project, since I had a similar idea. I wonder: if there are multiple people in a video sequence, how can we handle that situation properly? My second question: if CNN features and pose information were both fed as input to the LSTM, the way image captioning does, do you think that would work well?
Hi,
Most recent pose estimation methods work fine with multiple people in a scene; you would just need to make sure you associate each detection between frames correctly. OpenPose does this already, but if you were using something that doesn't track, associating skeletons based on the pixel x/y positions of a bounding box per skeleton would likely work fine. In a crowded scene with heavy occlusion you would need a better tracking method.
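The bounding-box association idea above can be sketched as a greedy nearest-centroid match between frames. This is a minimal illustration, not a production tracker (real systems use e.g. Hungarian matching plus motion models); the function name, the `max_dist` threshold, and the centroid representation are all assumptions for the example:

```python
import numpy as np

def associate_skeletons(prev_centroids, curr_centroids, max_dist=50.0):
    """Greedily match each current-frame skeleton to the nearest
    unmatched previous-frame skeleton by centroid distance.
    Returns a dict mapping current index -> previous index (or None
    for a new track)."""
    matches = {}
    used = set()
    for i, c in enumerate(curr_centroids):
        dists = [np.linalg.norm(c - p) if j not in used else np.inf
                 for j, p in enumerate(prev_centroids)]
        j = int(np.argmin(dists)) if dists else None
        if j is not None and dists[j] < max_dist:
            matches[i] = j
            used.add(j)
        else:
            matches[i] = None  # treat as a new track

    return matches

# Two people barely move between frames, so each keeps its identity
# even though the detection order is swapped:
prev = [np.array([100.0, 200.0]), np.array([400.0, 210.0])]
curr = [np.array([405.0, 212.0]), np.array([102.0, 198.0])]
print(associate_skeletons(prev, curr))  # {0: 1, 1: 0}
```

This breaks down exactly where the comment above says it would: under heavy occlusion or crossing paths, two centroids can be nearest to the wrong tracks.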
Yes, I think adding visual context as an input to the recurrent network, in the form of feature vectors taken from a CNN, would make it work better. It would require more data and a much larger network, possibly resulting in longer inference times. If you used a pretrained network like ResNet and took the output of the last conv layer as an input to the recurrent network, you could probably cut down on the need for more data and on training time.
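The fusion step this suggests is just per-frame concatenation of the pose vector with the CNN feature vector before the recurrent layer. A minimal sketch with numpy, where the dimensions (36-d pose from 18 joints x 2 coords, 2048-d ResNet-style feature) are assumptions and the actual feature extractor is not shown:

```python
import numpy as np

# Assumed dimensions: 18 joints x (x, y) -> 36-d pose per frame,
# plus a 2048-d feature from a pretrained CNN's final pooled output.
POSE_DIM, FEAT_DIM, SEQ_LEN = 36, 2048, 32

def build_sequence(poses, cnn_feats):
    """Concatenate per-frame pose and visual feature vectors into one
    (seq_len, pose_dim + feat_dim) array to feed the recurrent network."""
    return np.concatenate([poses, cnn_feats], axis=1)

poses = np.random.rand(SEQ_LEN, POSE_DIM)       # stand-in for tracked 2D poses
feats = np.random.rand(SEQ_LEN, FEAT_DIM)       # stand-in for ResNet features
seq = build_sequence(poses, feats)
print(seq.shape)  # (32, 2084)
```

The LSTM's input dimension then grows from 36 to 2084, which is where the extra parameters (and the extra data to fit them) come from.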
Hello, I've been trying to do custom activity detection in a multi-skeleton setting. I would like to know whether it is possible to detect a custom labelled activity (on a custom dataset), and if so, how I can implement it with two skeletons as input in order to detect a particular interaction activity?
@stuarteiffert great work on implementing an LSTM over OpenPose skeleton datapoints! Would really appreciate it if you could answer @nshreyasvi's question as well! Thank you!
Hi @nshreyasvi and @hwlee96, there's no reason that you couldn't apply the same idea to multi-skeletal inputs.
If you are focused just on 2-person interactions, I would start by simply doubling the input dimensions and using both skeletons as inputs. If you limit the dataset to only contain sequences where both skeletons are present, it should train the same as the 1-skeleton model. I assume you would need to change the architecture a bit, possibly with a larger number of hidden units per LSTM layer.
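Doubling the input dimensions just means flattening both skeletons into one per-frame vector. One subtlety worth making explicit: the person ordering must be consistent across the sequence, or the network sees the two people swapping channels. A sketch, assuming an 18-joint OpenPose-style skeleton and a simple leftmost-first ordering rule:

```python
import numpy as np

N_JOINTS = 18  # OpenPose COCO-style skeleton (assumed)

def two_person_input(skel_a, skel_b):
    """Stack two (n_joints, 2) skeletons into a single 4*n_joints
    per-frame input vector, so the one-skeleton LSTM architecture
    carries over with doubled input size."""
    # Put the leftmost skeleton first so the ordering is deterministic
    # across frames (any consistent rule would do).
    if skel_b[:, 0].mean() < skel_a[:, 0].mean():
        skel_a, skel_b = skel_b, skel_a
    return np.concatenate([skel_a.ravel(), skel_b.ravel()])

a = np.random.rand(N_JOINTS, 2) + 100  # person near x ~ 100
b = np.random.rand(N_JOINTS, 2) + 300  # person near x ~ 300
x = two_person_input(b, a)  # order of arguments doesn't matter
print(x.shape)  # (72,)
```

Per-frame input size goes from 36 to 72; the rest of the pipeline (windowing into sequences, the LSTM itself) is unchanged apart from the input layer and possibly the hidden size.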
I don't think this is the best approach, though. I think you would need to include high-level reasoning in the network for multi-person interactions, rather than just an end-to-end recurrent network. Check out https://github.com/google/next-prediction; they focus mainly on person-object interactions, but I think a similar approach with multiple people would work better than just extending the input dimensions of my model.
@stuarteiffert Thank you for the reply and suggestions! Will certainly try them out! Indeed, my goal is to perform predictions on 2-person interactions (based on skeleton features, for faster inference). Do you know of other implementations that achieve this?
Have you done LSTM on multi-skeletal datapoints? @hwlee96 @mengzhangjian