c3d-pytorch
pretrained weights
Hi,
First, thank you very much for contributing this C3D implementation in PyTorch! I have a question about the origin of the pretrained weights: did you obtain them by converting them from another source, or by training the network yourself?
Hi! Thanks for your interest. Here's the thing:
I started from the weights provided in Keras by this gist. Those weights were ported from the original Caffe repo.
The Keras porting gist presents simple prediction code along with its results, which I copy here:
Top 5 probabilities and labels:
0.45910 basketball
0.39566 streetball
0.02090 greco-roman wrestling
0.01479 freestyle wrestling
0.01391 slamball
Now, if I run the Keras code loading the weights provided, I obtain
Top 5 probabilities and labels:
0.52728 basketball
0.29820 streetball
0.02856 greco-roman wrestling
0.02103 freestyle wrestling
0.01411 wrestling
So, something in the Keras weights drifted a bit. My PyTorch port yields these last results. This means that my Keras->PyTorch port is correct, but the discrepancy between my Keras run and the gist's Keras results is propagated to PyTorch as well :(
Hope I made the issue clear.
I tried downgrading Keras to many older versions (back to 1.0.2), but still obtained the same results.
I'll leave this issue open since I'm still trying to align perfectly with the Keras gist. Happy to get insights from you or anyone else.
Best, D
I saw your predict.py, but I didn't find the mean subtraction step. Maybe this is the reason!
Has this been resolved by @happygds's solution? Or, better, what kind of preprocessing do I have to do to use the weights?
@DavideA I used your code with the 'c3d.pickle' weights in PyTorch 0.4.0 to predict the action in the Roger Federer clip, but I get these results:
Top 5:
0.11345 backpacking (wilderness)
0.05290 hiking
0.05005 longboarding
0.02464 base jumping
0.02359 whitewater kayaking
This is very different from your results. Do you know why? Is it because we didn't apply the mean subtraction step? @happygds
@BarryBA I just ran the prediction script (without removing the mean) with PyTorch 0.4.0 and it provides me the correct results.
Really weird.
I made a mistake in the stride and padding size of the pool5 layer. Now the results are correct! Small parameters, huge impact! Thank you very much for your reply. Good luck!
Hi,
Thanks for the implementation.
I don't know how to transfer the weights from Keras to PyTorch (.h5 --> pickle). Could you show me your source code?
Another question: you don't subtract the mean in your predict.py. Does that mean you don't subtract the mean during training either?
Thanks
Hi and thanks for your interest.
I can share with you the snippet to save parameters from keras (v=1.2.2) to file:
# save weights
import os.path
import numpy as np
from keras.utils.conv_utils import convert_kernel
import os
out_dir = 'layers_weights'
if not os.path.exists(out_dir):
    os.makedirs(out_dir)
for l in model.layers:
    layer_name = l.name
    if l.weights:
        w, b = l.get_weights()
        np.save(os.path.join(out_dir, layer_name + '_w.npy'), np.array(convert_kernel(w)))
        np.save(os.path.join(out_dir, layer_name + '_b.npy'), np.array(b))
Then, as you build the PyTorch model, you can load them into the .data attribute of each variable.
Concerning the mean: if you train subtracting the mean, you test subtracting the mean. Otherwise you don't. I am not sure whether in the original caffe implementation this step was performed or not, I just wanted to reproduce the keras gist mentioned above.
Hope this helps, D
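For the loading step described above, here is a minimal sketch of what copying the saved .npy files into a PyTorch layer could look like. The helper name load_layer_from_npy and the toy fc layer are hypothetical, not code from the repo. The only layout fact it relies on is that Keras Dense kernels are stored as (in, out) while nn.Linear expects (out, in); the axis order required for the Conv3D kernels depends on the Keras backend ordering, which the thread does not state, so treat that part as something to verify with the shape check.

```python
import os
import tempfile

import numpy as np
import torch
import torch.nn as nn


def load_layer_from_npy(layer, weights_dir, layer_name, transpose=False):
    """Copy '<layer_name>_w.npy' and '<layer_name>_b.npy' into a PyTorch layer."""
    w = np.load(os.path.join(weights_dir, layer_name + '_w.npy'))
    b = np.load(os.path.join(weights_dir, layer_name + '_b.npy'))
    if transpose:
        # Keras Dense kernels are (in, out); nn.Linear stores (out, in).
        w = w.T
    w_t = torch.from_numpy(np.ascontiguousarray(w)).float()
    b_t = torch.from_numpy(np.ascontiguousarray(b)).float()
    # Fail loudly instead of silently loading a wrongly ordered kernel.
    if w_t.shape != layer.weight.shape or b_t.shape != layer.bias.shape:
        raise ValueError('shape mismatch for layer ' + layer_name)
    with torch.no_grad():
        layer.weight.copy_(w_t)
        layer.bias.copy_(b_t)


# Round trip on a toy dense layer (hypothetical stand-in for fc6/fc7/fc8).
weights_dir = tempfile.mkdtemp()
keras_like_w = np.random.rand(8, 4).astype(np.float32)  # (in, out)
keras_like_b = np.random.rand(4).astype(np.float32)
np.save(os.path.join(weights_dir, 'fc_w.npy'), keras_like_w)
np.save(os.path.join(weights_dir, 'fc_b.npy'), keras_like_b)

fc = nn.Linear(8, 4)
load_layer_from_npy(fc, weights_dir, 'fc', transpose=True)
```

Copying under torch.no_grad() is the current equivalent of assigning to .data and keeps the parameter objects registered on the module.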
Thank you. I know how to transfer the weights now. I thought there might be a function that transfers weights from Keras to PyTorch directly, but apparently there isn't one.
Hi, I am getting:
Top 5:
0.92100 tennis
0.01580 padel tennis
0.01240 softball
0.00891 soft tennis
0.00687 aggressive inline skating
@EMCL you are right. If you just run 'predict.py' on the video provided in this repository, you will get that. I got the same.
The predictions discussed earlier in this thread are not for that video. They refer to the video used here: https://gist.github.com/albertomontesg/d8b21a179c1e6cca0480ebdf292c34d2
I tested it on pytorch 1.0 + cuda 9.2 + python 3.6.5. It still works! I was able to get the following.
Top 5 probabilities and labels:
0.52728 basketball
0.29820 streetball
0.02856 greco-roman wrestling
0.02103 freestyle wrestling
0.01411 wrestling
Also, after I applied the mean subtraction mentioned here #4 I was able to get the following.
Top 5:
0.84939 basketball
0.07358 streetball
0.01868 greco-roman wrestling
0.01477 freestyle wrestling
0.00922 volleyball
Hope this helps!
OK, I noticed that the mean provided in #4 is actually wrong. That mean is for ImageNet; the mean should come from Sports1M.
Luckily, I found the mean file here: https://github.com/albertomontesg/keras-model-zoo/blob/master/kerasmodelzoo/data/c3d_mean.npy It should originally come from https://github.com/facebook/C3D/blob/master/C3D-v1.0/examples/c3d_feature_extraction/sport1m_train16_128_mean.binaryproto.
I checked c3d_mean.npy and found that its shape is (1, 3, 16, 128, 171). I computed the channel-wise mean, which is (90.25, 97.66, 101.41) in BGR order.
In short, I guess we should add:
X = get_sport_clip('roger')
X = Variable(X)
X.data[:, 0, :, :, :] -= 101.41 # R channel
X.data[:, 1, :, :, :] -= 97.66 # G channel
X.data[:, 2, :, :, :] -= 90.25 # B channel
X = X[:, [2,1,0], :, :, :] # channel swap
X = X.cuda()
After all these changes, I am getting
Top 5:
0.99994 tennis
0.00005 padel tennis
0.00001 soft tennis
0.00000 pickleball
0.00000 match play
for the images in this repo, and
Top 5:
0.83607 basketball
0.07075 streetball
0.02308 greco-roman wrestling
0.01863 freestyle wrestling
0.01521 volleyball
for the video dM06AMFLsrc.mp4
P.S. I checked several other C3D repos ported from Caffe, but it seems that most of them do not handle the mean subtraction and BGR ordering correctly...
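As a rough illustration of the channel-wise mean computation described above, the sketch below averages a mean volume of shape (1, 3, 16, 128, 171) over every axis except the channel axis, then subtracts the result from an RGB clip and swaps the channels. It uses a random array as a stand-in for np.load('c3d_mean.npy'), and it assumes (as the post does) that the file stores its channels in BGR order.

```python
import numpy as np

# Stand-in for np.load('c3d_mean.npy'); the real file has this shape.
mean_volume = (np.random.rand(1, 3, 16, 128, 171) * 255.0).astype(np.float32)

# Average over batch, frames, height, and width: one value per channel.
# Assumption: the channel axis of the mean file is ordered B, G, R.
channel_mean_bgr = mean_volume.mean(axis=(0, 2, 3, 4))
channel_mean_rgb = channel_mean_bgr[::-1]

# Hypothetical RGB clip laid out as (batch, channels, frames, height, width).
clip_rgb = (np.random.rand(1, 3, 16, 112, 112) * 255.0).astype(np.float32)

# Subtract the per-channel mean, then swap RGB -> BGR on the channel axis.
centered = clip_rgb - channel_mean_rgb.reshape(1, 3, 1, 1, 1)
clip_bgr = centered[:, ::-1]
```

Matching each mean value to its channel before swapping is the detail that is easiest to get wrong, which is why the two orderings are kept as separate named arrays here.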
Hi everybody,
The last post of @apple2373 is helpful.
The original mean file, computed on Sports1M and provided in many C3D Caffe repos, has size 3x16x128x171 (channels x frames x height x width). In these repos, one way to preprocess a video volume is to 1) resize every frame to 128x171, 2) subtract the mean from the video volume, and 3) center-crop the video volume to 112x112 by keeping the pixels [8:120, 30:142]. Following this strategy, I get the following results:
Basket clip
Top 5:
0.84280 basketball
0.06940 streetball
0.02143 volleyball
0.01706 greco-roman wrestling
0.01373 freestyle wrestling
Tennis clip
Top 5:
0.99995 tennis
0.00003 padel tennis
0.00001 pickleball
0.00001 soft tennis
0.00000 badminton
This seems like a fairer reproduction of the original C3D Caffe repo, as it does not reduce the mean to a single channel-wise value across all spatial locations and frames. However, it requires resizing the frames to 128x171 before proceeding.
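The subtract-then-crop part of that recipe can be sketched in a few lines of NumPy. The function name preprocess_clip is hypothetical, the resize to 128x171 is left to the caller, and the random arrays only stand in for a real clip and the real mean file.

```python
import numpy as np


def preprocess_clip(frames, mean_volume):
    """Subtract the full mean volume, then center-crop to 112x112.

    frames: array of shape (3, 16, 128, 171), already resized by the caller.
    mean_volume: array of the same shape (the Sports1M mean).
    Returns an array of shape (1, 3, 16, 112, 112).
    """
    if frames.shape != mean_volume.shape:
        raise ValueError('clip and mean volume must have the same shape')
    centered = frames.astype(np.float32) - mean_volume
    # Keep rows 8..119 and columns 30..141, as in the recipe above.
    cropped = centered[:, :, 8:120, 30:142]
    return cropped[np.newaxis]


# Stand-ins for a resized clip and for the mean file contents.
frames = (np.random.rand(3, 16, 128, 171) * 255.0).astype(np.float32)
mean_volume = (np.random.rand(3, 16, 128, 171) * 255.0).astype(np.float32)
batch = preprocess_clip(frames, mean_volume)
```

Subtracting before cropping matters because the mean is defined on the 128x171 grid; cropping first would misalign it with the frame pixels.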
One thing that needs to be clarified is the order of the channels in the mean file, and the order of the channels in the image that C3D expects. I got the results above by loading an image with RGB-ordered channels, subtracting the mean file as is, and feeding the image as is. When I swapped the first and third channels of the mean, the results (see below) weren't disappointing, so I wouldn't exclude the possibility that the mean file has BGR-ordered channels. Any idea?
Here are the results of these tests (all of them include the resize to 128x171 and the center-crop to 112x112):
Reorder image from RGB to BGR, then subtract mean as is:
Basket clip
Top 5:
0.84280 basketball
0.06940 streetball
0.02143 volleyball
0.01706 greco-roman wrestling
0.01373 freestyle wrestling
Tennis clip
Top 5:
0.19663 tennis
0.10200 powerbocking
0.06381 sepak takraw
0.04477 soft tennis
0.03622 aggressive inline skating
0.00000 badminton
Reorder first and third mean channels, then subtract mean:
Basket clip
Top 5:
0.75358 basketball
0.10351 streetball
0.06247 volleyball
0.01341 greco-roman wrestling
0.01107 freestyle wrestling
Tennis clip
Top 5:
0.99985 tennis
0.00006 padel tennis
0.00006 pickleball
0.00002 soft tennis
0.00000 bowls
Subtract mean, reorder cropped image's channels from RGB to BGR:
Basket clip
Top 5:
0.75358 basketball
0.10351 streetball
0.06247 volleyball
0.01341 greco-roman wrestling
0.01107 freestyle wrestling
Tennis clip
Top 5:
0.23638 tennis
0.06171 powerbocking
0.05763 sepak takraw
0.05232 soft tennis
0.04792 aggressive inline skating
Thank you all for your comments.
I guess the only way to validate preprocessing is to measure test set accuracy on Sports1M. Monitoring softmax scores for just a couple of sample clips can be misleading :(
Taking a quick peek into the dataset, it seems like a non-trivial task. I hopefully will get some time to do it in the near-mid future.
D
@BarryBA could you please let me know the stride and padding size you ended up using for the pool5 layer?
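@BarryBA's corrected values are not given in the thread. As a sanity check, the shape arithmetic below assumes the standard C3D layout: 3x3x3 convolutions with padding 1 (which preserve the shape), 512 channels before pool5, and the pooling settings listed in the table, including padding (0, 1, 1) on pool5. Under those assumptions the flattened pool5 output has exactly the 8192 features that fc6 expects.

```python
def pool_out(size, kernel, stride, padding=0):
    """Output length of a pooling layer along one axis (floor mode)."""
    return (size + 2 * padding - kernel) // stride + 1


# (name, kernel, stride, padding), each over (frames, height, width).
# Assumed values for the standard C3D architecture.
pools = [
    ('pool1', (1, 2, 2), (1, 2, 2), (0, 0, 0)),
    ('pool2', (2, 2, 2), (2, 2, 2), (0, 0, 0)),
    ('pool3', (2, 2, 2), (2, 2, 2), (0, 0, 0)),
    ('pool4', (2, 2, 2), (2, 2, 2), (0, 0, 0)),
    ('pool5', (2, 2, 2), (2, 2, 2), (0, 1, 1)),
]

shape = (16, 112, 112)  # one input clip: frames, height, width
for name, kernel, stride, padding in pools:
    shape = tuple(
        pool_out(s, k, st, p)
        for s, k, st, p in zip(shape, kernel, stride, padding)
    )
    print(name, shape)

flat_features = 512 * shape[0] * shape[1] * shape[2]
print('features into fc6:', flat_features)

# For comparison: pool5 without spatial padding gives 3x3 instead of 4x4.
unpadded = pool_out(7, 2, 2, 0)
```

With no padding on pool5 the spatial size drops to 3x3, giving 4608 features, so the fc6 weights would no longer line up with the activations they were trained on.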
@DavideA
First, thank you for this C3D implementation in PyTorch! I am trying to fine-tune the given model up to the fc6 layer. Here is my code:
net = C3D()
net.load_state_dict(torch.load('c3d.pickle'))
net = nn.Sequential(*list(net.children())[:-5])
output = net(X)
I end up with the following error:
RuntimeError: size mismatch, m1: [2048 x 4], m2: [8192 x 4096] at /pytorch/aten/src/THC/generic/THCTensorMathBlas.cu:266
Am I doing something wrong? Any help would be appreciated
@sampycool
It seems like you cropped the network at fc6. That means you dropped the classifier of the model, so you cannot make predictions.
One way to fine-tune down to fc6 (that is, to update only fc6, fc7 and fc8) is to exploit the torch.optim.Optimizer interface.
Instead of doing this:
opt = torch.optim.Adam(net.parameters(), lr=0.001)
you do this (with chain imported from itertools):
opt = torch.optim.Adam(chain(net.fc6.parameters(), net.fc7.parameters(), net.fc8.parameters()), lr=0.001)
or something equivalent.
Hope this helps, D
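To make the optimizer suggestion concrete, here is a small self-contained sketch. TinyNet is a hypothetical stand-in for C3D that only shares the layer names fc6, fc7 and fc8. Freezing the earlier layers with requires_grad = False is an optional extra, not something the reply above prescribes.

```python
from itertools import chain

import torch
import torch.nn as nn


class TinyNet(nn.Module):
    """Hypothetical stand-in for C3D: one conv layer plus fc6, fc7, fc8."""

    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv3d(3, 4, kernel_size=3, padding=1)
        self.fc6 = nn.Linear(4, 8)
        self.fc7 = nn.Linear(8, 8)
        self.fc8 = nn.Linear(8, 2)

    def forward(self, x):
        h = torch.relu(self.conv1(x))
        h = h.mean(dim=(2, 3, 4))  # global average pool -> (batch, 4)
        h = torch.relu(self.fc6(h))
        h = torch.relu(self.fc7(h))
        return self.fc8(h)


net = TinyNet()

# Optional: also stop gradient computation for the layers left untouched.
for p in net.conv1.parameters():
    p.requires_grad = False

# Only the parameters handed to the optimizer are ever updated.
opt = torch.optim.Adam(
    chain(net.fc6.parameters(), net.fc7.parameters(), net.fc8.parameters()),
    lr=0.001,
)

conv_before = net.conv1.weight.clone()
fc6_before = net.fc6.weight.clone()

x = torch.randn(2, 3, 4, 8, 8)
loss = net(x).pow(2).sum()
opt.zero_grad()
loss.backward()
opt.step()
```

After the step, conv1 is unchanged while the fully connected layers have moved, which is the behavior the reply describes.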
@DavideA
Thank you for your reply. You are right, I do not want the classification part of the model. I just want the 4096-dimensional vector that is the output of the fc6 layer. That vector will then serve as the input to my network.
So I just need the weights up to the fc6 layer, applied to the video input, to get the 4096-dimensional feature vector. Here is my entire code:
import numpy as np
import torch.nn as nn
import torch
from torch.autograd import Variable

from C3D_Model_RTA import C3D


class C3D_Model(nn.Module):
    activation = {}

    def __init__(self):
        super(C3D_Model, self).__init__()
        net_c3d = C3D()
        net_c3d.load_state_dict(torch.load('c3d.pickle'))
        modules = list(net_c3d.children())[:-5]
        self.new_model = nn.Sequential(*modules)

    def forward(self, x):
        """Extract feature vectors from input images."""
        features = self.new_model(x)
        return features


def c3Dfeatures(vector):
    X = Variable(torch.Tensor(vector))
    X = X.cuda()
    # get network pretrained model
    net = C3D_Model()
    # net = C3D()
    # net.load_state_dict(torch.load('c3d.pickle'))
    # net = nn.Sequential(*list(net.children())[:-5])
    # for p in net.parameters():
    #     p.requires_grad = False
    net.cuda()
    print(net)
    output = net(X)
    print("output type and shape : ", np.shape(output))


data_reshaped = np.load('pickle file')  # load the pickle file of the video
no_of_groups = data_reshaped.shape[1]
no_of_groups = (int)(np.true_divide(data_reshaped.shape[1], 16))
print(no_of_groups)
no_of_frames = 16
new_frame_data = np.zeros([1, 3, 16, 112, 112])
cnt = 0
for i in range(0, no_of_groups * 16, 16):
    # print(i)
    cnt = cnt + 1
    new_frame_data = data_reshaped[:, i:i + no_of_frames, :, :]
    new_frame_data = np.expand_dims(new_frame_data, axis=0)
    prediction = c3Dfeatures(new_frame_data)
I get the same error as mentioned in my previous comment, i.e.
RuntimeError: size mismatch, m1: [2048 x 4], m2: [8192 x 4096] at /pytorch/aten/src/THC/generic/THCTensorMathBlas.cu:266
My intuition is that something goes wrong when flattening/reshaping the pool5 output into the 8192-dimensional vector. Your thoughts and suggestions are much appreciated.
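The numbers in the error message fit that intuition: an unflattened pool5 output of shape (1, 512, 1, 4, 4) is read by a linear layer as 512 * 1 * 4 = 2048 rows of 4 features, which matches m1: [2048 x 4]. A likely cause, assuming C3D's forward() reshapes the pool5 output in plain Python code before fc6, is that nn.Sequential(*net.children()) chains only the registered modules and silently drops every operation written directly in forward() (the reshape, and any activation applied there). The toy model below is a hypothetical miniature with that same pattern, not the repo's code; it reproduces the failure and shows one way around it.

```python
import torch
import torch.nn as nn


class ToyC3D(nn.Module):
    """Hypothetical miniature: the flatten lives in forward-time code."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Conv3d(3, 8, kernel_size=3, padding=1)
        self.pool = nn.MaxPool3d(kernel_size=2, stride=2)
        self.fc6 = nn.Linear(8 * 1 * 2 * 2, 16)
        self.fc7 = nn.Linear(16, 4)

    def features(self, x):
        """Run the network up to and including fc6."""
        h = self.pool(torch.relu(self.conv(x)))
        h = h.view(h.size(0), -1)  # flatten: not a child module
        return torch.relu(self.fc6(h))

    def forward(self, x):
        return self.fc7(self.features(x))


net = ToyC3D()
x = torch.randn(1, 3, 2, 4, 4)

# Chaining the children drops the flatten, so fc6 receives a 5-D tensor.
truncated = nn.Sequential(*list(net.children())[:-1])  # conv, pool, fc6
try:
    truncated(x)
    sequential_failed = False
except RuntimeError:
    sequential_failed = True

# Reusing the model's own forward-time code keeps the reshape in place.
fc6_features = net.features(x)
```

For the real model, a comparable approach would be a method (or subclass) that repeats the original forward() up to fc6 and returns there, rather than rebuilding the network from its children.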
Hi! Regarding the mean-file experiments above (resize to 128x171, subtract the mean, center-crop): I notice that your results are the same as @apple2373's when you 'think' you are using RGB. Maybe you read the images with OpenCV, which returns them in BGR order. So is it possible that you are actually using BGR while assuming it is RGB?
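A quick way to rule this out is to make the channel order explicit right after loading. The sketch below uses a synthetic array as a stand-in for a frame returned by cv2.imread, which yields (height, width, 3) arrays with channels ordered B, G, R; reversing the last axis converts it to RGB.

```python
import numpy as np

# Stand-in for a cv2.imread result: (height, width, 3), channels B, G, R.
frame_bgr = np.zeros((128, 171, 3), dtype=np.uint8)
frame_bgr[..., 0] = 10  # blue
frame_bgr[..., 1] = 20  # green
frame_bgr[..., 2] = 30  # red

# Reverse the channel axis to get RGB; copy() makes the result contiguous.
frame_rgb = frame_bgr[..., ::-1].copy()
```

Naming arrays with a _bgr or _rgb suffix, as done here, makes it harder to subtract a BGR mean from an RGB frame by accident.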
Hi guys, thanks for the explanations about the normalization. I tried the normalization method below, and the result is pretty good. If you are interested, you can definitely give it a try, too! :)
With normalization and channel swap:
import torchvision.transforms as transforms
X = get_sport_clip('roger')
X = Variable(X)
X.data = transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225))(X.data.permute(0, 2, 1, 3, 4))
X = X.data.permute(0, 2, 1, 3, 4)[:, [2,1,0], :, :, :] # channel swap
X = X.cuda()
Results:
Top 5:
1.00000 tennis
0.00000 padel tennis
0.00000 pickleball
0.00000 soft tennis
0.00000 match play
Only with Normalization, no channel swap:
import torchvision.transforms as transforms
X = get_sport_clip('roger')
X = Variable(X)
X.data = transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225))(X.data.permute(0, 2, 1, 3, 4))
X = X.data.permute(0, 2, 1, 3, 4)
X = X.cuda()
Results:
Top 5:
0.99993 tennis
0.00004 padel tennis
0.00002 soft tennis
0.00000 pickleball
0.00000 squash (sport)
Any conclusion regarding how to properly feed the model with RGB clips? What are the correct normalization and cropping steps?