
pretrained weights

Open hassony2 opened this issue 6 years ago • 21 comments

Hi,

First, thank you very much for contributing this c3d implementation in pytorch! I had a question about the origin of the pretrained weights: did you obtain them by converting them from another source, or by training the network yourself?

hassony2 avatar Aug 01 '17 08:08 hassony2

Hi! Thanks for your interest. Here's the thing:

I started from the weights provided in keras by this gist. Such weights were ported from the original caffe repo.

In the keras porting gist, a simple prediction script is presented along with its results, which I copy-paste here:

Top 5 probabilities and labels:
0.45910 basketball
0.39566 streetball
0.02090 greco-roman wrestling
0.01479 freestyle wrestling
0.01391 slamball

Now, if I run the Keras code loading the weights provided, I obtain

Top 5 probabilities and labels:
0.52728 basketball
0.29820 streetball
0.02856 greco-roman wrestling
0.02103 freestyle wrestling
0.01411 wrestling

So, something in the keras weights drifted a bit. My pytorch port yields these last results. This means that my keras->pytorch port is correct, but the discrepancy between my keras results and the gist's keras results propagates to pytorch as well :(

Hope I made the issue clear.

I tried to downgrade my keras to many older versions (back 'till 1.0.2), but still obtained the same results.

I leave this issue open since I'm still trying to align perfectly with the keras gist. Happy to get insights from you or anyone else.

Best, D

DavideA avatar Aug 03 '17 15:08 DavideA

I saw your predict.py, but I didn't find the mean subtraction step. Maybe this is the reason!

happygds avatar Dec 27 '17 06:12 happygds

Has this been resolved by @happygds's solution? Or better, what kind of preprocessing do I have to do to use the weights?

hohmannr avatar Jul 10 '18 09:07 hohmannr

@DavideA I used your code with the weights of 'c3d.pickle' in PyTorch 0.4.0 to predict the action of Roger Federer, but I get results like these:

Top 5:
0.11345 backpacking (wilderness)
0.05290 hiking
0.05005 longboarding
0.02464 base jumping
0.02359 whitewater kayaking

That is a lot different from your results. Do you know the reason? Is it because we didn't use the mean subtraction step? @happygds

BarryBA avatar Sep 06 '18 09:09 BarryBA

@BarryBA I just ran the prediction script (without removing the mean) with PyTorch 0.4.0 and it provides me the correct results.

Really weird.

DavideA avatar Sep 06 '18 15:09 DavideA

I made a mistake in the stride and padding size of the pool5 layer, and now the results are correct! Small parameters, huge impact! Thank you very much for your reply. Good luck!

BarryBA avatar Sep 12 '18 07:09 BarryBA

Hi, thanks for the implementation. I don't know how to transfer the weights from keras to pytorch (.h5 --> pickle); could you show me your source code? Another question: you don't subtract the mean in your predict.py. Does that mean you didn't subtract the mean during training either?
Thanks

JJBOY avatar Nov 01 '18 02:11 JJBOY

Hi, and thanks for your interest.

I can share with you the snippet to save parameters from keras (v=1.2.2) to file:

# save weights from the keras model (`model` here is the keras C3D network with the gist weights loaded)
import os
import numpy as np
from keras.utils.conv_utils import convert_kernel

out_dir = 'layers_weights'
if not os.path.exists(out_dir):
    os.makedirs(out_dir)

for l in model.layers:
    layer_name = l.name
    if l.weights:
        w, b = l.get_weights()
        np.save(os.path.join(out_dir, layer_name + '_w.npy'), np.array(convert_kernel(w)))
        np.save(os.path.join(out_dir, layer_name + '_b.npy'), np.array(b))

Then as you build the pytorch model you can load them into the .data attribute of each variable.
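In case it helps, here is a minimal sketch of that loading side (hedged: the `transpose` arguments and attribute names such as `net.conv1` are assumptions about how your pytorch model is defined, not code from this repo; depending on the keras backend and dim ordering, the conv kernels may or may not need the transpose):

# load the .npy files saved above into a pytorch C3D model
import os.path
import numpy as np
import torch

out_dir = 'layers_weights'

def load_layer(module, layer_name, transpose=None):
    # copy saved keras weights into the .data tensors of a pytorch module
    w = np.load(os.path.join(out_dir, layer_name + '_w.npy'))
    b = np.load(os.path.join(out_dir, layer_name + '_b.npy'))
    if transpose is not None:
        # e.g. (4, 3, 0, 1, 2) if the kernels were saved as (kd, kh, kw, c_in, c_out)
        w = w.transpose(transpose)
    module.weight.data.copy_(torch.from_numpy(np.ascontiguousarray(w)))
    module.bias.data.copy_(torch.from_numpy(b))

# hypothetical usage, depending on the saved kernel layout:
# load_layer(net.conv1, 'conv1', transpose=(4, 3, 0, 1, 2))
# load_layer(net.fc6, 'fc6', transpose=(1, 0))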

Concerning the mean: if you train subtracting the mean, you test subtracting the mean. Otherwise you don't. I am not sure whether in the original caffe implementation this step was performed or not, I just wanted to reproduce the keras gist mentioned above.

Hope this helps, D

DavideA avatar Nov 01 '18 08:11 DavideA

Thank you, I know how to transfer the weights now. I thought there was a function that could transfer the weights from keras to pytorch directly, but it seems there is no such function.

JJBOY avatar Nov 01 '18 08:11 JJBOY

Hi, I am getting:

Top 5:
0.92100 tennis
0.01580 padel tennis
0.01240 softball
0.00891 soft tennis
0.00687 aggressive inline skating

EMCL avatar Dec 10 '18 04:12 EMCL

@EMCL you are right. If you just run predict.py on the video provided in this repository, you will get that; I got the same.

The predictions discussed in this thread are not about that video. They are about the video used here: https://gist.github.com/albertomontesg/d8b21a179c1e6cca0480ebdf292c34d2

I tested it on pytorch 1.0 + cuda 9.2 + python 3.6.5. It still works! I was able to get the following.

Top 5 probabilities and labels:
0.52728 basketball
0.29820 streetball
0.02856 greco-roman wrestling
0.02103 freestyle wrestling
0.01411 wrestling

Also, after applying the mean subtraction mentioned in #4, I was able to get the following.

Top 5:
0.84939 basketball
0.07358 streetball
0.01868 greco-roman wrestling
0.01477 freestyle wrestling
0.00922 volleyball

Hope this helps!

apple2373 avatar Feb 22 '19 06:02 apple2373

OK, I noticed the mean provided in #4 is actually wrong: that mean is for ImageNet, while it should be the Sports1M mean.

Luckily, I found the mean file here: https://github.com/albertomontesg/keras-model-zoo/blob/master/kerasmodelzoo/data/c3d_mean.npy. It should originally come from https://github.com/facebook/C3D/blob/master/C3D-v1.0/examples/c3d_feature_extraction/sport1m_train16_128_mean.binaryproto.

I checked c3d_mean.npy and found it has shape (1, 3, 16, 128, 171). I computed the channel-wise mean, which is (90.25, 97.66, 101.41) in BGR order.
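For reference, a minimal sketch of how those channel-wise values can be obtained from the mean file (the local file name is an assumption):

# channel-wise means of the Sports1M mean volume
import numpy as np

mean = np.load('c3d_mean.npy')              # shape (1, 3, 16, 128, 171), BGR order
channel_means = mean.mean(axis=(0, 2, 3, 4))
print(channel_means)                        # roughly [90.25, 97.66, 101.41]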

In short, I guess we should add:

X = get_sport_clip('roger')
X = Variable(X)
X.data[:, 0, :, :, :] -= 101.41 # R channel
X.data[:, 1, :, :, :] -= 97.66  # G channel
X.data[:, 2, :, :, :] -= 90.25 # B channel
X = X[:, [2,1,0], :, :, :] # channel swap
X = X.cuda()

After all these changes, I am getting

Top 5:
0.99994 tennis
0.00005 padel tennis
0.00001 soft tennis
0.00000 pickleball
0.00000 match play

for the images in this repo, and

Top 5:
0.83607 basketball
0.07075 streetball
0.02308 greco-roman wrestling
0.01863 freestyle wrestling
0.01521 volleyball

for the video dM06AMFLsrc.mp4

P.S. I checked several other C3D repos ported from caffe, but it seems like most do not correctly handle the mean subtraction and BGR ordering...

apple2373 avatar Feb 23 '19 05:02 apple2373

Hi everybody,

The last post of @apple2373 is helpful.

The original mean file, computed on Sports1M and provided in many C3D Caffe repos, is of size 3x16x128x171 (channels x frames x height x width). In these repos, a way to preprocess any video volume is to 1) resize every frame to 128x171 resolution, 2) subtract the mean from the video volume, and 3) center-crop the video volume to 112x112 by keeping the pixels [8:120, 30:142]. Following this strategy, I get the following results:

Basket clip
Top 5:
0.84280 basketball
0.06940 streetball
0.02143 volleyball
0.01706 greco-roman wrestling
0.01373 freestyle wrestling

Tennis clip
Top 5:
0.99995 tennis
0.00003 padel tennis
0.00001 pickleball
0.00001 soft tennis
0.00000 badminton

This seems like a fairer reproduction of the original C3D caffe repo, as it does not collapse the mean to a single channel-wise value across all spatial locations and frames. However, it has the prerequisite of resizing the frames to 128x171 before proceeding.
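For concreteness, here is a minimal sketch of that preprocessing (resize to 128x171, subtract the mean volume, center-crop to 112x112). The frame loading, file name, and channel order are assumptions, not code from this repo:

# preprocess a 16-frame clip the way the original caffe C3D repo does
import numpy as np
import cv2

def preprocess_clip(frames, mean_path='c3d_mean.npy'):
    # frames: list of 16 H x W x 3 images, in the same channel order as the mean file
    mean = np.load(mean_path).squeeze(0)                 # (3, 16, 128, 171)
    clip = np.array([cv2.resize(f, (171, 128)) for f in frames], dtype=np.float32)
    clip = clip.transpose(3, 0, 1, 2)                    # (3, 16, 128, 171)
    clip -= mean                                         # subtract the mean volume
    clip = clip[:, :, 8:120, 30:142]                     # center-crop to 112 x 112
    return clip[np.newaxis]                              # (1, 3, 16, 112, 112)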

A thing that needs to be clarified is the order of the channels in the mean file and the order of the channels in the image expected by C3D. I got the aforementioned results by loading the image with RGB-ordered channels, subtracting the mean file as is, and feeding the image as is. When reordering the first and third channels of the mean, the results (see below) weren't disappointing either, so I wouldn't exclude the possibility that the mean file has BGR-ordered channels. Any idea?

Here is a list of these tests (all of them including the resize to 128x171 and the center-crop to 112x112), with the corresponding results:

Reorder image from RGB to BGR, then subtract mean as is:

Basket clip
Top 5:
0.84280 basketball
0.06940 streetball
0.02143 volleyball
0.01706 greco-roman wrestling
0.01373 freestyle wrestling

Tennis clip
Top 5:
0.19663 tennis
0.10200 powerbocking
0.06381 sepak takraw
0.04477 soft tennis
0.03622 aggressive inline skating
0.00000 badminton

Reorder first and third mean channels, then subtract mean:

Basket clip
Top 5:
0.75358 basketball
0.10351 streetball
0.06247 volleyball
0.01341 greco-roman wrestling
0.01107 freestyle wrestling

Tennis clip
Top 5:
0.99985 tennis
0.00006 padel tennis
0.00006 pickleball
0.00002 soft tennis
0.00000 bowls

Subtract mean, reorder cropped image's channels from RGB to BGR:

Basket clip
Top 5:
0.75358 basketball
0.10351 streetball
0.06247 volleyball
0.01341 greco-roman wrestling
0.01107 freestyle wrestling

Tennis clip
Top 5:
0.23638 tennis
0.06171 powerbocking
0.05763 sepak takraw
0.05232 soft tennis
0.04792 aggressive inline skating

gzoumpourlis avatar Mar 09 '19 22:03 gzoumpourlis

Thank you all for your comments.

I guess the only way to validate the preprocessing is to measure test set accuracy on Sports1M. Monitoring softmax scores for just a couple of sample clips can be misleading :(

Taking a quick peek into the dataset, it seems like a non-trivial task. I hope to get some time to do it in the near to mid future.

D

DavideA avatar Mar 11 '19 09:03 DavideA

I made a mistake in the stride and padding size of the pool5 layer, and now the results are correct! Small parameters, huge impact! Thank you very much for your reply. Good luck!

@BarryBA could you please let me know the corresponding stride and padding size of the pool5 layer?

mrkstt avatar Mar 13 '19 09:03 mrkstt

@DavideA

Firstly, thank you for this c3d implementation in pytorch! I am trying to fine-tune the given model up to the FC6 layer. Following is my code:

net = C3D()
net.load_state_dict(torch.load('c3d.pickle'))
net = nn.Sequential(*list(net.children())[:-5])
output = net(X)

I end up with the following error:

RuntimeError: size mismatch, m1: [2048 x 4], m2: [8192 x 4096] at /pytorch/aten/src/THC/generic/THCTensorMathBlas.cu:266

Am I doing something wrong? Any help would be appreciated

sampycool avatar Mar 26 '19 02:03 sampycool

@sampycool

It seems like you cropped the network to fc6. That means you dropped the classifier of the model, so you cannot make predictions.

One way you can fine-tune up to fc6 is to exploit the torch.optim.Optimizer interface.

Instead of doing this:

opt = torch.optim.Adam(net.parameters(), lr=0.001)

you do this:

from itertools import chain
opt = torch.optim.Adam(chain(net.fc6.parameters(), net.fc7.parameters(), net.fc8.parameters()), lr=0.001)

or something equivalent.
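For instance, one such equivalent (a sketch, assuming the fc6/fc7/fc8 attribute names used above) is to freeze everything below fc6 so the optimizer only ever updates the fully-connected layers:

from itertools import chain
import torch

# freeze all parameters, then re-enable gradients only for fc6/fc7/fc8
for p in net.parameters():
    p.requires_grad = False
for p in chain(net.fc6.parameters(), net.fc7.parameters(), net.fc8.parameters()):
    p.requires_grad = True

opt = torch.optim.Adam(filter(lambda p: p.requires_grad, net.parameters()), lr=0.001)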

Hope this helps, D

DavideA avatar Mar 26 '19 07:03 DavideA

@DavideA

Thank you for your reply. You are right, I do not want the classification part of the model; I just want the 4096-dim vector that is the output of the fc6 layer. That 4096-dim representation will then serve as the input to my network.

So I just need the weights up to the fc6 layer, and to apply them to the video input to get the 4096-dim feature vector. Here is my entire code:

import numpy as np
import torch
import torch.nn as nn
from torch.autograd import Variable

from C3D_Model_RTA import C3D

class C3D_Model(nn.Module):
    activation = {}

    def __init__(self):
        super(C3D_Model, self).__init__()
        net_c3d = C3D()
        net_c3d.load_state_dict(torch.load('c3d.pickle'))
        modules = list(net_c3d.children())[:-5]
        self.new_model = nn.Sequential(*modules)

    def forward(self, x):
        """Extract feature vectors from input images."""
        features = self.new_model(x)
        return features

def c3Dfeatures(vector):
    X = Variable(torch.Tensor(vector))
    X = X.cuda()

    # get network pretrained model
    net = C3D_Model()
    # net = C3D()
    # net.load_state_dict(torch.load('c3d.pickle'))
    # net = nn.Sequential(*list(net.children())[:-5])
    # for p in net.parameters():
    #     p.requires_grad = False
    net.cuda()
    print(net)
    output = net(X)
    print("output type and shape : ", np.shape(output))

data_reshaped = np.load('pickle file')  # load the pickle file of the video

no_of_groups = data_reshaped.shape[1]
no_of_groups = (int)(np.true_divide(data_reshaped.shape[1], 16))
print(no_of_groups)
no_of_frames = 16
new_frame_data = np.zeros([1, 3, 16, 112, 112])
cnt = 0
for i in range(0, no_of_groups * 16, 16):
    # print(i)
    cnt = cnt + 1
    new_frame_data = data_reshaped[:, i:i + no_of_frames, :, :]
    new_frame_data = np.expand_dims(new_frame_data, axis=0)
    prediction = c3Dfeatures(new_frame_data)

I get the same error as mentioned in my previous comment, i.e.

RuntimeError: size mismatch, m1: [2048 x 4], m2: [8192 x 4096] at /pytorch/aten/src/THC/generic/THCTensorMathBlas.cu:266

It is my intuition that somewhere it is getting an error flattening/reshaping from the pool5 layer to the 8192-dim vector. Your thoughts and suggestions are much appreciated.
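That intuition sounds plausible: nn.Sequential over children() drops the .view(...) flatten that the original forward() applies between pool5 and fc6, so the pool5 output never becomes the 8192-dim vector fc6 expects. As a hedged workaround sketch (assuming the C3D class exposes an fc6 attribute, as referenced above), a forward hook keeps the original forward(), including its internal reshape:

import torch
from torch.autograd import Variable

net = C3D()
net.load_state_dict(torch.load('c3d.pickle'))
net.cuda().eval()

features = {}
def save_fc6(module, inp, out):
    # stash the 4096-dim fc6 activation every time the model runs
    features['fc6'] = out.detach()

net.fc6.register_forward_hook(save_fc6)

X = Variable(torch.Tensor(vector)).cuda()   # vector: a (1, 3, 16, 112, 112) clip
with torch.no_grad():
    net(X)                                  # full forward keeps the internal flatten
fc6_features = features['fc6']              # shape (1, 4096)

The extra fc7/fc8 layers cost a little compute, but nothing needs to be re-implemented.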

sampycool avatar Mar 26 '19 16:03 sampycool

@gzoumpourlis Hi! I notice your results are the same as @apple2373's when you 'think' you are using RGB. Maybe you read the image with OpenCV, and it is therefore in BGR order. So is it possible that you are actually using BGR while you suppose it is RGB?
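For anyone double-checking this, a minimal illustration of the point (OpenCV loads images in BGR order, so an explicit conversion is needed if the rest of the pipeline assumes RGB; the file name is hypothetical):

import cv2

img_bgr = cv2.imread('frame.jpg')                   # OpenCV returns BGR by default
img_rgb = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2RGB)  # convert if the model expects RGB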

gdbb avatar Dec 08 '20 22:12 gdbb

Hi guys, thanks for the explanations about the normalization. I tried the normalization method below, and the result is pretty good. If you are interested, you can definitely give it a try, too! :)

With Normalization & Channel swap:

    import torchvision.transforms as transforms
  
    X = get_sport_clip('roger')
    X = Variable(X)
    X.data = transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225))(X.data.permute(0, 2, 1, 3, 4))
    X = X.data.permute(0, 2, 1, 3, 4)[:, [2,1,0], :, :, :] # channel swap
    X = X.cuda()

Results:

Top 5:
1.00000 tennis
0.00000 padel tennis
0.00000 pickleball
0.00000 soft tennis
0.00000 match play

Only with Normalization, no channel swap:

    import torchvision.transforms as transforms
  
    X = get_sport_clip('roger')
    X = Variable(X)
    X.data = transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225))(X.data.permute(0, 2, 1, 3, 4))
    X = X.data.permute(0, 2, 1, 3, 4)
    X = X.cuda()

Results:

Top 5:
0.99993 tennis
0.00004 padel tennis
0.00002 soft tennis
0.00000 pickleball
0.00000 squash (sport)

foxingcoco avatar May 09 '21 21:05 foxingcoco

Any conclusion regarding how to properly feed the model with RGB clips? What are the correct normalization and cropping steps?

ekosman avatar Jul 02 '21 21:07 ekosman