
Improvement on motion-cnn result: 84.1% on split-1, with VGG-16

Open gaosh opened this issue 6 years ago • 42 comments

Hi, all

I did some investigation into why the motion-cnn result is much lower than in the original paper. After a simple modification, I am able to achieve 84.1% top-1 accuracy. The modification is adding transforms.FiveCrop() to the transformation; before it, the result is only 80.5%. I use a pretrained model from https://github.com/feichtenhofer/twostreamfusion, and I think further improvement can be achieved with transforms.TenCrop().

I think this modification can bridge the performance gap between two-stream models trained in PyTorch and those trained in other frameworks.

gaosh avatar Sep 11 '18 23:09 gaosh
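
For readers who want to try this, below is a minimal sketch of what such a test-time transform could look like with TenCrop (FiveCrop plus horizontal flips). The variable name is illustrative, and the normalization statistics are the ImageNet ones used for the spatial stream later in this thread; they are only an assumption for the motion stream.

import torch
from torchvision import transforms

# Hypothetical test-time transform: crop the four corners, the center, and
# their horizontal flips, then normalize each crop individually.
ten_crop_test = transforms.Compose([
    transforms.Resize(256),
    transforms.TenCrop(224),  # use transforms.FiveCrop(224) for five crops only
    transforms.Lambda(lambda crops: torch.stack([
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225])(transforms.ToTensor()(crop))
        for crop in crops
    ])),
])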

I have a problem with the accuracy: when I only use a center crop (224, 224) with 25 sampled frames, I can get about 80% on the RGB modality, but when I use five-crop or ten-crop my accuracy decreases a lot, regardless of the network (ResNet, Inception-v1, Inception-v2). Can you explain why?

imnotk avatar Sep 13 '18 07:09 imnotk

You should use this data augmentation during training to get the desired results.

gaosh avatar Sep 13 '18 19:09 gaosh

You can refer to the related papers: during training, extensive data augmentation is used, such as multi-scale and corner crops. The author of this project only used very simple data augmentation.

gaosh avatar Sep 13 '18 19:09 gaosh
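
As a rough illustration of the kind of corner-crop / multi-scale augmentation gaosh refers to, here is a small sketch; the scale set and crop positions below are assumptions in the spirit of the original papers, not the authors' exact values.

import random
from torchvision.transforms import functional as F

def multiscale_corner_crop(img, out_size=224, scales=(256, 224, 192, 168)):
    # pick a random square crop size and a random position (4 corners + center),
    # then resize the crop back to the network input size
    w, h = img.size
    crop = min(random.choice(scales), w, h)
    positions = [(0, 0), (w - crop, 0), (0, h - crop),
                 (w - crop, h - crop), ((w - crop) // 2, (h - crop) // 2)]
    left, top = random.choice(positions)
    img = F.crop(img, top, left, crop, crop)   # args: top, left, height, width
    return F.resize(img, [out_size, out_size])

It could be dropped into the existing training pipeline via transforms.Lambda(multiscale_corner_crop) before transforms.ToTensor().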

I used the augmentation on the train split, but when it is also used on the test split the accuracy is far below 80%, while using only a center crop on the test split gives almost 80%. Maybe there is some difference between TF and PyTorch.

imnotk avatar Sep 14 '18 00:09 imnotk

That's weird. You can try the two VGG-16 models I converted from the project of their paper (https://github.com/feichtenhofer/twostreamfusion); the link for the models is https://drive.google.com/file/d/1JydxdPMEHU7uJnRyi8A8uF82jSgE9FGe/view?usp=sharing.

gaosh avatar Sep 14 '18 19:09 gaosh
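
If someone wants to try the shared checkpoints, a loading sketch might look like the following. This assumes the download is a plain state_dict laid out like torchvision's VGG-16 with a 101-way classifier, which may not match the actual converted file; the filename is hypothetical.

import torch
import torch.nn as nn
from torchvision import models

model = models.vgg16()
model.classifier[6] = nn.Linear(4096, 101)  # UCF-101 has 101 classes
# hypothetical filename for the downloaded checkpoint
state = torch.load('vgg16_rgb_ucf101.pth', map_location='cpu')
model.load_state_dict(state)
model.eval()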

I have chosen the Only Testing option, but the result shows that it still trains on the data. It is so weird. Can anyone give me some tips?

sxzy avatar Oct 10 '18 10:10 sxzy

What are the results you got? You'd better open a new issue to discuss this.

gaosh avatar Oct 10 '18 20:10 gaosh

@gaosh Hello, I'm ready to add this trick you mentioned, but I am confused. This is the official docs' way to use FiveCrop:

transform = Compose([
    FiveCrop(size),  # this is a list of PIL Images
    Lambda(lambda crops: torch.stack([ToTensor()(crop) for crop in crops]))  # returns a 4D tensor
])

and I am confused, because in the code we have already done some augmentation like

      training_set = spatial_dataset(dic=self.dic_training, root_dir=self.data_path, mode='train', transform = transforms.Compose([
                transforms.RandomCrop(224),
                transforms.RandomHorizontalFlip(),
                transforms.ToTensor(),
                transforms.Normalize(mean=[0.485, 0.456, 0.406],std=[0.229, 0.224, 0.225])
                ]))

and I wonder how to add FiveCrop into it?

sxzy avatar Oct 22 '18 12:10 sxzy

I think you need to use a lambda expression, for example:

transforms.Compose([
    transforms.Resize(256),
    transforms.FiveCrop([224, 224]),
    transforms.Lambda(lambda crops: torch.stack([
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225])(transforms.ToTensor()(crop))
        for crop in crops
    ]))
])

gaosh avatar Oct 24 '18 04:10 gaosh

Thank you, you are really great.

sxzy avatar Oct 25 '18 05:10 sxzy

I also wonder how high your accuracy is in the spatial part. I have tried pretraining with VGG-16 in the spatial part, but the result is not very satisfying; I just get around 72%.

sxzy avatar Oct 25 '18 07:10 sxzy

I used a trained model from another project; I provided the link to the converted PyTorch models in a previous comment in this issue. When testing with 5-crops/center-crop, I can achieve around 82%/78% accuracy with the spatial part.

gaosh avatar Oct 26 '18 18:10 gaosh

As you mentioned, you used trained models from another project, two-stream fusion. Can you share your PyTorch code for that project? I have noticed that you shared the pretrained models, which I appreciate, but I wonder if you can also share the code. I have read the paper and the implementation is complicated for me right now, so I would appreciate it if you could share your PyTorch code for this project.

sxzy avatar Nov 06 '18 15:11 sxzy

Right now I may not have time to share my code, but after the CVPR deadline I will refine the code for this project and make it publicly available. Regarding two-stream fusion, I didn't implement their code in PyTorch; I just converted their pretrained model to PyTorch.

gaosh avatar Nov 06 '18 19:11 gaosh

OK, looking forward to the new post, and good luck with CVPR.

sxzy avatar Nov 09 '18 11:11 sxzy

How do you achieve accuracy around 80%? When I train the network, the validation loss oscillates and never really improves. What is the measurement of accuracy here? Like @sxzy, I can't even run the test-only mode. We cannot use the validation pass as a test as-is, because the parameters are updated (so it is part of training). I get these problems (both the training behavior and not being able to test only) even when I use the pretrained model and resume from it.

duygusar avatar Mar 04 '19 14:03 duygusar

@gaosh Have you also augmented the motion data? The authors do not, and I would assume it would not be wise to do so because we would lose the motion information; alas, I need to reduce overfitting.

duygusar avatar Mar 05 '19 12:03 duygusar

@duygusar The motion data is also augmented. I think the authors of several early action recognition papers suggest augmenting the motion data, since it tends to overfit if no augmentation is applied. I am quite certain that corner cropping will improve the results: with corner cropping I achieve 59.9% accuracy on HMDB-51, and without it the accuracy is around 57.3%. The result is based on a model pretrained on ImageNet.

gaosh avatar Mar 05 '19 20:03 gaosh

@gaosh Thanks. I used RandomCrop for the training motion data (and CenterCrop for the evaluation data) and then normalized the data to [0, 1]. Now I don't get crazy jumps in my validation loss, but the precision I get is 60-70% (ResNet, on the first 6 classes of UCF101, which should be much higher than UCF101 overall, and which is small but balanced enough to train without overfitting). Isn't your UCF101 accuracy (around 80%) overfitting? When I run the code as-is, I do get 80% and above (for 6 classes), but the network does not really converge, and it would be a false measure without handling the cyclical jumps in validation loss and the overfitting, no?

duygusar avatar Mar 06 '19 13:03 duygusar

@duygusar You don't have to worry too much about overfitting at the beginning. 60-70% accuracy is lower than expected, and I think it's unrelated to overfitting; just train longer and track the changes in the training loss. Also, if you use a small model like ResNet-18, the final performance will be lower than the results reported in this repo.

gaosh avatar Mar 06 '19 21:03 gaosh

@gaosh When I shuffle the evaluation set I get low accuracy; when I don't shuffle (in the repository it is not shuffled), it is around 80% but overfitting. I can tell that it overfits because the validation loss just won't go down after a while, and it definitely does not converge even with smaller learning rates. By the way, in the repository I think the test set actually refers to the evaluation set, is that correct? The evaluation set is not partitioned from the training set, right? Skimming through the code, I think "test" really refers to the evaluation set, and if you needed an actual test you would have to replace the test split with a new one (with unseen examples). I just found it peculiar and wanted to make sure I am correct about this. So I am confused about the reported accuracy, because there is no real test split. Is the accuracy in the README the validation accuracy?

duygusar avatar Mar 08 '19 10:03 duygusar

@duygusar The validation set in this code is different from the training set. I am not sure why you need to shuffle the validation set, but shuffling should not affect performance.

gaosh avatar Mar 12 '19 20:03 gaosh

@gaosh You are right, I don't need to shuffle, as it is irrelevant, but it does change the performance and I don't know why. The overfitting remains either way (validation accuracy might be high, but validation loss does not converge), and I think the reported performance might be on the validation set.

duygusar avatar Mar 13 '19 11:03 duygusar

@duygusar If the validation loss first goes down and then goes up, it may be related to overfitting. However, if the validation loss goes down and stays at a certain value, that is normal, even if the value is higher than the training loss.

gaosh avatar Mar 14 '19 18:03 gaosh

I trained the model with a pretrained ResNet-152, but I got an accuracy of only 30+%. I think it's too low, but I don't know how to improve it. I used OpenCV's open-source function to generate my flow images; could this cause the low accuracy?

DoubleYing avatar Mar 20 '19 01:03 DoubleYing

@DoubleYing Have you changed the number of classes accordingly? UCF has 101 classes; what is the number of classes for your dataset? OpenCV's flow is not great, but I think it shouldn't make a huge difference.

duygusar avatar Mar 20 '19 15:03 duygusar
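
For reference, adapting a torchvision backbone to a different number of classes, as suggested above, can be as simple as replacing the final layer. A minimal sketch, not the repo's exact code; num_classes is whatever your own label set requires.

import torch.nn as nn
from torchvision import models

num_classes = 101  # set this to your own dataset's class count
model = models.resnet152(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, num_classes)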

Yes, I have changed the number of classes, and now I'm considering changing the way I extract flow. If I get a good result later, I will note it here. Thanks for your answer.

DoubleYing avatar Mar 20 '19 15:03 DoubleYing

@DoubleYing On my dataset, which should be somewhat easy and balanced, I also get lower accuracies for motion. I also use cv2's Farneback (because it is easy and fast; I could change to a coarse-to-fine method, though I prefer a faster algorithm, and I will skip the deep-learning one they used because I have limited time before a deadline :( ). Did you manage to improve your results? @gaosh Do you have any references to your changes in the motion-cnn part (especially the motion dataloader, but if possible the VGG modifications in the network part too)? I would really appreciate it if you could point me to your changes. With 5 random crops, I have to handle a tuple of images instead of a PIL image (TypeError: pic should be PIL Image or ndarray. Got <type 'tuple'>), and I am confused about how to work around that in train/test; there are also the channels, and how to stack the five crops...

duygusar avatar Mar 24 '19 09:03 duygusar
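
Since OpenCV-based flow extraction comes up several times in this thread, below is a minimal sketch using cv2.calcOpticalFlowFarneback that writes the u/v components as 8-bit JPEGs in the frame%06d.jpg layout used by the dataloader code later in this issue. The function name is illustrative, and the +/-20 clipping bound is an assumption borrowed from common two-stream setups, not something confirmed here.

import os
import cv2
import numpy as np

def extract_farneback_flow(video_path, out_dir, bound=20.0):
    # write horizontal (u) and vertical (v) flow components as 8-bit JPEGs
    # under out_dir/u and out_dir/v, named frame000001.jpg, frame000002.jpg, ...
    os.makedirs(os.path.join(out_dir, 'u'), exist_ok=True)
    os.makedirs(os.path.join(out_dir, 'v'), exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    ok, prev = cap.read()
    if not ok:
        return
    prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
    idx = 1
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        for k, name in enumerate(('u', 'v')):
            # clip to [-bound, bound], then rescale to [0, 255] for storage
            comp = np.clip(flow[..., k], -bound, bound)
            comp = ((comp + bound) * 255.0 / (2 * bound)).astype(np.uint8)
            cv2.imwrite(os.path.join(out_dir, name, 'frame%06d.jpg' % idx), comp)
        prev_gray = gray
        idx += 1
    cap.release()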

@gaosh Using Lambda, I get an error at line 55 in stackopf, flow[2*(j),:,:] = H:
RuntimeError: expand(torch.FloatTensor{[5, 1, 224, 224]}, size=[224, 224]): the number of sizes provided (2) must be greater or equal to the number of dimensions in the tensor (4)

and when I try to set flow = torch.FloatTensor(5, 2*self.in_channel, self.img_rows, self.img_cols)

I get motion_dataloader.py, line 55, in stackopf, flow[:,2*(j),:,:] = H:
RuntimeError: expand(torch.FloatTensor{[5, 1, 224, 224]}, size=[5, 224, 224]): the number of sizes provided (3) must be greater or equal to the number of dimensions in the tensor (4)

When I multiply the returned train batch size by 5, I also get the same error.

duygusar avatar Mar 24 '19 12:03 duygusar

You also need to modify the code within motion_dataloader.py:

def stackopf(self, video_name, clip_idx, nb_clips=None):
        name = 'v_' + video_name
        u = self.flow_root_dir + 'u/' + name
        v = self.flow_root_dir + 'v/' + name

        if self.fiveCrops:
            self.ncrops = 5
        else:
            self.ncrops = 1

        # keep a leading crop dimension: 5 when FiveCrop is used, 1 otherwise
        flow = torch.FloatTensor(self.ncrops, 2 * self.in_channel, self.img_rows, self.img_cols)
        #i = int(self.clips_idx)
        i = clip_idx

        for j in range(self.in_channel):
            idx = i + j
            if self.mode == 'train':
                if idx >= nb_clips+1:
                    idx = nb_clips+1
            idx = str(idx)


            frame_idx = 'frame' + idx.zfill(6)
            h_image = u + '/' + frame_idx + '.jpg'
            v_image = v + '/' + frame_idx + '.jpg'

            imgH = (Image.open(h_image))
            imgV = (Image.open(v_image))

            H = self.flow_transform(imgH)
            V = self.flow_transform(imgV)

         
            if self.fiveCrops:
                # with FiveCrop the transform yields a (5, 1, H, W) tensor per
                # grayscale flow image, so drop the singleton channel first
                flow[:, 2 * (j - 1), :, :] = H.squeeze()
                flow[:, 2 * (j - 1) + 1, :, :] = V.squeeze()
            else:
                flow[:, 2 * (j - 1), :, :] = H
                flow[:, 2 * (j - 1) + 1, :, :] = V

            imgH.close()
            imgV.close()

        return flow.squeeze()

Please also notice that each flow batch returned by the dataloader will have size (batchsize, n_crops, n_channels, height, width). You need to reshape the batch to (n_crops*batchsize, n_channels, height, width) before the forward pass. You can check the official PyTorch reference for FiveCrop too.

gaosh avatar Mar 25 '19 14:03 gaosh
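
To close the loop, here is a minimal sketch of the test-time reshape gaosh describes, assuming a trained model and a test loader yielding (data, label) batches of shape (batchsize, n_crops, n_channels, height, width); the names model and test_loader are illustrative, not the repo's.

import torch

model.eval()
with torch.no_grad():
    for data, label in test_loader:
        bs, ncrops, c, h, w = data.size()
        output = model(data.view(-1, c, h, w))        # fold the crops into the batch
        output = output.view(bs, ncrops, -1).mean(1)  # average predictions over crops
        pred = output.argmax(dim=1)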