
Improvement on motion-cnn result: 84.1% on split-1, with VGG-16

Open gaosh opened this issue 6 years ago • 42 comments

Hi, all

I did some investigation into why the motion-cnn result is much lower than in the original paper. After a simple modification, I am able to achieve 84.1% top-1 accuracy. The modification is adding transforms.FiveCrop() to the transformation; before it, the result is only 80.5%. I use a pretrained model from https://github.com/feichtenhofer/twostreamfusion, and I think further improvement can be achieved with transforms.TenCrop().

I think this modification can bridge the performance gap between two-stream models trained in PyTorch and those trained in other frameworks.

gaosh avatar Sep 11 '18 23:09 gaosh
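
For readers who want to try this, below is a minimal sketch of what such a test-time transform could look like with TenCrop (FiveCrop plus horizontal flips). The variable name is illustrative, and the normalization statistics are the ImageNet ones used for the spatial stream later in this thread; they are only an assumption for the motion stream.

import torch
from torchvision import transforms

# Hypothetical test-time transform: crop the four corners, the center, and
# their horizontal flips, then normalize each crop individually.
ten_crop_test = transforms.Compose([
    transforms.Resize(256),
    transforms.TenCrop(224),  # use transforms.FiveCrop(224) for five crops only
    transforms.Lambda(lambda crops: torch.stack([
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225])(transforms.ToTensor()(crop))
        for crop in crops
    ])),
])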

I have a problem with the accuracy: when I only use a center crop (224, 224) with 25 sampled frames, I can get about 80% on the RGB modality, but when I use five-crop or ten-crop my accuracy decreases a lot, regardless of the network (ResNet, Inception-v1, Inception-v2). Can you explain why?

imnotk avatar Sep 13 '18 07:09 imnotk

You should use this data augmentation during training to get the desired results.

gaosh avatar Sep 13 '18 19:09 gaosh

You can refer to the related papers: during training, extensive data augmentation is used, such as multi-scale and corner crops. The author of this project only used very simple data augmentation.

gaosh avatar Sep 13 '18 19:09 gaosh
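
As a rough illustration of the kind of corner-crop / multi-scale augmentation gaosh refers to, here is a small sketch; the scale set and crop positions below are assumptions in the spirit of the original papers, not the authors' exact values.

import random
from torchvision.transforms import functional as F

def multiscale_corner_crop(img, out_size=224, scales=(256, 224, 192, 168)):
    # pick a random square crop size and a random position (4 corners + center),
    # then resize the crop back to the network input size
    w, h = img.size
    crop = min(random.choice(scales), w, h)
    positions = [(0, 0), (w - crop, 0), (0, h - crop),
                 (w - crop, h - crop), ((w - crop) // 2, (h - crop) // 2)]
    left, top = random.choice(positions)
    img = F.crop(img, top, left, crop, crop)   # args: top, left, height, width
    return F.resize(img, [out_size, out_size])

It could be dropped into the existing training pipeline via transforms.Lambda(multiscale_corner_crop) before transforms.ToTensor().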

I used the augmentation on the train split, but when it is also used on the test split the accuracy is far below 80%, while using only a center crop on the test split gives almost 80%. Maybe there is some difference between TF and PyTorch.

imnotk avatar Sep 14 '18 00:09 imnotk

That's weird. You can try the two VGG-16 models I converted from the project of their paper (https://github.com/feichtenhofer/twostreamfusion); the link for the models is https://drive.google.com/file/d/1JydxdPMEHU7uJnRyi8A8uF82jSgE9FGe/view?usp=sharing.

gaosh avatar Sep 14 '18 19:09 gaosh
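
If someone wants to try the shared checkpoints, a loading sketch might look like the following. This assumes the download is a plain state_dict laid out like torchvision's VGG-16 with a 101-way classifier, which may not match the actual converted file; the filename is hypothetical.

import torch
import torch.nn as nn
from torchvision import models

model = models.vgg16()
model.classifier[6] = nn.Linear(4096, 101)  # UCF-101 has 101 classes
# hypothetical filename for the downloaded checkpoint
state = torch.load('vgg16_rgb_ucf101.pth', map_location='cpu')
model.load_state_dict(state)
model.eval()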

I have chosen the Only Testing option, but the result shows that it still trains on the data. It is so weird. Can anyone give me some tips?

sxzy avatar Oct 10 '18 10:10 sxzy

What are the results you got? You'd better open a new issue to discuss this.

gaosh avatar Oct 10 '18 20:10 gaosh

@gaosh Hello, I'm ready to add this trick you mentioned, but I am confused. This is the official docs' way to use FiveCrop:

transform = Compose([
    FiveCrop(size),  # this is a list of PIL Images
    Lambda(lambda crops: torch.stack([ToTensor()(crop) for crop in crops]))  # returns a 4D tensor
])

and I am confused, because in the code we have already done some augmentation like

      training_set = spatial_dataset(dic=self.dic_training, root_dir=self.data_path, mode='train', transform = transforms.Compose([
                transforms.RandomCrop(224),
                transforms.RandomHorizontalFlip(),
                transforms.ToTensor(),
                transforms.Normalize(mean=[0.485, 0.456, 0.406],std=[0.229, 0.224, 0.225])
                ]))

and I wonder how to add FiveCrop into it?

sxzy avatar Oct 22 '18 12:10 sxzy

I think you need to use a lambda expression, for example:

transforms.Compose([
    transforms.Resize(256),
    transforms.FiveCrop([224, 224]),
    transforms.Lambda(lambda crops: torch.stack([
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225])(transforms.ToTensor()(crop))
        for crop in crops
    ]))
])

gaosh avatar Oct 24 '18 04:10 gaosh

Thank you, you are really great.

sxzy avatar Oct 25 '18 05:10 sxzy

I also wonder how high your accuracy is in the spatial part. I have tried pretraining with VGG-16 in the spatial part, but the result is not very satisfying; I just get around 72%.

sxzy avatar Oct 25 '18 07:10 sxzy

I used a trained model from another project; I provided the link to the converted PyTorch models in a previous comment in this issue. When testing with 5-crops/center-crop, I can achieve around 82%/78% accuracy with the spatial part.

gaosh avatar Oct 26 '18 18:10 gaosh

As you mentioned, you used trained models from another project, two-stream fusion. Can you share your PyTorch code for that project? I have noticed that you shared the pretrained models, which I appreciate, but I wonder if you can also share the code. I have read the paper and the implementation is complicated for me right now, so I would appreciate it if you could share your PyTorch code for this project.

sxzy avatar Nov 06 '18 15:11 sxzy

Right now I may not have time to share my code, but after the CVPR deadline I will refine the code for this project and make it publicly available. Regarding two-stream fusion, I didn't implement their code in PyTorch; I just converted their pretrained model to PyTorch.

gaosh avatar Nov 06 '18 19:11 gaosh

OK, looking forward to the new post, and good luck with CVPR.

sxzy avatar Nov 09 '18 11:11 sxzy

How do you achieve accuracy around 80%? When I train the network, the validation loss oscillates and never really improves. What is the measurement of accuracy here? Like @sxzy, I can't even run the test-only mode. We cannot use the validation pass as a test as-is, because the parameters are updated (so it is part of training). I get these problems (both the training behavior and not being able to test only) even when I use the pretrained model and resume from it.

duygusar avatar Mar 04 '19 14:03 duygusar

@gaosh Have you also augmented the motion data? The authors do not, and I would assume it would not be wise to do so because we would lose the motion information; alas, I need to reduce overfitting.

duygusar avatar Mar 05 '19 12:03 duygusar

@duygusar The motion data is also augmented. I think the authors of several early action recognition papers suggest augmenting the motion data, since it tends to overfit if no augmentation is applied. I am quite certain that corner cropping will improve the results: with corner cropping I achieve 59.9% accuracy on HMDB-51, and without it the accuracy is around 57.3%. The result is based on a model pretrained on ImageNet.

gaosh avatar Mar 05 '19 20:03 gaosh

@gaosh Thanks. I used RandomCrop for the training motion data (and CenterCrop for the evaluation data) and then normalized the data to [0, 1]. Now I don't get crazy jumps in my validation loss, but the precision I get is 60-70% (ResNet, on the first 6 classes of UCF101, which should be much higher than UCF101 overall, and which is small but balanced enough to train without overfitting). Isn't your UCF101 accuracy (around 80%) overfitting? When I run the code as-is, I do get 80% and above (for 6 classes), but the network does not really converge, and it would be a false measure without handling the cyclical jumps in validation loss and the overfitting, no?

duygusar avatar Mar 06 '19 13:03 duygusar

@duygusar You don't have to worry too much about overfitting at the beginning. 60-70% accuracy is lower than expected, and I think it's unrelated to overfitting; just train longer and track the changes in the training loss. Also, if you use a small model like ResNet-18, the final performance will be lower than the results reported in this repo.

gaosh avatar Mar 06 '19 21:03 gaosh

@gaosh When I shuffle the evaluation set I get low accuracy; when I don't shuffle (in the repository it is not shuffled), it is around 80% but overfitting. I can tell that it overfits because the validation loss just won't go down after a while, and it definitely does not converge even with smaller learning rates. By the way, in the repository I think the test set actually refers to the evaluation set, is that correct? The evaluation set is not partitioned from the training set, right? Skimming through the code, I think "test" really refers to the evaluation set, and if you needed an actual test you would have to replace the test split with a new one (with unseen examples). I just found it peculiar and wanted to make sure I am correct about this. So I am confused about the reported accuracy, because there is no real test split. Is the accuracy in the README the validation accuracy?

duygusar avatar Mar 08 '19 10:03 duygusar

@duygusar The validation set in this code is different from the training set. I am not sure why you need to shuffle the validation set, but shuffling should not affect performance.

gaosh avatar Mar 12 '19 20:03 gaosh

@gaosh You are right, I don't need to shuffle, as it is irrelevant, but it does change the performance and I don't know why. The overfitting remains either way (validation accuracy might be high, but validation loss does not converge), and I think the reported performance might be on the validation set.

duygusar avatar Mar 13 '19 11:03 duygusar

@duygusar If the validation loss first goes down and then goes up, it may be related to overfitting. However, if the validation loss goes down and stays at a certain value, that is normal, even if the value is higher than the training loss.

gaosh avatar Mar 14 '19 18:03 gaosh

I trained the model with a pretrained ResNet-152, but I got an accuracy of only 30+%. I think it's too low, but I don't know how to improve it. I used OpenCV's open-source function to generate my flow images; could this cause the low accuracy?

DoubleYing avatar Mar 20 '19 01:03 DoubleYing

@DoubleYing Have you changed the number of classes accordingly? UCF has 101 classes; what is the number of classes for your dataset? OpenCV's flow is not great, but I think it shouldn't make a huge difference.

duygusar avatar Mar 20 '19 15:03 duygusar
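
For reference, adapting a torchvision backbone to a different number of classes, as suggested above, can be as simple as replacing the final layer. A minimal sketch, not the repo's exact code; num_classes is whatever your own label set requires.

import torch.nn as nn
from torchvision import models

num_classes = 101  # set this to your own dataset's class count
model = models.resnet152(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, num_classes)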

Yes, I have changed the number of classes, and now I'm considering changing the way I extract flow. If I get a good result later, I will note it here. Thanks for your answer.

DoubleYing avatar Mar 20 '19 15:03 DoubleYing

@DoubleYing On my dataset, which should be somewhat easy and balanced, I also get lower accuracies for motion. I also use cv2's Farneback (because it is easy and fast; I could change to a coarse-to-fine method, though I prefer a faster algorithm, and I will skip the deep-learning one they used because I have limited time before a deadline :( ). Did you manage to improve your results? @gaosh Do you have any references to your changes in the motion-cnn part (especially the motion dataloader, but if possible the VGG modifications in the network part too)? I would really appreciate it if you could point me to your changes. With 5 random crops, I have to handle a tuple of images instead of a PIL image (TypeError: pic should be PIL Image or ndarray. Got <type 'tuple'>), and I am confused about how to work around that in train/test; there are also the channels, and how to stack the five crops...

duygusar avatar Mar 24 '19 09:03 duygusar
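
Since OpenCV-based flow extraction comes up several times in this thread, below is a minimal sketch using cv2.calcOpticalFlowFarneback that writes the u/v components as 8-bit JPEGs in the frame%06d.jpg layout used by the dataloader code later in this issue. The function name is illustrative, and the +/-20 clipping bound is an assumption borrowed from common two-stream setups, not something confirmed here.

import os
import cv2
import numpy as np

def extract_farneback_flow(video_path, out_dir, bound=20.0):
    # write horizontal (u) and vertical (v) flow components as 8-bit JPEGs
    # under out_dir/u and out_dir/v, named frame000001.jpg, frame000002.jpg, ...
    os.makedirs(os.path.join(out_dir, 'u'), exist_ok=True)
    os.makedirs(os.path.join(out_dir, 'v'), exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    ok, prev = cap.read()
    if not ok:
        return
    prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
    idx = 1
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        for k, name in enumerate(('u', 'v')):
            # clip to [-bound, bound], then rescale to [0, 255] for storage
            comp = np.clip(flow[..., k], -bound, bound)
            comp = ((comp + bound) * 255.0 / (2 * bound)).astype(np.uint8)
            cv2.imwrite(os.path.join(out_dir, name, 'frame%06d.jpg' % idx), comp)
        prev_gray = gray
        idx += 1
    cap.release()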

@gaosh Using Lambda, I get an error at line 55 in stackopf, flow[2*(j),:,:] = H:
RuntimeError: expand(torch.FloatTensor{[5, 1, 224, 224]}, size=[224, 224]): the number of sizes provided (2) must be greater or equal to the number of dimensions in the tensor (4)

and when I try to set flow = torch.FloatTensor(5, 2*self.in_channel, self.img_rows, self.img_cols)

I get motion_dataloader.py, line 55, in stackopf, flow[:,2*(j),:,:] = H:
RuntimeError: expand(torch.FloatTensor{[5, 1, 224, 224]}, size=[5, 224, 224]): the number of sizes provided (3) must be greater or equal to the number of dimensions in the tensor (4)

When I multiply the returned train batch size by 5, I also get the same error.

duygusar avatar Mar 24 '19 12:03 duygusar

You also need to modify the code within motion_dataloader.py:

def stackopf(self, video_name, clip_idx, nb_clips=None):
        name = 'v_' + video_name
        u = self.flow_root_dir + 'u/' + name
        v = self.flow_root_dir + 'v/' + name

        if self.fiveCrops:
            self.ncrops = 5
        else:
            self.ncrops = 1

        # keep a leading crop dimension: 5 when FiveCrop is used, 1 otherwise
        flow = torch.FloatTensor(self.ncrops, 2 * self.in_channel, self.img_rows, self.img_cols)
        #i = int(self.clips_idx)
        i = clip_idx

        for j in range(self.in_channel):
            idx = i + j
            if self.mode == 'train':
                if idx >= nb_clips+1:
                    idx = nb_clips+1
            idx = str(idx)


            frame_idx = 'frame' + idx.zfill(6)
            h_image = u + '/' + frame_idx + '.jpg'
            v_image = v + '/' + frame_idx + '.jpg'

            imgH = (Image.open(h_image))
            imgV = (Image.open(v_image))

            H = self.flow_transform(imgH)
            V = self.flow_transform(imgV)

         
            if self.fiveCrops:
                # with FiveCrop the transform yields a (5, 1, H, W) tensor per
                # grayscale flow image, so drop the singleton channel first
                flow[:, 2 * (j - 1), :, :] = H.squeeze()
                flow[:, 2 * (j - 1) + 1, :, :] = V.squeeze()
            else:
                flow[:, 2 * (j - 1), :, :] = H
                flow[:, 2 * (j - 1) + 1, :, :] = V

            imgH.close()
            imgV.close()

        return flow.squeeze()

Please also notice that each flow batch returned by the dataloader will have size (batchsize, n_crops, n_channels, height, width). You need to reshape the batch to (n_crops*batchsize, n_channels, height, width) before the forward pass. You can check the official PyTorch reference for FiveCrop too.

gaosh avatar Mar 25 '19 14:03 gaosh
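
To close the loop, here is a minimal sketch of the test-time reshape gaosh describes, assuming a trained model and a test loader yielding (data, label) batches of shape (batchsize, n_crops, n_channels, height, width); the names model and test_loader are illustrative, not the repo's.

import torch

model.eval()
with torch.no_grad():
    for data, label in test_loader:
        bs, ncrops, c, h, w = data.size()
        output = model(data.view(-1, c, h, w))        # fold the crops into the batch
        output = output.view(bs, ncrops, -1).mean(1)  # average predictions over crops
        pred = output.argmax(dim=1)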