two-stream-action-recognition
Improvement on motion-cnn result: 84.1% on split-1, with VGG-16
Hi, all
I did some investigation into why the motion-cnn result is much lower than in the original paper. After a simple modification, I am able to achieve 84.1% top-1 accuracy. The modification is adding transforms.FiveCrop() to the transformation. Before this modification, the result was only 80.5%. I use the pretrained model from https://github.com/feichtenhofer/twostreamfusion, and I think further improvement can be achieved with transforms.TenCrop().
I think this modification can bridge the performance gap between the two-stream model trained in PyTorch and the same model trained in other frameworks.
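(For reference, here is a minimal sketch of what the five-crop evaluation could look like; model, frame and predict_frame are placeholder names for illustration, not code from this repo. The class scores are averaged over the five crops of each sampled frame.)

import torch
from torchvision import transforms

# Hypothetical test-time transform: FiveCrop returns a tuple of 5 PIL Images,
# so ToTensor/Normalize are applied per crop inside a Lambda.
five_crop_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.FiveCrop(224),
    transforms.Lambda(lambda crops: torch.stack([
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225])(transforms.ToTensor()(crop))
        for crop in crops
    ])),  # -> tensor of shape (5, 3, 224, 224)
])

def predict_frame(model, frame):
    # Average the class scores over the five crops of one RGB frame.
    crops = five_crop_transform(frame)    # (5, 3, 224, 224)
    with torch.no_grad():
        scores = model(crops)             # (5, num_classes)
    return scores.mean(dim=0)             # (num_classes,)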
I have a problem with the accuracy. When I only use a center crop (224, 224) and sample 25 frames, I can get about 80% on the RGB modality, but when I use five-crop or ten-crop my accuracy decreases a lot, whatever CNN I use (ResNet, Inception-v1, Inception-v2). Can you explain why?
You should use this data augmentation during training to get the desired results.
You can refer to the related paper; during training, extensive data augmentation is used, such as multi-scale and corner crops. The author of this project only used very simple data augmentation.
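(As a rough illustration only, and not the exact scheme from the paper: multi-scale corner cropping can be sketched like this. MultiScaleCornerCrop is a hypothetical helper written for this example, not part of torchvision or this repo.)

import random
from torchvision import transforms
import torchvision.transforms.functional as TF

class MultiScaleCornerCrop(object):
    # Hypothetical sketch: crop a randomly scaled square from a random corner
    # (or the center) of the frame and resize it to the network input size.
    def __init__(self, size=224, scales=(1.0, 0.875, 0.75, 0.66)):
        self.size = size
        self.scales = scales

    def __call__(self, img):
        w, h = img.size
        crop = int(min(w, h) * random.choice(self.scales))
        corner = random.choice(['tl', 'tr', 'bl', 'br', 'center'])
        left = {'tl': 0, 'bl': 0, 'tr': w - crop, 'br': w - crop,
                'center': (w - crop) // 2}[corner]
        top = {'tl': 0, 'tr': 0, 'bl': h - crop, 'br': h - crop,
               'center': (h - crop) // 2}[corner]
        img = TF.crop(img, top, left, crop, crop)
        return TF.resize(img, [self.size, self.size])

train_transform = transforms.Compose([
    MultiScaleCornerCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])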
I use the augmentation on the train split, but when it is also used for the test split the accuracy is far below 80%, while using only a center crop for the test split gives almost 80%. Maybe there is some difference between TF and PyTorch.
That's weird. You can try the two models I converted from the project of their paper https://github.com/feichtenhofer/twostreamfusion; the link for the models is https://drive.google.com/file/d/1JydxdPMEHU7uJnRyi8A8uF82jSgE9FGe/view?usp=sharing. They are VGG-16 models.
I have chosen the Only Testing option, but the result shows that it still trains on the data. It is so weird. Can anyone give me some tips?
What results did you get? You'd better open a new issue to discuss this.
@gaosh hello, I am ready to add the trick you mentioned, but I am confused. This is the official docs' way to use FiveCrop:
transform = Compose([
    FiveCrop(size),  # this is a tuple of 5 PIL Images
    Lambda(lambda crops: torch.stack([ToTensor()(crop) for crop in crops]))  # returns a 4D tensor
])
and I am confused because in the code we already do some augmentation like:
training_set = spatial_dataset(dic=self.dic_training, root_dir=self.data_path, mode='train',
                               transform=transforms.Compose([
                                   transforms.RandomCrop(224),
                                   transforms.RandomHorizontalFlip(),
                                   transforms.ToTensor(),
                                   transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
                               ]))
and I wonder how I can add FiveCrop to it?
I think you need to use a lambda expression, for example:
transforms.Compose([
    transforms.Resize(256),
    transforms.FiveCrop([224, 224]),
    # FiveCrop returns a tuple of PIL Images, so convert each crop to a tensor
    # before normalizing, and stack the results into a (5, C, H, W) tensor.
    transforms.Lambda(lambda crops: torch.stack([
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225])(transforms.ToTensor()(crop))
        for crop in crops]))
])
Thank you, you are really great.
And I also wonder how high your accuracy is on the spatial part. I have tried pretraining with VGG-16 on the spatial part, but the result is not very satisfying: just around 72%.
I used a trained model from another project; I provided the link to the converted PyTorch model in a previous comment in this issue. When testing with five crops / a center crop, I can achieve around 82% / 78% accuracy with the spatial part.
As you have mentioned, you trained from another project, namely two-stream fusion. Can you share your code for that project in PyTorch? I have noticed that you shared the pretrained model, which is appreciated, and I wonder if you can also share the code for the two-stream fusion project. I have read the paper and the implementation is complicated for me right now, so I would appreciate it if you could share your PyTorch code for that project.
Right now, I may not have time to share my code, but after the CVPR deadline I will refine the code for this project and make it publicly available. Regarding two-stream fusion, I didn't implement their code in PyTorch; I just converted their pretrained model to PyTorch.
OK, looking forward to the new post, and good luck with CVPR.
How do you achieve accuracy around 80%? When I train the network, the validation loss oscillates and never really improves. What is the measure of accuracy here? And like @sxzy, I can't even run test-only mode. We cannot use the validation set as-is to pass as a test set, because the parameters are updated (so it is part of training). I get these problems (both in training and in not being able to run test-only) even when I use the pretrained model and resume from it.
@gaosh have you also augmented the motion data? The authors do not, and I would assume it would not be wise to do so because we would lose the motion information; alas, I need to reduce overfitting.
@duygusar the motion data is also augmented. I think the authors of several early action recognition papers suggest augmenting the motion data, since these models tend to overfit if no augmentation is applied. I am quite certain that corner cropping will improve the results: I use corner cropping and achieve 59.9% accuracy on HMDB-51; without it, it's around 57.3%. The result is based on a model pretrained on ImageNet.
@gaosh Thanks. I used RandomCrop for training the motion data (and CenterCrop for the evaluation data) and then normalized the data to [0, 1]. Now I don't get crazy jumps in my validation loss, but the precision I get is 60-70% (ResNet, for the first 6 classes of UCF101, which should be much higher than UCF101 overall, and it is small but balanced enough to train without overfitting). Isn't your UCF101 accuracy (around 80%) overfitting? When I run the code as-is, I do get 80% and above (for 6 classes), but the network does not really converge, and it would be a false measure without handling the cyclical jumps in validation loss and the overfitting, no?
@duygusar you don't have to worry too much about overfitting at the beginning. 60-70% accuracy is lower than expected, and I think it's unrelated to overfitting; just train longer and track the changes in the training loss. Also, if you use a small model like ResNet-18, the final performance will be lower than the results reported in this repo.
@gaosh when I shuffle the evaluation set I get low accuracy; it is around 80% but overfitting when I don't shuffle (in the repository it is not shuffled). And I can tell that it overfits because the validation loss just won't go down after a while, and it definitely does not converge even with smaller learning rates. By the way, in the repository I think the test set actually refers to the evaluation set, is this correct? The evaluation set is not partitioned from the training set, right? Skimming through the code, I think "test" actually refers to the evaluation set, and if you wanted an actual test you would need to replace the test split with a new one (with unseen examples). I just found it peculiar and wanted to make sure I am correct about this. So I am confused about the reported accuracy, because they don't provide a real test split. Is the accuracy in the README the validation accuracy?
@duygusar The validation set in this code is different from the training set. I am not sure why you need to shuffle the validation set, but shuffling should not affect performance.
@gaosh You are right, I don't need to shuffle as it is irrelevant, but it does change the performance and I don't know why. The overfitting remains either way (the validation accuracy might be high, but the validation loss does not converge), and I think the reported performance might be on the validation set.
@duygusar If the validation loss first goes down and then goes up, it may be related to overfitting. However, if the validation loss goes down and stays at a certain value, that is common, even if the value is higher than the training loss.
I train the model with pretrained ResNet-152, but I get an accuracy of only 30+%. I think it's too low, but I don't know how to improve it. I use the open-source function of OpenCV to get my flow images; might this cause the low accuracy?
@DoubleYing Have you changed the number of classes accordingly? UCF has 101 classes; what is the number of classes for your dataset? OpenCV's flow is not great, but I think it shouldn't make a huge difference.
Yes, I have changed the number of classes, and now I'm considering changing the way I extract flow. If I get a good result later, I will note it here. Thanks for your answer.
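(For reference, a minimal sketch of extracting u/v flow images with OpenCV's Farneback function; the file layout (u/ and v/ subfolders, frameXXXXXX.jpg names) and the clipping of the flow to [-20, 20] before mapping to [0, 255] are assumptions for illustration, not the exact preprocessing used by the original authors.)

import os
import cv2
import numpy as np

def save_farneback_flow(video_path, out_dir):
    # Sketch: write horizontal (u) and vertical (v) flow components as
    # grayscale JPEGs, one pair per consecutive frame pair.
    for name in ('u', 'v'):
        os.makedirs(os.path.join(out_dir, name), exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    ok, prev = cap.read()
    prev = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
    idx = 1
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        flow = cv2.calcOpticalFlowFarneback(prev, gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        for c, name in enumerate(('u', 'v')):
            comp = np.clip(flow[..., c], -20, 20)            # clip large motions (assumed bound)
            comp = ((comp + 20) / 40.0 * 255).astype(np.uint8)
            cv2.imwrite(os.path.join(out_dir, name, 'frame%06d.jpg' % idx), comp)
        prev = gray
        idx += 1
    cap.release()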
@DoubleYing On my dataset, which should be somewhat easy and balanced, I also get lower accuracies for motion. I also use cv2's Farneback (because it is easy and fast; I could switch to a coarse-to-fine method, though I prefer a faster algorithm, and I will just skip the deep-learning one they used because I have limited time before a deadline :( ). Did you manage to improve your results? @gaosh do you have any references to your changes on the motion-cnn part (especially the motion dataloader, but if possible the VGG modifications on the network part too)? I would really appreciate it if you could refer me to your changes. With 5 random crops, I have to handle a tuple of images instead of a PIL Image (TypeError: pic should be PIL Image or ndarray. Got <type 'tuple'>), and I am kind of confused about how to handle that in the train/test code; there is also the channel dimension: how do I stack the five crops?
@gaosh Using Lambda, I get this error at line 55, in stackopf:
flow[2*(j),:,:] = H
RuntimeError: expand(torch.FloatTensor{[5, 1, 224, 224]}, size=[224, 224]): the number of sizes provided (2) must be greater or equal to the number of dimensions in the tensor (4)
and when I try to set flow = torch.FloatTensor(5, 2*self.in_channel,self.img_rows,self.img_cols)
I get:
motion_dataloader.py", line 55, in stackopf
flow[:,2*(j),:,:] = H
RuntimeError: expand(torch.FloatTensor{[5, 1, 224, 224]}, size=[5, 224, 224]): the number of sizes provided (3) must be greater or equal to the number of dimensions in the tensor (4)
When I multiply the returned train batch size by 5, I also get the same error.
You also need to modify the code within motion_dataloader.py.
def stackopf(self, video_name, clip_idx, nb_clips=None):
    name = 'v_' + video_name
    u = self.flow_root_dir + 'u/' + name
    v = self.flow_root_dir + 'v/' + name

    # With FiveCrop each transformed flow image carries an extra crop dimension,
    # so the stacked flow tensor also needs a leading n_crops dimension.
    if self.fiveCrops:
        self.ncrops = 5
    else:
        self.ncrops = 1
    flow = torch.FloatTensor(self.ncrops, 2 * self.in_channel, self.img_rows, self.img_cols)

    #i = int(self.clips_idx)
    i = clip_idx
    for j in range(self.in_channel):
        idx = i + j
        if self.mode == 'train':
            if idx >= nb_clips + 1:
                idx = nb_clips + 1
        idx = str(idx)
        frame_idx = 'frame' + idx.zfill(6)
        h_image = u + '/' + frame_idx + '.jpg'
        v_image = v + '/' + frame_idx + '.jpg'
        imgH = (Image.open(h_image))
        imgV = (Image.open(v_image))

        H = self.flow_transform(imgH)
        V = self.flow_transform(imgV)
        if self.fiveCrops:
            # The transform returns (5, 1, H, W); drop the channel dimension so it
            # fits into the (ncrops, 2*in_channel, H, W) flow tensor.
            flow[:, 2 * (j - 1), :, :] = H.squeeze()
            flow[:, 2 * (j - 1) + 1, :, :] = V.squeeze()
        else:
            flow[:, 2 * (j - 1), :, :] = H
            flow[:, 2 * (j - 1) + 1, :, :] = V
        imgH.close()
        imgV.close()
    # In the center-crop case, squeeze removes the dummy crop dimension.
    return flow.squeeze()
Please also notice that the batch returned from the dataloader will have size (batchsize, n_crops, n_channels, height, width). You need to reshape the batch to (batchsize * n_crops, n_channels, height, width). You can check the official FiveCrop reference too.
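(To make the last point concrete, a small hedged sketch of the reshape inside the test loop; data, model and the other variable names are placeholders, following the official FiveCrop example.)

# data comes out of the DataLoader with shape (batchsize, n_crops, n_channels, height, width)
bs, ncrops, c, h, w = data.size()
output = model(data.view(-1, c, h, w))             # fold the crops into the batch: (bs*ncrops, num_classes)
output = output.view(bs, ncrops, -1).mean(dim=1)   # average the scores over the crops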