pytorch-vsumm-reinforce
Reproduce features on TVSum and SumMe
Hi,
I am trying to compute the features of each frame in the video (on SumMe and TVSum). My features, when indexed per 15 frames, match the dimensions of the features provided here, but the values are different. I searched both the code and the other issues, and I found here that you mention `preprocess(frame)`, but not your exact steps. I guess that is where we differ.
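For concreteness, this is what I mean by "indexed per 15 frames" (a minimal sketch; the frame count is made up):

```python
import numpy as np

# For a ~30 fps source video, keeping every 15th frame gives roughly 2 fps.
n_frames = 4500                     # hypothetical total frame count
picks = np.arange(0, n_frames, 15)  # indices 0, 15, 30, ... of kept frames
```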
My preprocessing steps are:

- Load the video with shape `[frames, channels, height, width]`, with `desired_fps=2` and `desired_size=(224, 224)`.
- Then apply this transformation:

```python
transform = transforms.Compose([
    transforms.Lambda(lambda x: x / 255),  # [0, 255] -> [0, 1]
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])])
```

- Finally, after the forward pass through GoogLeNet, divide each feature vector by its norm to get a unit feature vector, as sketched below.
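A minimal sketch of that last normalization step, assuming the features have been stacked into a `(num_frames, 1024)` NumPy array (the variable names are mine):

```python
import numpy as np

# feats: hypothetical (num_frames, 1024) array of GoogLeNet features.
feats = np.random.rand(300, 1024).astype(np.float32)

# L2-normalize each row so every frame's feature vector has unit norm.
feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
```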
Hi. I'm also trying to reproduce the features for the vsumm videos, but have failed so far. All of my processing steps are:
```python
import cv2
import torch
from PIL import Image
from torchvision import transforms
from torchvision.models import googlenet

device = torch.device('cuda:0')
model = googlenet(pretrained=True)
# Drop the final dropout and fc layers so the output of the global average
# pool (a 1024-d vector per image) becomes the feature.
extractor = torch.nn.Sequential(*list(model.children())[:-2]).to(device)
extractor.eval()  # important: use BN running stats, not per-batch stats

preprocess = transforms.Compose([  # https://pytorch.org/hub/pytorch_vision_googlenet/
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225])])

# `frame` is one BGR frame as read by OpenCV, e.g. from cv2.VideoCapture.
im = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)  # BGR to RGB
im = Image.fromarray(im)                     # cv2 ndarray to PIL image
im = preprocess(im)
im = im.unsqueeze(0).to(device)              # shape: (1, 3, 224, 224)
with torch.no_grad():
    feature = extractor(im).cpu().numpy().flatten()  # (1, 1024, 1, 1) -> (1024,)
```
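For completeness, here is a hedged sketch of how I wrap that into a per-video loop, reusing the `extractor`, `preprocess`, and `device` objects above; the every-15th-frame sampling and the function name are my own choices, not the repo's:

```python
import cv2
import numpy as np
import torch
from PIL import Image

def extract_video_features(path, sample_rate=15):
    """Extract a 1024-d GoogLeNet feature for every `sample_rate`-th frame."""
    cap = cv2.VideoCapture(path)
    feats, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % sample_rate == 0:
            im = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            x = preprocess(im).unsqueeze(0).to(device)
            with torch.no_grad():
                feats.append(extractor(x).cpu().numpy().flatten())
        idx += 1
    cap.release()
    return np.stack(feats)  # (num_sampled_frames, 1024)
```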