iPerceive
Some doubts about your code
Your research is nice, but are you sure your code is runnable? I have tried to run your iPerceiveDVC code and there are lots of low-level bugs. Please explain them. Thanks. For example, in `EncoderLayer`, `self.self_att` is commented out in `__init__` but is still called in `forward`:
```python
class EncoderLayer(nn.Module):
    def __init__(self, d_model, dout_p, H, d_ff):
        super(EncoderLayer, self).__init__()
        self.res_layers = clone(ResidualConnection(d_model, dout_p), 2)
        # Discard encoder's self-multiheaded attention module in place for common-sense features
        # self.self_att = MultiheadedAttention(d_model, H)
        self.feed_forward = PositionwiseFeedForward(d_model, d_ff)

    def forward(self, x, src_mask):  # x - (B, seq_len, d_model); src_mask - (B, 1, S)
        # sublayer should be a function which inputs x and outputs transformation
        # thus, lambda is used instead of just `self.self_att(x, x, x)` which outputs
        # the output of the self attention
        sublayer0 = lambda x: self.self_att(x, x, x, src_mask)
        sublayer1 = self.feed_forward
        x = self.res_layers[0](x, sublayer0)
        x = self.res_layers[1](x, sublayer1)
        return x  # x - (B, seq_len, d_model)
```
Another example is this loss `forward`, whose return line adds `self.bce_loss` (a module) to a tensor:

```python
def forward(self, pred, target):  # pred (B, S, V), target (B, S)
    # Note: preds are expected to be after log
    B, S, V = pred.shape
    # (B, S, V) -> (B * S, V); (B, S) -> (B * S)
    pred = pred.contiguous().view(-1, V)
    target = target.contiguous().view(-1)

    dist = self.smoothing * torch.ones_like(pred) / (V - 2)
    # add smoothed ground-truth to prior (args: dim, index, src (value))
    dist.scatter_(1, target.unsqueeze(-1).long(), 1 - self.smoothing)
    # make the padding token to have zero probability
    dist[:, self.pad_idx] = 0
    # ?? mask: 1 if target == pad_idx; 0 otherwise
    mask = torch.nonzero(target == self.pad_idx)
    if mask.sum() > 0 and len(mask) > 0:
        # dim, index, val
        dist.index_fill_(0, mask.squeeze(), 0)
    # return F.kl_div(pred, dist, reduction='sum')
    return F.kl_div(pred, dist, reduction='sum') + self.bce_loss + self.reg_loss + self.l2_loss
```
I've uncommented the line that defines `self_att`. I'm not sure what you're trying to highlight with the latter part of the code snippet, though.
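For reference, a minimal sketch of that change (assuming `MultiheadedAttention(d_model, H)` and the other classes exactly as they appear in the snippet above):

```python
class EncoderLayer(nn.Module):
    def __init__(self, d_model, dout_p, H, d_ff):
        super(EncoderLayer, self).__init__()
        self.res_layers = clone(ResidualConnection(d_model, dout_p), 2)
        # uncommented so that forward() can call self.self_att
        self.self_att = MultiheadedAttention(d_model, H)
        self.feed_forward = PositionwiseFeedForward(d_model, d_ff)
```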
The following code block causes the error "unsupported operand type(s) for +: 'Tensor' and 'BCEWithLogitsLoss'". The line `return F.kl_div(pred, dist, reduction='sum') + self.bce_loss + self.reg_loss + self.l2_loss` seems to be the problem:

```python
def forward(self, pred, target):  # pred (B, S, V), target (B, S)
    B, S, V = pred.shape
    pred = pred.contiguous().view(-1, V)
    target = target.contiguous().view(-1)

    dist = self.smoothing * torch.ones_like(pred) / (V - 2)
    # add smoothed ground-truth to prior (args: dim, index, src (value))
    dist.scatter_(1, target.unsqueeze(-1).long(), 1 - self.smoothing)
    # make the padding token to have zero probability
    dist[:, self.pad_idx] = 0
    # ?? mask: 1 if target == pad_idx; 0 otherwise
    mask = torch.nonzero(target == self.pad_idx)
    if mask.sum() > 0 and len(mask) > 0:
        # dim, index, val
        dist.index_fill_(0, mask.squeeze(), 0)
    # return F.kl_div(pred, dist, reduction='sum')
    return F.kl_div(pred, dist, reduction='sum') + self.bce_loss + self.reg_loss + self.l2_loss
```
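A minimal, self-contained sketch of why the `+` fails and the usual fix, assuming `self.bce_loss` was assigned `nn.BCEWithLogitsLoss()` (a module) rather than a computed loss value; `logits` and `labels` below are hypothetical placeholders for whatever inputs the extra loss term was meant to receive:

```python
import torch
import torch.nn as nn

kl = torch.tensor(1.0)               # stands in for F.kl_div(pred, dist, reduction='sum')
bce_loss = nn.BCEWithLogitsLoss()    # a Module, not a Tensor

# kl + bce_loss                      # TypeError: unsupported operand type(s) for +

# The module must be *called* on logits/targets to produce a loss tensor first:
logits = torch.randn(4)
labels = torch.randint(0, 2, (4,)).float()
total = kl + bce_loss(logits, labels)  # Tensor + Tensor works
print(total)
```

The same applies to `self.reg_loss` and `self.l2_loss`: they must already be tensors (or scalars) at the point of the addition, not uncalled loss modules.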
Hi @Linxxx, have you used csfeatures? I want to ask some questions about it.
@BNU-Wu Yes, I was trying to use csfeatures but I failed; some bugs occurred.
Have you run into these problems? Regarding csfeatures, I have two questions for you. Question 1: I ran your iPerceive code and found that each frame yields N 1024-d features (where N is the number of detected objects). How did you merge them into a single 1024-d feature to represent the frame? Question 2: csfeatures is combined with video_stack_rgb through the hstack function, but running the code raises an error: the merged video_stack_rgb is 2048-d while video_stack_flow is 1024-d, so they cannot be merged. How did you solve this problem? If you can provide the csfeatures file or have a solution, I will be very grateful for your help.
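A minimal numpy sketch of the mismatch in Question 2, with hypothetical shapes (T frames, 1024-d I3D and common-sense features):

```python
import numpy as np

T = 10                                       # hypothetical number of time steps
video_stack_rgb = np.random.rand(T, 1024)    # I3D RGB features
video_stack_flow = np.random.rand(T, 1024)   # I3D flow features
csfeatures = np.random.rand(T, 1024)         # common-sense features

rgb_cs = np.hstack([video_stack_rgb, csfeatures])  # (T, 2048)
# Any later step that expects the rgb and flow stacks to have the same width now fails:
# np.stack([rgb_cs, video_stack_flow])  # ValueError: all input arrays must have the same shape
print(rgb_cs.shape, video_stack_flow.shape)  # (10, 2048) (10, 1024)
```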
@BNU-Wu I ran into the second problem. The feature cannot be fused in, and the direct addition in the loss part is problematic: `return F.kl_div(pred, dist, reduction='sum') + self.bce_loss + self.reg_loss + self.l2_loss`
About the first question: after you ran iPerceive, did each frame also have multiple object features? How did you fuse them into a single 1024-d feature? As for the second question, after fusing the CS features and the I3D features via hstack, the result is already 2048-d; how did you then fuse it with video_stack_flow?
@BNU-Wu, for the first part I added the multiple object features together to get a single feature representation, but I didn't get good results with that. I don't know whether this is the right approach; I also need an answer to this question.
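A sketch of that pooling, assuming the per-frame detections come back as an (N, 1024) array (the function name is hypothetical):

```python
import numpy as np

def pool_object_features(obj_feats: np.ndarray) -> np.ndarray:
    """Collapse (N, 1024) per-object features into a single 1024-d frame feature.

    Summing is what I tried; mean-pooling is an alternative that keeps the
    scale independent of the number of detected objects N.
    """
    return obj_feats.sum(axis=0)  # or obj_feats.mean(axis=0)

frame_feature = pool_object_features(np.random.rand(5, 1024))
print(frame_feature.shape)  # (1024,)
```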
Hi @siyamsajeebkhan, I took the same approach as you, and the results were also not good. I don't know what the problem is.