About the modifications related to the stereo matching task
I'm curious to try GMFlow for the stereo matching task. I noticed the closed issue #13, where you suggested replacing the 2D cross-attention in the Transformer with 1D cross-attention and the 2D global matching with 1D global matching.
Using the Stereo-RAFT 1D correlation as a reference, I somehow managed to implement 1D global matching, but I'm not at all sure whether it is correct (I didn't manage to include the pred_bidir_flow parameter).
import torch
import torch.nn.functional as F


def coords_grid(b, h, w):
    y, x = torch.meshgrid(torch.arange(h), torch.arange(w))  # [H, W]
    grid = torch.stack([x, y], dim=0).float()  # [2, H, W]
    grid = grid[None].repeat(b, 1, 1, 1)  # [B, 2, H, W]
    return grid


def global_correlation_softmax_1d(feature0, feature1):
    # global correlation along the horizontal direction only
    b, c, h, w = feature0.shape
    feature0 = feature0.permute(0, 2, 3, 1)  # [B, H, W, C]
    feature1 = feature1.permute(0, 2, 1, 3)  # [B, H, C, W]
    corr = torch.matmul(feature0, feature1) / (c ** 0.5)  # [B, H, W, W]

    # flow from softmax
    init_grid = coords_grid(b, h, w).to(corr.device)  # [B, 2, H, W]
    grid = init_grid.permute(0, 2, 3, 1)  # [B, H, W, 2]
    prob = F.softmax(corr, dim=-1)  # [B, H, W, W]
    correspondence = torch.matmul(prob, grid).permute(0, 3, 1, 2)  # [B, 2, H, W]
    flow = correspondence - init_grid

    return flow
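A quick sanity check I did on the function above with random tensors (just a sketch; the shapes and the disparity sign convention are my own assumptions, not from the GMFlow code):

# Shape check for global_correlation_softmax_1d above, with random 1/8-resolution features
feat0 = torch.randn(2, 128, 48, 64)  # left features  [B, C, H, W]
feat1 = torch.randn(2, 128, 48, 64)  # right features [B, C, H, W]
flow = global_correlation_softmax_1d(feat0, feat1)
print(flow.shape)              # torch.Size([2, 2, 48, 64])
print(flow[:, 1].abs().max())  # ~0: the y-channel stays zero because matching is per-row
# Assuming left = feature0 and right = feature1 on a rectified pair, a left pixel at x
# matches x - d on the right, so the x-flow equals -disparity:
disparity = (-flow[:, 0]).clamp(min=0)  # [B, H, W]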
For replacing the 2D cross-attention with a 1D variant, I don't know exactly which function I should modify. I assume the changes should be made in the single_head_split_window_attention() function, but I have no idea how.
I also noticed that the results of GMStereo have been added to the Middlebury stereo evaluation. Are you planning to release the code for that project (this would automatically solve my issues with the modifications)?
I may have found a solution in your code for the flow1d project. For the 1D correlation it should be straightforward with this, and for the cross-attention with this. I just don't understand why I should modify only the cross-attention and not the self-attention, since they both use the single_head_split_window_attention() function?
Maybe I should just modify FeatureFlowAttention instead of single_head_split_window_attention. After the modifications it would look like the following:
import torch.nn as nn


class FeatureFlowAttention1D(nn.Module):
    """
    Flow propagation with self-attention on feature
    query: feature0, key: feature0, value: flow
    """

    def __init__(self, in_channels, **kwargs):
        super(FeatureFlowAttention1D, self).__init__()

        self.q_proj = nn.Linear(in_channels, in_channels)
        self.k_proj = nn.Linear(in_channels, in_channels)

        for p in self.parameters():
            if p.dim() > 1:
                nn.init.xavier_uniform_(p)

    def forward(self, feature0, flow):
        # q, k: feature [B, C, H, W], v: flow [B, 2, H, W]
        b, c, h, w = feature0.size()

        query = feature0.view(b, c, h * w).permute(0, 2, 1)  # [B, H*W, C]
        query = self.q_proj(query).view(b, h, w, c)  # [B, H, W, C]
        key = self.k_proj(query).view(b, h, w, c).permute(0, 1, 3, 2)  # [B, H, C, W]
        value = flow.view(b, flow.size(1), h, w).permute(0, 2, 3, 1)  # [B, H, W, 2]

        scores = torch.matmul(query, key) / (c ** 0.5)  # [B, H, W, W]
        prob = torch.softmax(scores, dim=-1)

        out = torch.matmul(prob, value)  # [B, H, W, 2]
        out = out.permute(0, 3, 1, 2)  # [B, 2, H, W]

        return out
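And a corresponding shape check for the module above (again just a sketch with random inputs):

# Shape check for FeatureFlowAttention1D above (random inputs, sketch only)
prop = FeatureFlowAttention1D(in_channels=128)
feature = torch.randn(1, 128, 48, 64)  # [B, C, H, W]
flow = torch.randn(1, 2, 48, 64)       # [B, 2, H, W]
out = prop(feature, flow)
print(out.shape)  # torch.Size([1, 2, 48, 64]), flow propagated along each row only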
Can you confirm that the code block above and the one in the initial message are now OK for the stereo matching task?
For stereo matching, we should modify the cross-attention to 1D because the cross-attention models cross-view interactions via cross-view similarities. It is thus redundant to perform 2D cross-attention, because for the stereo matching task the corresponding pixels only lie along the 1D horizontal direction.
You shouldn't modify the FeatureFlowAttention to 1D, since the propagation is based on self-similarity, which is 2D.
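In other words, roughly speaking (this is only a sketch of the idea, not the actual implementation): the cross-attention only needs to attend within each row, while the self-similarity-based propagation still attends over the full 2D image.

import torch
import torch.nn.functional as F

def cross_attention_1d(q, k, v):
    # q, k, v: [B, H, W, C]; each pixel attends to the pixels in the same row of the other view
    b, h, w, c = q.shape
    scores = torch.matmul(q, k.transpose(-2, -1)) / c ** 0.5  # [B, H, W, W]
    return torch.matmul(F.softmax(scores, dim=-1), v)  # [B, H, W, C]

def self_propagation_2d(feature_q, feature_k, flow_v):
    # feature_q, feature_k: [B, H, W, C]; flow_v: [B, H, W, 2]
    # each pixel gathers flow from all H*W pixels, weighted by feature self-similarity
    b, h, w, c = feature_q.shape
    q = feature_q.reshape(b, h * w, c)
    k = feature_k.reshape(b, h * w, c)
    v = flow_v.reshape(b, h * w, 2)
    scores = torch.matmul(q, k.transpose(-2, -1)) / c ** 0.5  # [B, H*W, H*W]
    return torch.matmul(F.softmax(scores, dim=-1), v).view(b, h, w, 2)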
Thanks for the reply! I'm still a little confused about which functions I should modify. It's clear that at least global_correlation_softmax() needs to change, but I'm not sure whether that is enough. I do understand that in stereo matching we are estimating the x-directional pixel displacement (disparity), so there is no need for 2D cross-attention. In fact, I'm just not sure where the cross-attention actually happens.
Sorry, I'm new to Transformers and don't yet fully understand the idea of attention. I will read your paper again, and maybe after that I will have a better picture of the model and some follow-up questions.
So if I have understood correctly, in TransformerBlock we first run self-attention, then cross-attention, and finally the FFN. Both self-attention and cross-attention call the TransformerLayer class; the difference is that self-attention takes (source, source) as input and cross-attention takes (source, target). So we only need to modify the cross-attention from 2D to 1D, not the self-attention? Inside TransformerLayer, the query, key and value go to single_head_split_window_attention(), and in my opinion that is the only place where the cross-attention can be edited.
Maybe the with_shift parameter should always be False in all TransformerBlocks, or do you think it is possible to implement stereo matching if it is True? Or maybe it's better to just use the single_head_full_attention function?
Small update: I trained the network for the stereo matching task on Scene Flow, and it is producing a weird-looking checkerboard artifact.

Do you have any idea where this could come from? I attached my modifications to the original code below (FeatureFlowAttention is now kept in 2D).
def global_correlation_softmax(feature0, feature1):
    b, c, h, w = feature0.shape
    feature0 = feature0.permute(0, 2, 3, 1)  # [B, H, W, C]
    feature1 = feature1.permute(0, 2, 1, 3)  # [B, H, C, W]
    corr = torch.matmul(feature0, feature1) / (c ** 0.5)  # [B, H, W, W]
    corr = corr.view(b, h * w, w)  # [B, H*W, W]

    init_grid = coords_grid(b, h, w).to(corr.device)  # [B, 2, H, W]
    grid = init_grid.permute(0, 2, 3, 1)  # [B, H, W, 2]

    prob = F.softmax(corr, dim=-1)  # [B, H*W, W]
    prob = prob.view(b, h, w, w)  # [B, H, W, W]

    correspondence = torch.matmul(prob, grid).permute(0, 3, 1, 2)  # [B, 2, H, W]
    flow = correspondence - init_grid  # [B, 2, H, W]

    return flow


def single_head_full_attention(q, k, v, h=None, w=None):
    # full 1D attention along each row; q, k, v: [B, H*W, C]
    b, l, c = q.shape
    q = q.view(b, h, w, c)  # [B, H, W, C]
    k = k.view(b, h, w, c).permute(0, 1, 3, 2)  # [B, H, C, W]
    v = v.view(b, h, w, c)  # [B, H, W, C]

    scores = torch.matmul(q, k) / (c ** 0.5)  # [B, H, W, W]
    scores = scores.view(b, h * w, w)  # [B, H*W, W]
    attn = F.softmax(scores, dim=-1)  # [B, H*W, W]
    attn = attn.view(b, h, w, w)  # [B, H, W, W]

    out = torch.matmul(attn, v)  # [B, H, W, C]
    out = out.view(b, h * w, c)  # [B, H*W, C]

    return out

# Note: I'm now using single_head_full_attention instead of the Swin Transformer.
# Note 2: single_head_full_attention is used for both self-attention and cross-attention.
Yes, the difference between self-attention and cross-attention is that they take different inputs (self: source, source; cross: source, target).
For stereo matching, yes, you can use full 1D attention at 1/8 resolution since the computational cost is usually acceptable.
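As a rough back-of-the-envelope example (assuming a 960x540 Scene Flow image, which gives roughly 120x68 tokens at 1/8 resolution): a full 2D attention matrix has (120*68)^2 ≈ 6.7e7 entries, while row-wise 1D attention only needs 68*120^2 ≈ 9.8e5, i.e. roughly 68x fewer.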
I guess you are using the Scene Flow pretrained model to do prediction on the Middlebury dataset, so there might be some generalization issues. I would suggest checking the results on the Scene Flow validation set first to verify whether the model is implemented and trained properly.
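If it helps, a minimal check along these lines should be enough (model, val_loader and the returned disparity format are placeholders for your own training code, not part of the GMFlow repo):

import torch

@torch.no_grad()
def validate_epe(model, val_loader, device='cuda'):
    # Average end-point error (EPE) over a Scene Flow validation split
    model.eval()
    total_epe, count = 0.0, 0
    for left, right, disp_gt in val_loader:  # disp_gt: [B, H, W]
        pred = model(left.to(device), right.to(device))  # assumed to return [B, H, W] disparity
        valid = disp_gt > 0  # skip invalid ground-truth pixels
        epe = (pred.cpu() - disp_gt).abs()[valid].mean()
        total_epe += epe.item() * left.size(0)
        count += left.size(0)
    return total_epe / count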
Thanks for the answer! I will validate on the Scene Flow dataset and let you know if the unwanted behavior remains. In my opinion, the issue can be closed.
Hi @tholmb, I have released the details and code on how to extend GMFlow to stereo. You can find the info at https://haofeixu.github.io/unimatch/ and https://github.com/autonomousvision/unimatch. Thanks.