
About the modifications related to stereo matching task

Open tholmb opened this issue 3 years ago • 2 comments

I'm curious to use GMFlow for the stereo matching task. I noticed the closed issue #13 where you suggested replacing the 2D cross-attention in the Transformer with 1D cross-attention and the 2D global matching with 1D global matching.

Using Stereo-RAFT's 1D correlation as a model, I did manage to implement 1D global matching somehow, but I'm not at all sure whether it is right or wrong (I didn't manage to include the pred_bidir_flow parameter; a rough guess at a bidirectional variant is sketched after the code).

import torch
import torch.nn.functional as F


def coords_grid(b, h, w):
    y, x = torch.meshgrid(torch.arange(h), torch.arange(w))  # each [H, W]
    grid = torch.stack([x, y], dim=0).float()  # [2, H, W]
    grid = grid[None].repeat(b, 1, 1, 1)  # [B, 2, H, W]
    return grid

def global_correlation_softmax_1d(feature0, feature1):
    # global correlation along the width dimension only
    b, c, h, w = feature0.shape

    feature0 = feature0.permute(0, 2, 3, 1)  # [B, H, W, C]
    feature1 = feature1.permute(0, 2, 1, 3)  # [B, H, C, W]

    corr = torch.matmul(feature0, feature1) / (c ** 0.5)  # [B, H, W, W]

    # flow from softmax
    init_grid = coords_grid(b, h, w).to(corr.device)  # [B, 2, H, W]
    grid = init_grid.permute(0, 2, 3, 1)  # [B, H, W, 2]

    prob = F.softmax(corr, dim=-1)  # [B, H, W, W]

    correspondence = torch.matmul(prob, grid).permute(0, 3, 1, 2)  # [B, 2, H, W]

    # no pred_bidir_flow handling here; only the forward (left-to-right) direction
    flow = correspondence - init_grid

    return flow
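
In case it helps the discussion, a bidirectional variant could perhaps look like the sketch below. This is just my guess at mimicking the pred_bidir_flow behaviour of concatenating both directions along the batch dimension; it is not verified against the official implementation.

# Hypothetical sketch of bidirectional 1D matching, not the released code:
# the backward (image1 -> image0) correlation is the transpose of the forward
# correlation over its last two dimensions.
def global_correlation_softmax_1d_bidir(feature0, feature1):
    b, c, h, w = feature0.shape

    corr = torch.matmul(feature0.permute(0, 2, 3, 1),
                        feature1.permute(0, 2, 1, 3)) / (c ** 0.5)  # [B, H, W, W]

    init_grid = coords_grid(b, h, w).to(corr.device)  # [B, 2, H, W]
    grid = init_grid.permute(0, 2, 3, 1)  # [B, H, W, 2]

    # forward direction (image0 as query)
    prob = F.softmax(corr, dim=-1)
    flow_fw = torch.matmul(prob, grid).permute(0, 3, 1, 2) - init_grid  # [B, 2, H, W]

    # backward direction (image1 as query)
    prob_bw = F.softmax(corr.transpose(-2, -1), dim=-1)
    flow_bw = torch.matmul(prob_bw, grid).permute(0, 3, 1, 2) - init_grid  # [B, 2, H, W]

    # concatenate along the batch dimension (assumed to mirror pred_bidir_flow)
    return torch.cat((flow_fw, flow_bw), dim=0)  # [2B, 2, H, W]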

For replacing the 2D cross-attention with a 1D variant, I don't know exactly which function I should modify. I assume the modifications should be done in the single_head_split_window_attention() function, but I have no idea how.

I also noticed that the GMStereo results have been added to the Middlebury stereo evaluation. Are you planning to release the code related to that project (this would automatically solve my issues with the modifications)?

tholmb avatar Sep 13 '22 10:09 tholmb

I may have found a solution in your code for the flow1d project. For the 1D correlation it should be straightforward with this, and for the cross-attention with this. I just don't understand why I should modify only the cross-attention but not the self-attention, since they both use the single_head_split_window_attention() function?

tholmb avatar Sep 13 '22 11:09 tholmb

Maybe I should just modify FeatureFlowAttention instead of single_head_split_window_attention. After the modifications it would look like the following:

import torch
import torch.nn as nn


class FeatureFlowAttention1D(nn.Module):
    """
    flow propagation with self-attention on feature
    query: feature0, key: feature0, value: flow
    """

    def __init__(self, in_channels, **kwargs):
        super(FeatureFlowAttention1D, self).__init__()

        self.q_proj = nn.Linear(in_channels, in_channels)
        self.k_proj = nn.Linear(in_channels, in_channels)

        for p in self.parameters():
            if p.dim() > 1:
                nn.init.xavier_uniform_(p)

    def forward(self, feature0, flow):
        # q, k: feature [B, C, H, W], v: flow [B, 2, H, W]

        b, c, h, w = feature0.size()

        query = feature0.view(b, c, h * w).permute(0, 2, 1)  # [B, H*W, C]

        query = self.q_proj(query).view(b, h, w, c)  # [B, H, W, C]
        key = self.k_proj(query).view(b, h, w, c).permute(0, 1, 3, 2)  # [B, H, C, W]

        value = flow.view(b, flow.size(1), h, w).permute(0, 2, 3, 1)  # [B, H, W, 2]

        scores = torch.matmul(query, key) / (c ** 0.5)  # [B, H, W, W]
        prob = torch.softmax(scores, dim=-1)

        out = torch.matmul(prob, value)  # [B, H, W, 2]
        out = out.permute(0, 3, 1, 2)  # [B, 2, H, W]

        return out

Can you confirm that the code block above and the one in the initial message are now OK for the stereo matching task?

tholmb avatar Sep 13 '22 14:09 tholmb

For stereo matching, we should modify the cross-attention to 1D because the cross-attention models cross-view interactions via cross-view similarities. Thus it's redundant to perform 2D cross-attention, because for the stereo matching task the corresponding pixels only lie along the 1D horizontal direction.

You shouldn't modify the FeatureFlowAttention to 1D, since the propagation is based on self-similarity, which is 2D.
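
To make this concrete, a minimal sketch of a per-scanline (1D) cross-attention is shown below: it folds the height dimension into the batch dimension so that each row only attends to the same row of the other view. This is only an illustration, not the released code; the function name and the [B, H*W, C] input layout are assumptions.

# Illustrative sketch only: 1D cross-attention restricted to the same scanline.
# Inputs are assumed to be [B, H*W, C], as in single_head_full_attention.
def single_head_full_attention_1d(q, k, v, h, w):
    b, _, c = q.shape

    q = q.view(b * h, w, c)  # each scanline becomes its own attention sequence
    k = k.view(b * h, w, c)
    v = v.view(b * h, w, c)

    scores = torch.matmul(q, k.transpose(-2, -1)) / (c ** 0.5)  # [B*H, W, W]
    attn = torch.softmax(scores, dim=-1)
    out = torch.matmul(attn, v)  # [B*H, W, C]

    return out.view(b, h * w, c)  # back to [B, H*W, C]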

haofeixu avatar Sep 26 '22 14:09 haofeixu

Thanks for the reply! I'm still a little bit confused about which functions I should modify. It's clear that at least global_correlation_softmax() needs to change, but I'm not sure whether that is enough. I do understand that in stereo matching we are estimating the x-directional pixel displacement (disparity), and thus there is no need for 2D cross-attention. In fact, I'm just not sure where the cross-attention actually happens.

Sorry, I'm new to Transformers and don't fully understand the idea of attention. I will read your paper again, and maybe after that I will have a better picture of the model and some follow-up questions.

tholmb avatar Sep 27 '22 10:09 tholmb

So if I have understood correctly, in TransformerBlock we first run self-attention, then cross-attention, and lastly the FFN. Both self-attention and cross-attention call the TransformerLayer class, but the difference is that self-attention takes (source, source) as input while cross-attention takes (source, target). So we need to modify just the cross-attention from 2D to 1D, but not the self-attention? Inside TransformerLayer the query, key and value go to single_head_split_window_attention(), and in my opinion that is the only place where the cross-attention can be edited.

Maybe the with_shift parameter should always be False in all TransformerBlocks, or do you think it is possible to implement stereo matching if it is True? Or maybe it's just better to use the single_head_full_attention function?

tholmb avatar Sep 27 '22 10:09 tholmb

Small update: I trained the network for the stereo matching task on Scene Flow and it is producing a weird-looking checkerboard artifact.

[image: im_18_a_raft_viz]

Do you have any idea where this could come from? I attached my modifications to the original code below (FeatureFlowAttention is now kept in 2D).

def global_correlation_softmax(feature0, feature1):
    b, c, h, w = feature0.shape

    feature0 = feature0.permute(0, 2, 3, 1) # [B,H,W,C]
    feature1 = feature1.permute(0, 2, 1, 3) # [B,H,C,W]

    corr = torch.matmul(feature0, feature1) / (c ** 0.5) # [B,H,W,W]
    corr = corr.view(b, h*w, w) # [B,H*W,W]

    init_grid = coords_grid(b, h, w).to(corr.device)  # [B,2,H,W]
    grid = init_grid.permute(0, 2, 3, 1)  # [B,H,W,2]

    prob = F.softmax(corr, dim=-1)  # [B, H*W, W]
    prob = prob.view(b, h, w, w) # [B,H,W,W]

    correspondence = torch.matmul(prob, grid).permute(0, 3, 1, 2)  # [B,2,H,W]

    flow = correspondence - init_grid # [B,2,H,W]

    return flow

def single_head_full_attention(q, k, v, h=None, w=None):
    b, _, c = q.shape

    q = q.view(b, h, w, c)  # [B, H, W, C]
    k = k.view(b, h, w, c).permute(0, 1, 3, 2)  # [B, H, C, W]
    v = v.view(b, h, w, c)  # [B, H, W, C]

    scores = torch.matmul(q, k) / (c ** 0.5)  # [B, H, W, W]
    scores = scores.view(b, h * w, w)  # [B, H*W, W]
    attn = F.softmax(scores, dim=-1)  # [B, H*W, W]
    attn = attn.view(b, h, w, w)  # [B, H, W, W]
    out = torch.matmul(attn, v)  # [B, H, W, C]
    out = out.view(b, h * w, c)  # [B, H*W, C]
    return out

# Note: I'm now using single_head_full_attention instead of the Swin-style split-window attention
# Note 2: single_head_full_attention is used for both self-attention and cross-attention
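
As a side note, since the vertical component should be close to zero for rectified stereo pairs, the 2-channel output of global_correlation_softmax above could be reduced to a single-channel disparity map roughly like this (a hypothetical helper, just for illustration):

def flow_to_disparity(flow):
    # flow: [B, 2, H, W]; for a left image matched against the right image the
    # horizontal displacement is negative, so disparity = -flow_x (clamped to >= 0)
    return (-flow[:, :1]).clamp(min=0)  # [B, 1, H, W]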

tholmb avatar Sep 28 '22 08:09 tholmb

Yes, the difference between self-attention and cross-attention is that they take different inputs (self: source, source; cross: source, target).

For stereo matching, yes, you can use full 1D attention at 1/8 resolution, since the computational cost is usually acceptable.
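
To put a rough number on that (assuming, for example, a 384x768 input, i.e. a 48x96 feature map at 1/8 resolution): full 2D attention needs (48*96)^2, about 21.2M, similarity scores per pair, while per-scanline 1D attention needs 48*96^2, about 0.44M, i.e. a factor of H = 48 fewer.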

haofeixu avatar Oct 01 '22 13:10 haofeixu

I guess you are using the Scene Flow pretrained model to do prediction on the Middlebury dataset, so there might be some generalization issues. I would suggest checking the results on the Scene Flow validation set first to verify whether the model is implemented/trained properly.

haofeixu avatar Oct 01 '22 13:10 haofeixu

Thanks for the answer! I will validate on the Scene Flow dataset and let you know if the unwanted behavior remains. In my opinion the issue can be closed.

tholmb avatar Oct 02 '22 19:10 tholmb

Hi @tholmb, I have released the details and code on how to extend GMFlow to stereo; you can find the info at https://haofeixu.github.io/unimatch/ and https://github.com/autonomousvision/unimatch. Thanks.

haofeixu avatar Nov 13 '22 03:11 haofeixu