
This repo is mentioned for classification only, so it can't be used for semantic segmentation yet?

Open shrutishrestha opened this issue 3 years ago • 13 comments

At the last layer of the axial-attention network, there is a linear fc layer that gives the classification output. But I need to work on semantic segmentation. So, using axial attention as the encoder, should I pair it with a U-Net decoder, or how else can I proceed with the segmentation task?

shrutishrestha avatar Sep 11 '20 00:09 shrutishrestha

Right. It does not support segmentation yet.

I would suggest using it as an encoder. Please have a look at Section 3.2 of the Axial-DeepLab paper for details. If you would like to use it as a decoder as well, Appendix B describes our decoder design and its performance.

csrhddlam avatar Sep 11 '20 00:09 csrhddlam
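For context, here is a minimal sketch (mine, not from this repo) of how a classification backbone can be turned into a segmentation encoder: the global pooling and fc head are dropped so spatial feature maps come out instead of class logits. The `stem`/`stages` attribute names are assumptions for illustration, not the repo's actual API:

```python
import torch
import torch.nn as nn

class EncoderOnly(nn.Module):
    """Wrap a classification backbone so it returns spatial feature maps
    instead of class logits (global pooling and the fc layer are skipped).
    Assumes the backbone exposes `stem` and `stages` (hypothetical names)."""

    def __init__(self, backbone):
        super().__init__()
        self.backbone = backbone

    def forward(self, x):
        x = self.backbone.stem(x)
        feats = []
        for stage in self.backbone.stages:
            x = stage(x)
            feats.append(x)  # keep per-stage maps for decoder skip connections
        return feats         # e.g. features at strides 4, 8, 16, 32
```

The per-stage feature list is what a U-Net-style decoder would consume via skip connections.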

Keeping this as the encoder, can other decoders (e.g. U-Net) be used for semantic segmentation?

shrutishrestha avatar Sep 11 '20 02:09 shrutishrestha

Yes.

csrhddlam avatar Sep 11 '20 02:09 csrhddlam

Are you planning to upload the code for the axial decoder anytime soon?

shrutishrestha avatar Sep 15 '20 11:09 shrutishrestha

No, but you could easily convert a block into a decoder block by inserting one or two upsamplings and 1x1 convolutions.

csrhddlam avatar Sep 15 '20 15:09 csrhddlam

Can you please check whether this network design is correct when using the Axial-DeepLab encoder with a decoder? Also, after the decoder reaches shape (1, 32, 56, 56), i.e. (N, C, H, W), what can be done to produce an output of exactly the same size as the input image?

IMG_20200917_085523__01.jpg

shrutishrestha avatar Sep 17 '20 03:09 shrutishrestha

It seems OK, but you should check the details carefully when you implement it.

Can you please check whether this network design is correct when using the Axial-DeepLab encoder with a decoder? Also, after the decoder reaches shape (1, 32, 56, 56), i.e. (N, C, H, W), what can be done to produce an output of exactly the same size as the input image?

IMG_20200917_085523__01.jpg

phj128 avatar Sep 18 '20 03:09 phj128

In the decoder, do we use 3, 4, 6, 3 blocks per stage as on the encoder side, or only one block per stage as shown in the figure above?

shrutishrestha avatar Sep 18 '20 05:09 shrutishrestha

@shrutishrestha Have you completed the decoder block?

sahilrider avatar Feb 15 '21 12:02 sahilrider

In the paper it is written "Firstly, to extract dense feature maps, DeepLab [13] changes the stride and atrous rates of the last one or two stages in ResNet [32]. Similarly, we remove the stride of the last stage but we do not implement the `atrous' attention module, since our axial-attention already captures global information for the whole input."

Does this mean that for the last stage we don't need downsampling, and the output size of the last stage will be [1, 1024, 14, 14] for the segmentation task? @csrhddlam

sahilrider avatar Feb 15 '21 19:02 sahilrider

Right, it will be [1, 1024, 14, 14], although in practice, the input resolution for semantic segmentation is usually larger than 224x224.

csrhddlam avatar Feb 19 '21 00:02 csrhddlam
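The shape above can be checked with simple stride arithmetic (my own sketch): a ResNet-style backbone downsamples by 4 in the stem and by 2 in each later stage, so removing the last stage's stride changes the output stride from 32 to 16, and a 224x224 input gives 14x14 features:

```python
def output_size(input_size, stage_strides):
    """Spatial size after a backbone whose stages have the given strides."""
    size = input_size
    for s in stage_strides:
        size //= s
    return size

# Standard classification backbone: stem stride 4, then stages of stride 2.
print(output_size(224, [4, 2, 2, 2]))  # → 7  (output stride 32)
# Remove the stride of the last stage, as described in the paper:
print(output_size(224, [4, 2, 2, 1]))  # → 14 (output stride 16)
```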

Yes right. I was assuming 224x224 for calculation. Thanks.

sahilrider avatar Feb 19 '21 11:02 sahilrider

Can you please check whether this network design is correct when using the Axial-DeepLab encoder with a decoder? Also, after the decoder reaches shape (1, 32, 56, 56), i.e. (N, C, H, W), what can be done to produce an output of exactly the same size as the input image?

IMG_20200917_085523__01.jpg

I see the skip connection is passed through a conv1x1; would you please explain why you designed it this way?

The second thing I notice is that in the decoder (in the bottom-right corner of the figure), the input size is 1x1024x7x7, but it somehow becomes 1x512x7x7 before the conv and upsample, without any operation in between?

Thank you

lkqnaruto avatar Sep 27 '21 20:09 lkqnaruto
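On the 1x1 conv over the skip connection: a common motivation (the pattern used in DeepLabv3+, which may or may not be what the figure's author intended) is to reduce the channel count of the low-level skip features so they do not dominate the decoder features after concatenation. A minimal sketch under that assumption:

```python
import torch
import torch.nn as nn

class SkipFusion(nn.Module):
    """Fuse a low-level skip feature with an upsampled decoder feature.
    The 1x1 conv reduces the skip's channels (DeepLabv3+-style) so the
    low-level features do not dominate after concatenation."""

    def __init__(self, skip_ch, reduced_ch=48):
        super().__init__()
        self.reduce = nn.Conv2d(skip_ch, reduced_ch, kernel_size=1)

    def forward(self, skip, dec):
        # Upsample the decoder feature to the skip's spatial size,
        # then concatenate along the channel dimension.
        dec = nn.functional.interpolate(dec, size=skip.shape[-2:],
                                        mode="bilinear", align_corners=False)
        return torch.cat([self.reduce(skip), dec], dim=1)
```

For example, a (1, 256, 56, 56) skip fused with a (1, 512, 28, 28) decoder feature yields (1, 48 + 512, 56, 56).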