
About the ImageNet1k pretrain

zhanggj821 opened this issue 2 years ago • 5 comments

Hi, in the class `MixVisionTransformer`, the function `forward_features` returns a list, but the classification head is a single `Linear` module. I'm confused about the details of how you handle this.

zhanggj821 avatar Aug 28 '22 09:08 zhanggj821

I think the commented-out line should be `# x = self.head(x[3])` instead of `# x = self.head(x)`

timdirr avatar Aug 30 '22 13:08 timdirr

You are right.

zhanggj821 avatar Aug 30 '22 13:08 zhanggj821

Hmmm, I don't think that's correct. If `x` refers to the outputs of all stages, namely `x[0]`...`x[3]`, then `x[3]` would be of size B\*N\*C. It can't be followed by an `nn.Linear(C, NumClass)` to output B\*NumClass; you would only get B\*N\*NumClass. If instead `x` is the output of stage 4 alone, then `x[3]` refers to the 3rd feature token in the batch, with dimension N\*C. That would make the shapes work, but it is absolutely wrong semantically.

wangh09 avatar Apr 18 '23 08:04 wangh09

It should be `x.mean(dim=1)` (applied before reshaping to B\*C\*H\*W). Reference: https://github.com/whai362/PVT/blob/cceb465b7dfb2b7a48b39074a14a04dedab427e8/classification/pvt_v2.py#L292

wangh09 avatar Apr 18 '23 08:04 wangh09
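
To make the suggestion above concrete, here is a minimal sketch (not code from the SegFormer repo; shapes and names are illustrative) of how the classification head could consume the stage-4 token sequence, following the PVT v2 pattern linked above: average over the N tokens first, then apply the linear head.

```python
import torch
import torch.nn as nn

# Illustrative sizes: B = batch, N = tokens in stage 4, C = embed dim.
B, N, C, num_classes = 2, 49, 512, 1000

x = torch.randn(B, N, C)            # stage-4 tokens, before any reshape to B*C*H*W
head = nn.Linear(C, num_classes)    # classification head

pooled = x.mean(dim=1)              # average over tokens -> [B, C]
logits = head(pooled)               # -> [B, num_classes]
```

This avoids both problems discussed above: no per-token logits (B\*N\*NumClass) and no indexing into the batch dimension.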

Could you describe the detailed process of the ImageNet1K pretraining? For example, for SegFormer-B2 the output of the backbone is a list x: x[0], x[1], x[2], x[3], with dimensions [batch, 64, W/4, H/4], [batch, 128, W/8, H/8], [batch, 320, W/16, H/16], [batch, 512, W/32, H/32]. What should I do next?

waw123456 avatar Jan 08 '24 08:01 waw123456
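
One plausible answer, sketched under the assumption that classification pretraining follows the PVT v2 reference linked earlier: only the last backbone output `x[3]` is needed. Globally average-pool its spatial dimensions and feed the result to a linear head (the shapes below match the B2 dimensions quoted in the question for a 224×224 input; the head is illustrative, not from the repo).

```python
import torch
import torch.nn as nn

B = 2  # batch size

# The four backbone outputs for a 224x224 input, per the comment above.
feats = [
    torch.randn(B, 64, 56, 56),    # x[0]: [batch, 64, W/4, H/4]
    torch.randn(B, 128, 28, 28),   # x[1]: [batch, 128, W/8, H/8]
    torch.randn(B, 320, 14, 14),   # x[2]: [batch, 320, W/16, H/16]
    torch.randn(B, 512, 7, 7),     # x[3]: [batch, 512, W/32, H/32]
]

head = nn.Linear(512, 1000)           # ImageNet1K classification head

pooled = feats[3].mean(dim=(2, 3))    # global average pool -> [B, 512]
logits = head(pooled)                 # -> [B, 1000]
```

The multi-scale outputs x[0]..x[2] are only consumed by the segmentation decoder; for classification pretraining they can be ignored.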