SegFormer
About the ImageNet1k pretrain
Hi, in class MixVisionTransformer, the function 'forward_features' returns a list, but the classification head is a Linear module. I'm confused about how you handle this.
I think the commented out line should be
# x = self.head(x[3])
instead of
# x = self.head(x)
You are right.
Hmm, I don't think that's correct. If x refers to the outputs of all stages, namely x[0]...x[3], then x[3] should have size B*N*C. It can't be followed by an nn.Linear(C, NumClass) to output B*NumClass; you would only get B*N*NumClass. If, however, x is the output of stage 4 alone, then x[3] indexes a single sample in the batch (index 3), with dimension N*C. That makes the error go away but is absolutely wrong.
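To make the shape argument concrete, here is a minimal sketch. It assumes the stage-4 output is still in token layout B*N*C as described in this comment; the sizes (B=2, N=49, C=512, 1000 classes) are illustrative values, not taken from the repo.

```python
import torch
import torch.nn as nn

# Illustrative sizes only (assumed, not read from the SegFormer code).
B, N, C, num_classes = 2, 49, 512, 1000

tokens = torch.randn(B, N, C)        # stage-4 tokens in B*N*C layout
head = nn.Linear(C, num_classes)

# nn.Linear acts on the last dimension only, so N is kept.
per_token_logits = head(tokens)      # shape (B, N, num_classes)

# Indexing the first dimension picks one sample from the batch, not a stage.
one_sample = tokens[1]               # shape (N, C)
one_sample_logits = head(one_sample) # shape (N, num_classes)
```

Neither result has the B*NumClass shape that a classification loss expects.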
It should be x.mean(dim=1), applied before reshaping to B*C*H*W. Reference: https://github.com/whai362/PVT/blob/cceb465b7dfb2b7a48b39074a14a04dedab427e8/classification/pvt_v2.py#L292
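A minimal sketch of that suggestion, following the PVT v2 pattern of averaging over the token axis and then applying the linear head. The tensor here is random and the sizes are assumed for illustration; this is not confirmed to be the SegFormer authors' exact pretraining code.

```python
import torch
import torch.nn as nn

# Illustrative sizes only (assumed).
B, N, C, num_classes = 2, 49, 512, 1000

# Stand-in for the last stage's tokens after the final norm, before any reshape.
tokens = torch.randn(B, N, C)

pooled = tokens.mean(dim=1)          # average over the N tokens -> (B, C)
head = nn.Linear(C, num_classes)
logits = head(pooled)                # (B, num_classes)
```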
Could you tell me the detailed process for the ImageNet1K pretraining? For example, for SegFormer-B2, the output of the backbone is a list x: x[0], x[1], x[2], x[3], with dimensions [batch, 64, 1/4W, 1/4H], [batch, 128, 1/8W, 1/8H], [batch, 320, 1/16W, 1/16H], [batch, 512, 1/32W, 1/32H]. What should I do then?
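One possible way to get from that list to class logits, sketched under assumptions: take only the last feature map, global-average-pool it over the spatial dimensions, and feed the result to a linear head. The feature list below is random data with the B2 channel widths from the question and a 224x224 input assumed; whether the authors pretrained exactly this way is not confirmed in this thread.

```python
import torch
import torch.nn as nn

B, num_classes = 2, 1000

# Stand-in for the backbone output on an assumed 224x224 input (random values).
feats = [
    torch.randn(B, 64, 56, 56),
    torch.randn(B, 128, 28, 28),
    torch.randn(B, 320, 14, 14),
    torch.randn(B, 512, 7, 7),
]

last = feats[-1]                          # (B, 512, 7, 7)
pooled = last.mean(dim=(2, 3))            # global average pool -> (B, 512)
head = nn.Linear(512, num_classes)
logits = head(pooled)                     # (B, num_classes)

# Pooling the B*C*H*W map over H and W matches averaging the B*N*C tokens over N.
tokens = last.flatten(2).transpose(1, 2)  # (B, 49, 512)
same = torch.allclose(pooled, tokens.mean(dim=1), atol=1e-5)
```

The last check shows this is the same quantity as x.mean(dim=1) on the token layout, so either form can be used depending on where in forward_features the pooling is done.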