
In the transformer, the sizes of the pos embedding and the vis embedding do not match, so they cannot be added

Open wennyHou opened this issue 3 years ago • 9 comments

I first built a model with the build_q2l function and then fed it a randn tensor as input. The model hits the following problem during forward.

Traceback (most recent call last):
  File "debug.py", line 7, in <module>
    output = model(input)
  File "/mnt/data3/ai/miniconda/envs/hwy_ReceiptCls/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/mnt/data3/houwanyi/ReceiptCls/query2labels/lib/models/query2label.py", line 78, in forward
    hs = self.transformer(self.input_proj(src), query_input, pos)[0] # B,K,d
  File "/mnt/data3/ai/miniconda/envs/hwy_ReceiptCls/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/mnt/data3/houwanyi/ReceiptCls/query2labels/lib/models/transformer.py", line 107, in forward
    memory = self.encoder(src, src_key_padding_mask=mask, pos=pos_embed)
  File "/mnt/data3/ai/miniconda/envs/hwy_ReceiptCls/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/mnt/data3/houwanyi/ReceiptCls/query2labels/lib/models/transformer.py", line 134, in forward
    output = layer(output, src_mask=mask,
  File "/mnt/data3/ai/miniconda/envs/hwy_ReceiptCls/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/mnt/data3/houwanyi/ReceiptCls/query2labels/lib/models/transformer.py", line 253, in forward
    return self.forward_post(src, src_mask, src_key_padding_mask, pos)
  File "/mnt/data3/houwanyi/ReceiptCls/query2labels/lib/models/transformer.py", line 217, in forward_post
    q = k = self.with_pos_embed(src, pos)
  File "/mnt/data3/houwanyi/ReceiptCls/query2labels/lib/models/transformer.py", line 208, in with_pos_embed
    return tensor if pos is None else tensor + pos
RuntimeError: The size of tensor a (1024) must match the size of tensor b (2048) at non-singleton dimension 2

wennyHou avatar Jan 26 '22 01:01 wennyHou

RuntimeError: The size of tensor a (1024) must match the size of tensor b (2048) at non-singleton dimension 2

sorrowyn avatar Jan 26 '22 07:01 sorrowyn

RuntimeError: The size of tensor a (1024) must match the size of tensor b (2048) at non-singleton dimension 2

What would cause the image embedding size and the pos size to differ so that the two cannot be added?
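
One way to read the error message: element-wise addition follows broadcasting rules, under which two sizes along a dimension are compatible only if they are equal or one of them is 1. The sketch below re-implements that rule in plain Python. The function name `broadcast_shape` and the example shapes (144 positions, batch size 8) are made up for illustration; only the channel sizes 1024 and 2048 come from the traceback.

```python
from itertools import zip_longest


def broadcast_shape(a, b):
    """Return the broadcast shape of two shapes, or raise ValueError.

    Hypothetical helper: shapes are aligned from the right, and two sizes
    are compatible only if they are equal or one of them is 1.
    """
    ndim = max(len(a), len(b))
    result = []
    pairs = zip_longest(reversed(a), reversed(b), fillvalue=1)
    for i, (da, db) in enumerate(pairs):
        if da != db and da != 1 and db != 1:
            dim = ndim - 1 - i  # index counted from the left
            raise ValueError(
                f"size {da} must match size {db} at dimension {dim}"
            )
        result.append(db if da == 1 else da)
    return tuple(reversed(result))


# Assumed layout: (positions, batch, channels).
src_shape = (144, 8, 1024)
pos_shape = (144, 8, 2048)
try:
    broadcast_shape(src_shape, pos_shape)
except ValueError as err:
    print(err)
```

Under this reading, "dimension 2" is the channel axis, so the question becomes why the feature channels and the positional-embedding channels were built with different widths.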

wennyHou avatar Jan 26 '22 08:01 wennyHou

Create multiple instances of pos_embedding, or try this one, PositionEmbeddingLearned(nn.Module): https://github.com/facebookresearch/detr/blob/main/models/position_encoding.py

sorrowyn avatar Jan 27 '22 08:01 sorrowyn

Try again with the --keep_input_proj argument added.

nekoosuki avatar Feb 23 '22 06:02 nekoosuki

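Same error. I was reproducing the model training with unmodified code, and the error occurs when training on a single GPU. swin_L_384_22k raises the error; the resnet backbone does not. Could something have been changed in the last few commits without being tested?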

python3 -m torch.distributed.launch --nproc_per_node=1 \
  main_mlc.py \
  --backbone swin_L_384_22k --dataname coco14 --batch-size 8 --print-freq 100 \
  --output "/home/bpfs/querry2" \
  --world-size 1 --rank 0 --dist-url tcp://127.0.0.1:3717 \
  --gamma_pos 0 --gamma_neg 2 --dtgfl \
  --epochs 80 --lr 1e-4 --optim AdamW \
  --num_class 80 --img_size 384 --weight-decay 1e-2 \
  --cutout --n_holes 1 --cut_fact 0.5 \
  --hidden_dim 2048 --dim_feedforward 8192 \
  --enc_layers 1 --dec_layers 2 --nheads 4 \
  --early-stop --amp

Error output:

No inplace_abn found, please make sure you won't use TResNet as backbone!
No inplace_abn found, please make sure you won't use TResNet as backbone!
single GPU train
| distributed init (local_rank 0): tcp://127.0.0.1:3717
[05/29 23:54:08.581]: Command: main_mlc.py --local_rank=0 --backbone swin_L_384_22k --dataname coco14 --batch-size 8 --print-freq 100 --output /home/bpfsrw3/makaili/models/querry2 --world-size 1 --rank 0 --dist-url tcp://127.0.0.1:3717 --gamma_pos 0 --gamma_neg 2 --dtgfl --epochs 80 --lr 1e-4 --optim AdamW --num_class 80 --img_size 384 --weight-decay 1e-2 --cutout --n_holes 1 --cut_fact 0.5 --hidden_dim 2048 --dim_feedforward 8192 --enc_layers 1 --dec_layers 2 --nheads 4 --early-stop --amp
[05/29 23:54:08.583]: Full config saved to /home/bpfsrw3/makaili/models/querry2/config.json
[05/29 23:54:08.583]: world size: 1
[05/29 23:54:08.584]: dist.get_rank(): 0
[05/29 23:54:08.584]: local_rank: 0
[05/29 23:54:08.584]: build model
build_q2l 1 build_backbone 2 swin_L_384_22k 00pretrained model 11pretrained model 22pretrained model backbone done build_backbone success 2
set model.input_proj to Indentify!
[05/29 23:54:27.831]: build model success
[05/29 23:54:33.135]: make criterion
Using Cutout!!!
loading annotations into memory...
Done (t=16.46s)
creating index...
index created!
loading annotations into memory...
Done (t=9.67s)
creating index...
index created!
len(train_dataset): 82783
len(val_dataset): 40504
/home/bpfsrw3/makaili/software/py37n/lib/python3.7/site-packages/torch/utils/data/dataloader.py:477: UserWarning: This DataLoader will create 32 worker processes in total. Our suggested max number of worker in current system is 28, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
  cpuset_checked))
[05/29 23:55:03.681]: lr:4.000000000000002e-06
Traceback (most recent call last):
  File "main_mlc.py", line 727, in <module>
    main()
  File "main_mlc.py", line 224, in main
    return main_worker(args, logger)
  File "main_mlc.py", line 351, in main_worker
    loss = train(train_loader, model, ema_m, criterion, optimizer, scheduler, epoch, args, logger)
  File "main_mlc.py", line 481, in train
    output = model(images)
  File "/home/bpfsrw3/makaili/software/py37n/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/bpfsrw3/makaili/software/py37n/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 705, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/home/bpfsrw3/makaili/software/py37n/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/bpfsrw3/makaili/project/query2labels/lib/models/query2label.py", line 78, in forward
    hs = self.transformer(self.input_proj(src), query_input, pos)[0] # B,K,d
  File "/home/bpfsrw3/makaili/software/py37n/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/bpfsrw3/makaili/project/query2labels/lib/models/transformer.py", line 108, in forward
    memory = self.encoder(src, src_key_padding_mask=mask, pos=pos_embed)
  File "/home/bpfsrw3/makaili/software/py37n/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/bpfsrw3/makaili/project/query2labels/lib/models/transformer.py", line 135, in forward
    src_key_padding_mask=src_key_padding_mask, pos=pos)
  File "/home/bpfsrw3/makaili/software/py37n/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/bpfsrw3/makaili/project/query2labels/lib/models/transformer.py", line 247, in forward
    return self.forward_post(src, src_mask, src_key_padding_mask, pos)
  File "/home/bpfsrw3/makaili/project/query2labels/lib/models/transformer.py", line 214, in forward_post
    q = k = self.with_pos_embed(src, pos)
  File "/home/bpfsrw3/makaili/project/query2labels/lib/models/transformer.py", line 207, in with_pos_embed
    return tensor if pos is None else tensor + pos
RuntimeError: The size of tensor a (1536) must match the size of tensor b (2048) at non-singleton dimension 2
Killing subprocess 4198
Traceback (most recent call last):
  File "/home/bpfsrw3/makaili/software/py37n/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/bpfsrw3/makaili/software/py37n/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/bpfsrw3/makaili/software/py37n/lib/python3.7/site-packages/torch/distributed/launch.py", line 340, in <module>
    main()
  File "/home/bpfsrw3/makaili/software/py37n/lib/python3.7/site-packages/torch/distributed/launch.py", line 326, in main
    sigkill_handler(signal.SIGTERM, None) # not coming back
  File "/home/bpfsrw3/makaili/software/py37n/lib/python3.7/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
    raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/bpfsrw3/makaili/software/py37n/bin/python3', '-u', 'main_mlc.py', '--local_rank=0', '--backbone', 'swin_L_384_22k', '--dataname', 'coco14', '--batch-size', '8', '--print-freq', '100', '--output', '/home/bpfsrw3/makaili/models/querry2', '--world-size', '1', '--rank', '0', '--dist-url', 'tcp://127.0.0.1:3717', '--gamma_pos', '0', '--gamma_neg', '2', '--dtgfl', '--epochs', '80', '--lr', '1e-4', '--optim', 'AdamW', '--num_class', '80', '--img_size', '384', '--weight-decay', '1e-2', '--cutout', '--n_holes', '1', '--cut_fact', '0.5', '--hidden_dim', '2048', '--dim_feedforward', '8192', '--enc_layers', '1', '--dec_layers', '2', '--nheads', '4', '--early-stop', '--amp']' returned non-zero exit status 1.

The key part is this:

  File "/home/bpfsrw3/makaili/project/query2labels/lib/models/transformer.py", line 214, in forward_post
    q = k = self.with_pos_embed(src, pos)
  File "/home/bpfsrw3/makaili/project/query2labels/lib/models/transformer.py", line 207, in with_pos_embed
    return tensor if pos is None else tensor + pos
RuntimeError: The size of tensor a (1536) must match the size of tensor b (2048) at non-singleton dimension 2
Killing subprocess 4198

macqueen09 avatar May 29 '22 16:05 macqueen09

Try setting hidden_dim to 1536: --hidden_dim 1536
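
The fixes suggested in this thread (`--keep_input_proj`, or a `--hidden_dim` equal to the backbone's output width) seem to aim at the same condition: the features entering the encoder must have as many channels as the positional embedding. A minimal sketch of that check follows. It assumes, going by the `set model.input_proj to Indentify!` log line, that the projection is replaced by an identity unless `--keep_input_proj` is passed, and that the positional embedding is built with `hidden_dim` channels. `encoder_input_channels` and `check_dims` are hypothetical names, not part of query2labels.

```python
def encoder_input_channels(backbone_channels, hidden_dim, keep_input_proj):
    """Channel width of the features that reach the transformer encoder.

    Assumption: with keep_input_proj the features are projected to
    hidden_dim; without it they pass through unchanged (identity).
    """
    return hidden_dim if keep_input_proj else backbone_channels


def check_dims(backbone_channels, hidden_dim, keep_input_proj=False):
    """Raise ValueError if features and positional embedding would not add."""
    src = encoder_input_channels(backbone_channels, hidden_dim, keep_input_proj)
    if src != hidden_dim:
        raise ValueError(
            f"feature channels ({src}) != hidden_dim ({hidden_dim}); "
            "match hidden_dim to the backbone or keep the input projection"
        )
    return src


# 1536 is the feature width reported in the swin_L_384_22k traceback.
check_dims(1536, 1536)                        # hidden_dim matches the backbone
check_dims(1536, 2048, keep_input_proj=True)  # projection bridges the gap
try:
    check_dims(1536, 2048)                    # the failing configuration
except ValueError as err:
    print(err)
```

Running a check like this before building the model would surface the mismatch at configuration time instead of deep inside `with_pos_embed`.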

zugofn avatar Jun 22 '22 14:06 zugofn

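You can refer to this config file provided by the author: hidden dim needs to be changed to 1024 and dim_feedforward to 4096, with img_size 384. The other settings should not affect whether the code runs, but if you want to reproduce the results they should probably match the author's settings as well.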

verazuo avatar Oct 30 '22 22:10 verazuo

Maybe out of context for the original issue, but is there any concrete reason for suggesting that dim_feedforward be 4 * hidden_dim? Can they be the same, as in the paper, which uses d = d0 = 2432, and for every other model, say resnet50, d = d0 = 2048?
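
On the shape side at least, the two values look independent. Assuming the feed-forward sublayer is the usual pair of linear layers (`hidden_dim -> dim_feedforward -> hidden_dim`), any `dim_feedforward` yields an output that adds back onto the residual stream, so the choice mainly changes the parameter count. The sketch below counts those parameters. `ffn_param_count` is a hypothetical helper, and the 4x ratio is the convention from the original Transformer paper rather than something this thread confirms for query2labels.

```python
def ffn_param_count(hidden_dim, dim_feedforward):
    """Weights and biases of Linear(d, f) followed by Linear(f, d)."""
    first = hidden_dim * dim_feedforward + dim_feedforward
    second = dim_feedforward * hidden_dim + hidden_dim
    return first + second


# Compare dim_feedforward == hidden_dim with the 4x setting for d = 2048.
for f in (2048, 8192):
    print(f, ffn_param_count(2048, f))
```

Whether the smaller setting affects accuracy is a separate, empirical question that this count does not answer.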

saishkomalla avatar Nov 17 '22 14:11 saishkomalla

How did the original poster solve this problem? I have run into the same issue.

Jianghold avatar Aug 17 '24 01:08 Jianghold