CoCa RoBERTa Attention Map Size Issue
Hi! I'm trying to train CoCa using the pretrained RoBERTa weights (has the causal masking issue #445 been addressed?), but I am running into an error with the attention mask sizes. Any help would be greatly appreciated :).
Below is the command I'm running:
torchrun --nproc_per_node 4 -m training.main \
--train-data="$COYO_PATH/train" \
--train-num-samples 3000000 \
--val-data="$COYO_PATH/val" \
--val-num-samples 10000 \
--dataset-type webdataset \
--batch-size 128 \
--warmup 2000 \
--epochs 100 \
--lr 5e-4 \
--precision amp \
--workers 6 \
--model "coca_roberta-ViT-B-32" \
--name "coca_coyo" \
--report-to "wandb" \
--wandb-project-name "open-clip-baseline" \
--imagenet-val "$IMAGENET_HOME/validation" \
--gather-with-grad \
--local-loss \
However, this fails with the following traceback:
Traceback (most recent call last):
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "src/training/main.py", line 508, in <module>
main(sys.argv[1:])
File "src/training/main.py", line 436, in main
train_one_epoch(model, data, loss, epoch, optimizer, scaler, scheduler, dist_model, args, tb_writer=writer)
File "src/training/train.py", line 101, in train_one_epoch
model_out = model(images, texts)
... (omitted for brevity)
File ".venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File ".venv/lib/python3.10/site-packages/torch/nn/modules/activation.py", line 1241, in forward
attn_output, attn_output_weights = F.multi_head_attention_forward(
File ".venv/lib/python3.10/site-packages/torch/nn/functional.py", line 5354, in multi_head_attention_forward
raise RuntimeError(f"The shape of the 2D attn_mask is {attn_mask.shape}, but should be {correct_2d_size}.")
RuntimeError: The shape of the 2D attn_mask is torch.Size([76, 76]), but should be (77, 77).
Inspecting the error, I tried changing the multimodal context length to 77, which yields the following error instead:
../aten/src/ATen/native/cuda/NLLLoss2d.cu:104: nll_loss2d_forward_kernel: block: [38,0,0], thread: [13,0,0] Assertion `t >= 0 && t < n_classes` failed.
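That assertion is the generic CUDA-side check that every cross-entropy target id lies in [0, n_classes). A minimal sketch of the suspected mismatch, assuming the default CLIP BPE vocab (49408) is used for the caption logits while roberta-base token ids go up to 50264 (these figures are assumptions about this setup, not something read out of the logs):

import torch
import torch.nn.functional as F

# The caption loss computes cross-entropy over `vocab_size` logit classes, but a
# roberta-base tokenizer can emit ids up to 50264. Any target id >= vocab_size
# trips the `t >= 0 && t < n_classes` assertion on CUDA (on CPU it raises an IndexError).
vocab_size = 49408                                  # default CLIP BPE vocab
logits = torch.randn(2, 4, vocab_size)              # (batch, seq, classes)
targets = torch.tensor([[0, 50263, 8, 2],
                        [0, 9, 50264, 2]])          # RoBERTa-range token ids
loss = F.cross_entropy(logits.permute(0, 2, 1), targets)  # fails: target out of bounds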
@sandeepmukh I think a few things are wrong for this... first, update to the main branch.
Then, I think something like this is needed in the CoCa model to replace the current vocab_size logic between the text and multimodal text towers:
if getattr(text_cfg, "hf_model_name", None) is not None:
    vocab_size = getattr(self.text, "vocab_size", text_cfg.vocab_size)
else:
    vocab_size = text_cfg.vocab_size
Also, the context_length used by the tokenizer is sourced from text_cfg by default, so text_cfg and multimodal_cfg should have the same context_length values in the config (I think) to work best, but I'm not 100% sure there.
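In other words, something along these lines in the model config, shown here as a Python dict purely for illustration; the exact keys and values are assumptions about a typical coca_roberta-ViT-B-32 setup, not the actual file:

# Illustrative excerpt of a CoCa + RoBERTa model config: keep context_length
# consistent between the text tower (which also drives the tokenizer) and the
# multimodal decoder so the causal attn_mask matches the token sequence length.
model_cfg = {
    "text_cfg": {
        "hf_model_name": "roberta-base",
        "hf_tokenizer_name": "roberta-base",
        "context_length": 76,   # tokenizer pads/truncates to this length
        "width": 768,
    },
    "multimodal_cfg": {
        "context_length": 76,   # should match the token sequence it attends over
        "width": 768,
        "heads": 12,
        "layers": 12,
    },
}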