Error while using CLIP embeddings with VisualBERT.
System Info
- `transformers` version: 4.26.0
- Platform: Linux-4.18.0-348.2.1.el8_5.x86_64-x86_64-with-glibc2.28
- Python version: 3.10.8
- Huggingface_hub version: 0.11.1
- PyTorch version (GPU?): 1.12.1 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
Who can help?
No response
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)
Reproduction
I am trying to use CLIP embeddings with VisualBERT for multimodal image classification:
- Generate CLIP embeddings for the text (`j[1]`) and images (`j[0]`) of each batch from the dataloader.
- Feed these embeddings to the VisualBERT model.
- Compute the cross-entropy loss.
```python
import clip
import torch
from tqdm import tqdm

# (model, clip_model, trainloader, loss_fn, optimizer, EPOCH, DEVICE are set up earlier)
model.train()
for epoch in range(EPOCH):
    for j in tqdm(trainloader):
        # Features: encode the batch's texts (j[1]) and images (j[0]) with CLIP
        text_tokens = clip.tokenize(j[1]).to(DEVICE)
        j[0] = j[0].to(DEVICE)
        with torch.no_grad():
            text_features = clip_model.encode_text(text_tokens).to(DEVICE)
            image_features = clip_model.encode_image(j[0]).to(DEVICE)
        print(text_features.shape)
        print(image_features.shape)
        visualbert_inputs = {
            "inputs_embeds": text_features.to(DEVICE),
            "visual_embeds": image_features.to(DEVICE),
        }
        # Forward pass
        output = model(**visualbert_inputs)
        loss = loss_fn(output, j[2]).to(DEVICE)
        # Backpropagation
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    print(f"EPOCH:{epoch}, LOSS:{loss.item()}")
```
Error:
Expected behavior
According to the documentation, VisualBERT requires `inputs_embeds` to be a `torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, but CLIP's encoders return pooled 2-D features.
How can the CLIP encodings be converted into input embeddings that VisualBERT accepts?
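For reference, a minimal sketch of one way the shapes could be bridged, assuming a small learned projection from CLIP's pooled 512-dim outputs to the widths VisualBERT is configured with (`text_proj` and `visual_proj` below are hypothetical, not part of either library):

```python
import torch
import torch.nn as nn

# CLIP's encoders return pooled 2-D features, e.g. (batch_size, 512) for
# ViT-B/32, while VisualBERT expects 3-D tensors:
#   inputs_embeds: (batch_size, sequence_length, hidden_size)
#   visual_embeds: (batch_size, visual_seq_length, visual_embedding_dim)
clip_dim = 512
hidden_size = 768  # model.config.hidden_size for your VisualBERT checkpoint
visual_dim = 2048  # model.config.visual_embedding_dim; varies per checkpoint

# Hypothetical learned projections (not part of CLIP or transformers);
# they would need to be trained jointly with the rest of the model.
text_proj = nn.Linear(clip_dim, hidden_size)
visual_proj = nn.Linear(clip_dim, visual_dim)

# Stand-ins for clip_model.encode_text(...) / clip_model.encode_image(...)
text_features = torch.randn(8, clip_dim)
image_features = torch.randn(8, clip_dim)

# Cast to fp32 (CLIP returns fp16 on GPU), project to the expected widths,
# and add a length-1 sequence axis so the tensors become 3-D.
inputs_embeds = text_proj(text_features.float()).unsqueeze(1)     # (8, 1, 768)
visual_embeds = visual_proj(image_features.float()).unsqueeze(1)  # (8, 1, 2048)
```

These projections are only a sketch; passing CLIP's 2-D, 512-dim features directly, as in the training loop above, cannot match the 3-D shapes VisualBERT expects.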
cc @ArthurZucker and @amyeroberts