
Error while using CLIP embeddings with VisualBERT.

Open · nityanandmathur opened this issue 1 year ago • 1 comment

System Info

  • transformers version: 4.26.0
  • Platform: Linux-4.18.0-348.2.1.el8_5.x86_64-x86_64-with-glibc2.28
  • Python version: 3.10.8
  • Huggingface_hub version: 0.11.1
  • PyTorch version (GPU?): 1.12.1 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed

Who can help?

No response

Information

  • [ ] The official example scripts
  • [X] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [X] My own task or dataset (give details below)

Reproduction

I am trying to use CLIP embeddings with VisualBERT for multimodal image classification.

  1. Generating CLIP embeddings for the text (j[1]) and images (j[0]) of each batch from the dataloader.
  2. Providing these embeddings to the VisualBERT model.
  3. Calculating the cross-entropy loss.
model.train()

for epoch in range(EPOCH):
    for j in tqdm(trainloader):
        # Features
        text_tokens = clip.tokenize(j[1]).to(DEVICE)
        j[0] = j[0].to(DEVICE)
        with torch.no_grad():
            text_features = clip_model.encode_text(text_tokens).to(DEVICE)
            image_features = clip_model.encode_image(j[0]).to(DEVICE)
        
        print(text_features.shape)
        print(image_features.shape)
        visualbert_inputs = {
            "inputs_embeds": text_features.to(DEVICE),
            "visual_embeds": image_features.to(DEVICE),
        }

        # Forward Pass
        output = model(**visualbert_inputs)
        loss = loss_fn(output, j[2]).to(DEVICE)

        # Backpropagation
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        print(f"EPOCH:{epoch}, LOSS:{loss.item()}")

Error: [screenshot of the error traceback]
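
For context, the two print statements in the loop above most likely show 2-D tensors. A minimal standalone check (a sketch, assuming the openai CLIP package and the ViT-B/32 checkpoint, since the exact checkpoint is not shown above):

import torch
import clip

# Assuming the ViT-B/32 checkpoint, whose pooled features are 512-dimensional;
# other CLIP variants use a different width.
clip_model, preprocess = clip.load("ViT-B/32", device="cpu")

text_tokens = clip.tokenize(["a photo of a cat"])
dummy_image = torch.randn(1, 3, 224, 224)  # stand-in for a preprocessed image batch

with torch.no_grad():
    text_features = clip_model.encode_text(text_tokens)    # torch.Size([1, 512]) -- 2-D
    image_features = clip_model.encode_image(dummy_image)  # torch.Size([1, 512]) -- 2-D

# VisualBERT, however, expects 3-D tensors with an explicit sequence dimension:
#   inputs_embeds: (batch_size, sequence_length, hidden_size)
#   visual_embeds: (batch_size, visual_seq_length, visual_embedding_dim)
print(text_features.shape, image_features.shape)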

Expected behavior

The VisualBERT model requires inputs_embeds to be a torch.FloatTensor of shape (batch_size, sequence_length, hidden_size). How can the CLIP encodings be converted into the input embeddings that VisualBERT expects?
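
One possible way to bridge the gap (a sketch only, not an official recipe; the checkpoint name, the CLIP_DIM value, and the text_proj/visual_proj layers below are assumptions introduced for illustration) is to treat each pooled CLIP feature as a length-1 sequence and project it to the widths the VisualBERT checkpoint expects:

import torch
import torch.nn as nn
from transformers import VisualBertModel

CLIP_DIM = 512  # pooled feature width of CLIP ViT-B/32; other checkpoints differ

model = VisualBertModel.from_pretrained("uclanlp/visualbert-vqa-coco-pre")
text_proj = nn.Linear(CLIP_DIM, model.config.hidden_size)             # -> hidden_size
visual_proj = nn.Linear(CLIP_DIM, model.config.visual_embedding_dim)  # -> visual_embedding_dim

# Stand-ins for clip_model.encode_text(...) / clip_model.encode_image(...)
text_features = torch.randn(4, CLIP_DIM)
image_features = torch.randn(4, CLIP_DIM)

# unsqueeze(1) adds the missing sequence dimension; the linear layers map the
# 512-d CLIP features to the widths VisualBERT expects.
inputs_embeds = text_proj(text_features).unsqueeze(1)     # (4, 1, hidden_size)
visual_embeds = visual_proj(image_features).unsqueeze(1)  # (4, 1, visual_embedding_dim)

outputs = model(inputs_embeds=inputs_embeds, visual_embeds=visual_embeds)
print(outputs.last_hidden_state.shape)  # (4, 2, hidden_size)

Note that real CLIP features come out as float16 when the CLIP model runs on GPU, so they may also need a .float() cast before the projection layers.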

nityanandmathur avatar Mar 23 '23 21:03 nityanandmathur

cc @ArthurZucker and @amyeroberts

sgugger avatar Mar 23 '23 21:03 sgugger