
Error while using CLIP embeddings with VisualBERT.

Open · nityanandmathur opened this issue 1 year ago • 1 comment

System Info

  • transformers version: 4.26.0
  • Platform: Linux-4.18.0-348.2.1.el8_5.x86_64-x86_64-with-glibc2.28
  • Python version: 3.10.8
  • Huggingface_hub version: 0.11.1
  • PyTorch version (GPU?): 1.12.1 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed

Who can help?

No response

Information

  • [ ] The official example scripts
  • [X] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [X] My own task or dataset (give details below)

Reproduction

I am trying to use CLIP embeddings with VisualBERT for multimodal image classification.

  1. Generating CLIP embeddings for the text (j[1]) and images (j[0]) of each batch from the dataloader.
  2. Providing these embeddings to the VisualBERT model.
  3. Calculating the cross-entropy loss.
model.train()

for epoch in range(EPOCH):
    for j in tqdm(trainloader):
        # Features
        text_tokens = clip.tokenize(j[1]).to(DEVICE)
        j[0] = j[0].to(DEVICE)
        with torch.no_grad():
            text_features = clip_model.encode_text(text_tokens).to(DEVICE)
            image_features = clip_model.encode_image(j[0]).to(DEVICE)
        
        print(text_features.shape)
        print(image_features.shape)
        visualbert_inputs = {
            "inputs_embeds": text_features.to(DEVICE),
            "visual_embeds": image_features.to(DEVICE),
        }

        # Forward Pass
        output = model(**visualbert_inputs)
        loss = loss_fn(output, j[2]).to(DEVICE)

        # Backpropagation
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        print(f"EPOCH:{epoch}, LOSS:{loss.item()}")

Error: [screenshot of the error traceback]
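
For context, the two print statements in the loop above most likely show 2-D tensors. A minimal standalone check (a sketch, assuming the openai CLIP package and the ViT-B/32 checkpoint, since the exact checkpoint is not shown above):

import torch
import clip

# Assuming the ViT-B/32 checkpoint, whose pooled features are 512-dimensional;
# other CLIP variants use a different width.
clip_model, preprocess = clip.load("ViT-B/32", device="cpu")

text_tokens = clip.tokenize(["a photo of a cat"])
dummy_image = torch.randn(1, 3, 224, 224)  # stand-in for a preprocessed image batch

with torch.no_grad():
    text_features = clip_model.encode_text(text_tokens)    # torch.Size([1, 512]) -- 2-D
    image_features = clip_model.encode_image(dummy_image)  # torch.Size([1, 512]) -- 2-D

# VisualBERT, however, expects 3-D tensors with an explicit sequence dimension:
#   inputs_embeds: (batch_size, sequence_length, hidden_size)
#   visual_embeds: (batch_size, visual_seq_length, visual_embedding_dim)
print(text_features.shape, image_features.shape)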

Expected behavior

The VisualBERT model requires inputs_embeds to be a torch.FloatTensor of shape (batch_size, sequence_length, hidden_size). How can the CLIP encodings be converted into the input embeddings that VisualBERT expects?
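
One possible way to bridge the gap (a sketch only, not an official recipe; the checkpoint name, the CLIP_DIM value, and the text_proj/visual_proj layers below are assumptions introduced for illustration) is to treat each pooled CLIP feature as a length-1 sequence and project it to the widths the VisualBERT checkpoint expects:

import torch
import torch.nn as nn
from transformers import VisualBertModel

CLIP_DIM = 512  # pooled feature width of CLIP ViT-B/32; other checkpoints differ

model = VisualBertModel.from_pretrained("uclanlp/visualbert-vqa-coco-pre")
text_proj = nn.Linear(CLIP_DIM, model.config.hidden_size)             # -> hidden_size
visual_proj = nn.Linear(CLIP_DIM, model.config.visual_embedding_dim)  # -> visual_embedding_dim

# Stand-ins for clip_model.encode_text(...) / clip_model.encode_image(...)
text_features = torch.randn(4, CLIP_DIM)
image_features = torch.randn(4, CLIP_DIM)

# unsqueeze(1) adds the missing sequence dimension; the linear layers map the
# 512-d CLIP features to the widths VisualBERT expects.
inputs_embeds = text_proj(text_features).unsqueeze(1)     # (4, 1, hidden_size)
visual_embeds = visual_proj(image_features).unsqueeze(1)  # (4, 1, visual_embedding_dim)

outputs = model(inputs_embeds=inputs_embeds, visual_embeds=visual_embeds)
print(outputs.last_hidden_state.shape)  # (4, 2, hidden_size)

Note that real CLIP features come out as float16 when the CLIP model runs on GPU, so they may also need a .float() cast before the projection layers.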

nityanandmathur avatar Mar 23 '23 21:03 nityanandmathur

cc @ArthurZucker and @amyeroberts

sgugger avatar Mar 23 '23 21:03 sgugger