
Add Video Llava

zucchini-nlp opened this pull request · 14 comments

What does this PR do?

Fixes #29640. Adds a new model, Video-LLaVA, to the library.

This is a draft PR; I will add more here later.

The model now accepts both video and image inputs in the same batch. Each visual type has its own special token, so we do not need to repeat the <image> token 8 times for 8 frames. Here is a short code snippet:

import torch
import numpy as np
import requests
from PIL import Image
from decord import VideoReader
from huggingface_hub import hf_hub_download
from transformers import VideoLlavaForConditionalGeneration, VideoLlavaProcessor


def load_video_tensor(video_path, n_frms=4, transform=False):
    # Decode the video at 224x224 and sample n_frms evenly spaced frames.
    vr = VideoReader(uri=video_path, height=224, width=224)
    n_frms = min(n_frms, len(vr))
    indices = np.arange(0, len(vr), len(vr) / n_frms).astype(int)
    frames = vr.get_batch(indices).asnumpy()  # (n_frms, 224, 224, 3) uint8 array
    return frames

# -------------------------------------------------------------------------------------------------------------------

model = VideoLlavaForConditionalGeneration.from_pretrained("/home/raushan/video_llava/")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

video_path = hf_hub_download(repo_id="raushan-testing-hf/videos-test", filename="sample_demo_1.mp4", repo_type="dataset")
clip = load_video_tensor(video_path, n_frms=8)

processor = VideoLlavaProcessor.from_pretrained("/home/raushan/video_llava")
prompt_vid = "USER: <video>What do you see? ASSISTANT: A baby with a remote control. USER: <video>Why is this funny? ASSISTANT:"
prompt_img = "USER: <image>How many cats are there in the image? ASSISTANT:"
prompt_mix = "USER: <image>How many cats are there in the image? ASSISTANT: 2. USER: <video>Why is this video funny? ASSISTANT:"

inputs = processor(text=[prompt_mix, prompt_img], visual_inputs=[image, clip, image], padding=True, return_tensors="pt")

out = model.generate(**inputs, max_new_tokens=40)
print(processor.batch_decode(out, skip_special_tokens=True, clean_up_tokenization_spaces=True))

# [
#    'USER:  How many cats are there in the image? ASSISTANT: 2. USER:  Why is this video funny? ASSISTANT: The video is funny because the baby is sitting on 
#         the bed and reading a book, which is an unusual and amusing sight. Babies are typically not known for reading books, and the fact', 
#    'USER:  How many cats are there in the image? ASSISTANT: There are two cats in the image..'
# ]


zucchini-nlp · Mar 19 '24

@LinB203 hey! As we talked about before, here is a draft PR of Video-LLaVA. I checked that the modeling part runs without errors and generates output similar to the original repo.

To update the model files on the hub, you can use the convert_weights script and use this branch to test whether the model loads correctly. Whenever you are available, can you look through and check if I missed anything important? :)

zucchini-nlp · Mar 19 '24
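For illustration, a minimal sketch of checking that a locally converted checkpoint loads on this branch; the path below is a placeholder, and the conversion script's exact arguments are not shown here.

import torch
from transformers import VideoLlavaForConditionalGeneration, VideoLlavaProcessor

# Placeholder path to a checkpoint produced by the conversion script.
converted_path = "/path/to/converted/video_llava"

model = VideoLlavaForConditionalGeneration.from_pretrained(converted_path, torch_dtype=torch.float16)
processor = VideoLlavaProcessor.from_pretrained(converted_path)
print(model.config.model_type)  # sanity check that the new model class/config resolved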

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@LinB203 pinging in case the first one got lost in notifications :)

zucchini-nlp · Mar 27 '24

FYI @NielsRogge and @amyeroberts

ArthurZucker · Mar 30 '24

I believe we can start reviewing this now. I converted the weights and added them to my hub account temporarily so that we can run and test the model.

In the meantime, I will be waiting for @LinB203 to update the weights in the organization's hub, which I think is required before we merge the PR.

zucchini-nlp · Apr 08 '24

@zucchini-nlp Awesome work! The first thing to do before a full review is resolving the failing tests. Some of these are unrelated, and rebasing on main should resolve them. Looking at the CI, some of the failures are Video-LLaVA specific - let me know if you need any help addressing them!

amyeroberts · Apr 08 '24

Rebased on main and resolved conflicts. The only failing doctest seems to be because the 7B model cannot be loaded and run within 120 seconds, but I think we will leave it in anyway to show how Video-LLaVA works.

zucchini-nlp · Apr 08 '24

@zucchini-nlp You can exclude the model from running in the doc tests by adding it to slow_documentation_tests.txt.

Then, once the review is in a steady state and the PR is ready for merging, we can run the slow tests and the documentation tests to make sure everything is correct before merging.

amyeroberts · Apr 10 '24
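As an illustration (the file location and doc path below are assumptions, not confirmed in this thread), the entry for such an exclusion would presumably be the path of the affected documentation page, e.g.:

docs/source/en/model_doc/video_llava.md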

Hey @LinB203, can you let us know if you can upload the HF weights of VideoLlava to the organization? The model seems ready to be added to the library.

zucchini-nlp · May 08 '24

> Hey @LinB203, can you let us know if you can upload the HF weights of VideoLlava to the organization? The model seems ready to be added to the library.

Thanks for your great attention and for bringing the model into transformers. I wonder which organization you mean? I have uploaded the weights to my personal repo, is that ok?

LinB203 · May 09 '24

@LinB203 I mean the weights that can be loaded into the transformers model, i.e. the ones converted by the "convert_weights" script I added in this PR. I have already tested this and added the weights to my account on the hub, but it would be nice if they were under the "LanguageBind" org. If you need to keep the current version of the state dict, you can probably call the new model "LanguageBind/Video-LLaVA-7B-hf".

zucchini-nlp · May 09 '24

> @LinB203 I mean the weights that can be loaded into the transformers model, i.e. the ones converted by the "convert_weights" script I added in this PR. I have already tested this and added the weights to my account on the hub, but it would be nice if they were under the "LanguageBind" org. If you need to keep the current version of the state dict, you can probably call the new model "LanguageBind/Video-LLaVA-7B-hf".

I see. I will create the LanguageBind/Video-LLaVA-7B-hf repo and upload the weights newly converted with your script tonight. Thank you very much.

LinB203 · May 09 '24

> @LinB203 I mean the weights that can be loaded into the transformers model, i.e. the ones converted by the "convert_weights" script I added in this PR. I have already tested this and added the weights to my account on the hub, but it would be nice if they were under the "LanguageBind" org. If you need to keep the current version of the state dict, you can probably call the new model "LanguageBind/Video-LLaVA-7B-hf".

Finished. https://huggingface.co/LanguageBind/Video-LLaVA-7B-hf/tree/main

LinB203 · May 09 '24
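For anyone following along, a minimal usage sketch against the uploaded checkpoint. It mirrors the draft visual_inputs API from the snippet at the top of the thread, so the final merged API may differ.

import requests
from PIL import Image
from transformers import VideoLlavaForConditionalGeneration, VideoLlavaProcessor

model_id = "LanguageBind/Video-LLaVA-7B-hf"
model = VideoLlavaForConditionalGeneration.from_pretrained(model_id)
processor = VideoLlavaProcessor.from_pretrained(model_id)

# Single-image example, reusing the COCO image from the first snippet.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
prompt = "USER: <image>How many cats are there in the image? ASSISTANT:"

inputs = processor(text=[prompt], visual_inputs=[image], padding=True, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=40)
print(processor.batch_decode(out, skip_special_tokens=True))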

@amyeroberts done! The last commit should trigger the slow tests after your approval.

A note about the failing code check: it's failing because of the ARCHIVE_MAP, which was apparently removed for all models, so I didn't add it for VideoLlava.

zucchini-nlp · May 10 '24

@zucchini-nlp Great! :D Could you rebase to include the upstream changes like the ARCHIVE_MAP removal? This should make everything green and confirm that it's just that triggering the errors.

amyeroberts · May 13 '24

The PR passed all the tests, and the slow tests are passing for me locally. Should be good to go.

zucchini-nlp · May 15 '24

@zucchini-nlp Great - let's merge! Do you have permission to do so? I can merge if not.

amyeroberts · May 15 '24

Just fixed one slow test, will merge when I get all green.

zucchini-nlp · May 15 '24