transformers
Add Video Llava
What does this PR do?
Fixes #29640. Adds a new model, Video Llava, to the library.
This is a draft PR, will add more here later.
The model now accepts both video and image as input in the same batch. Each visual has its own special token ("<image>" or "<video>"), as shown in the prompts in the example below.
import torch
import numpy as np
import requests
from PIL import Image
from decord import VideoReader
from huggingface_hub import hf_hub_download
from transformers import VideoLlavaForConditionalGeneration, VideoLlavaProcessor
def load_video_tensor(video_path, n_frms=4, transform=False):
    # Decode the video, resizing every frame to 224x224
    vr = VideoReader(uri=video_path, height=224, width=224)
    # Sample at most n_frms frames, spread uniformly over the whole clip
    n_frms = min(n_frms, len(vr))
    indices = np.arange(0, len(vr), len(vr) / n_frms).astype(int)
    frames = vr.get_batch(indices).asnumpy()  # (n_frms, 224, 224, 3) uint8 array
    return frames
# -------------------------------------------------------------------------------------------------------------------
model = VideoLlavaForConditionalGeneration.from_pretrained("/home/raushan/video_llava/")
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
video_path = hf_hub_download(repo_id="raushan-testing-hf/videos-test", filename="sample_demo_1.mp4", repo_type="dataset")
clip = load_video_tensor(video_path, n_frms=8)
processor = VideoLlavaProcessor.from_pretrained("/home/raushan/video_llava")
prompt_vid = "USER: <video>What do you see? ASSISTANT: A baby with a remote control. USER: <video>Why is this funny? ASSISTANT:"
prompt_img = "USER: <image>How many cats are there in the image? ASSISTANT:"
prompt_mix = "USER: <image>How many cats are there in the image? ASSISTANT: 2. USER: <video>Why is this video funny? ASSISTANT:"
inputs = processor(text=[prompt_mix, prompt_img], visual_inputs=[image, clip, image], padding=True, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=40)
print(processor.batch_decode(out, skip_special_tokens=True, clean_up_tokenization_spaces=True))
# [
# 'USER: How many cats are there in the image? ASSISTANT: 2. USER: Why is this video funny? ASSISTANT: The video is funny because the baby is sitting on
# the bed and reading a book, which is an unusual and amusing sight. Babies are typically not known for reading books, and the fact',
# 'USER: How many cats are there in the image? ASSISTANT: There are two cats in the image..'
# ]
@LinB203 hey! As we talked before, here is a draft PR of Video Llava. I checked that the modeling part runs without errors and generates outputs similar to the original repo.
To update the model files on the hub, you can use the convert_weights script and use this branch to test that the model loads correctly. Whenever you are available, can you look through and check if I missed anything important? :)
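For reference, a minimal loading check could look like the sketch below; the local checkpoint path is hypothetical and stands for whatever folder the conversion script writes to.
import torch
from transformers import VideoLlavaForConditionalGeneration, VideoLlavaProcessor

# Hypothetical path: whichever folder the conversion script wrote the HF-format checkpoint to
checkpoint = "./video_llava_converted"

model = VideoLlavaForConditionalGeneration.from_pretrained(checkpoint, torch_dtype=torch.float16)
processor = VideoLlavaProcessor.from_pretrained(checkpoint)

# If both load without errors, the conversion produced a usable checkpoint on this branch
print(type(model).__name__, type(processor).__name__)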
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
@LinB203 pinging in case the first one got lost in notifications :)
FYI @NielsRogge and @amyeroberts
I believe we can start reviewing this now. I converted weights and added them to my hub account temporarily, so that we can run and test the model.
In the meantime, I will be waiting for @LinB203 to update the weights in the organization hub, which I think is required before we merge the PR.
@zucchini-nlp Awesome work! First thing to do before a full review is resolving the failing tests. Some of these are unrelated, and rebasing on main should resolve them. Looking at the CI, some of the failures are Video Llava specific - let me know if you need any help addressing them!
Rebased on main and resolved conflicts. The only failing doctest seems to be that it cannot load and run the 7B model within 120 seconds, but I think we should leave the example anyway to show how Video-Llava works.
@zucchini-nlp You can exclude the model from running in the doc tests by adding it to slow_documentation_tests.txt.
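For reference, that file takes one documentation path per line; assuming it sits at utils/slow_documentation_tests.txt and the model doc follows the usual naming, the new entry would be something like:
docs/source/en/model_doc/video_llava.md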
Then, once the PR has been reviewed, is in a steady state, and is ready for merging, we can run the slow tests and the documentation tests to make sure everything is correct before we merge.
Hey @LinB203, can you let us know if you can upload HF weights of VideoLlava to the organization? The model seems ready to be added to the library.
Thanks for your great attention and for bringing the model into transformers. I wonder which organization you mean? I have uploaded the weights to my personal repo, is that ok?
@LinB203 I mean the weights that can be loaded into the transformers model, converted by the "convert_weights" script I added in this PR. I already have it tested and added the weights to my account on the hub, but it would be nice if they were under the "LanguageBind" org. If you need to keep the current version of the state dict, you could probably call the new model "LanguageBind/Video-LLaVA-7B-hf".
I see. I will make the LanguageBind/Video-LLaVA-7B-hf repo and upload the new converted weights using your script tonight. Thank you very much.
Finished. https://huggingface.co/LanguageBind/Video-LLaVA-7B-hf/tree/main
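For reference, the earlier demo script can now point at the uploaded repo instead of a local path; a minimal sketch, assuming the uploaded checkpoint matches what the conversion script in this PR produces:
from transformers import VideoLlavaForConditionalGeneration, VideoLlavaProcessor

# Swap the local paths used earlier for the uploaded repo id
model = VideoLlavaForConditionalGeneration.from_pretrained("LanguageBind/Video-LLaVA-7B-hf")
processor = VideoLlavaProcessor.from_pretrained("LanguageBind/Video-LLaVA-7B-hf")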
@amyeroberts done! The last commit should trigger the slow tests after your approval.
A note about the failing code check: it's failing because of the ARCHIVE_MAP, which apparently was removed for all models, so I didn't add it for VideoLlava.
@zucchini-nlp Great! :D Could you rebase to include the upstream changes like the ARCHIVE_MAP removal? This should make everything green and ensure it's just that triggering the errors
The PR passed all the tests, and the slow tests are passing for me locally. Should be good to go.
@zucchini-nlp Great - let's merge! Do you have permission to do so? I can merge if not.
Just fixed one slow test, will merge when I get all green.