
LLaVa-NeXT-Video is added to 🤗 Transformers!

Open zucchini-nlp opened this issue 1 year ago • 32 comments

Hey all!

The video models are all supported in Transformers now and will be part of the v4.42 release. Feel free to check out the model checkpoints here.

To get the model, update transformers by running: !pip install --upgrade git+https://github.com/huggingface/transformers.git. Inference with videos can be done as follows:

import av
import numpy as np
from transformers import LlavaNextVideoForConditionalGeneration, LlavaNextVideoProcessor

def read_video_pyav(container, indices):
    frames = []
    container.seek(0)
    start_index = indices[0]
    end_index = indices[-1]
    for i, frame in enumerate(container.decode(video=0)):
        if i > end_index:
            break
        if i >= start_index and i in indices:
            frames.append(frame)
    return np.stack([x.to_ndarray(format="rgb24") for x in frames])


processor = LlavaNextVideoProcessor.from_pretrained("llava-hf/LLaVA-NeXT-Video-7B-hf")
model = LlavaNextVideoForConditionalGeneration.from_pretrained("llava-hf/LLaVA-NeXT-Video-7B-hf",  device_map='auto')

video_path = "YOUR-LOCAL-VIDEO-PATH
container = av.open(video_path)

# sample uniformly 8 or more frames, depending on length of video
total_frames = container.streams.video[0].frames
indices = np.arange(0, total_frames, total_frames / 8).astype(int)
clip = read_video_pyav(container, indices)

# Prepare a chat formatted input
# Each "content" is a list of dicts and you can add image/video/text modalities
conversation = [
      {
          "role": "user",
          "content": [
              {"type": "text", "text": "What do you see in this video?"},
              {"type": "video"},
              ],
      },
]

prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(text=prompt, videos=clip, return_tensors="pt")

# Generate
generate_ids = model.generate(**inputs, max_new_tokens=50, do_sample=True, temperature=0.9)
print(processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0])

Useful links: Colab for inference · Colab for fine-tuning · Transformers docs

zucchini-nlp avatar Jun 27 '24 11:06 zucchini-nlp

Cool! Thanks a lot!!!

ZhangYuanhan-AI avatar Jun 27 '24 12:06 ZhangYuanhan-AI

Does this also support the latest interleaved version?

CS123n avatar Jul 02 '24 02:07 CS123n

How to implement batch inference?

HarperGG avatar Jul 04 '24 08:07 HarperGG

Hi, thanks for your excellent work! I'm wondering why, when I use 32 frames per video, the model begins to randomly output irrelevant answers, but when I use fewer (e.g. 28, 24, 16, or 8) I get normal answers. This model should handle up to 32 frames, right?

zhengrongz avatar Jul 07 '24 08:07 zhengrongz

@HarperGG please take a look at the Colab notebook for inference; there are code snippets for batch inference near the end.
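In short, batching boils down to passing lists of prompts and clips with padding enabled; a minimal sketch (the second conversation/clip are hypothetical placeholders):

processor.tokenizer.padding_side = "left"  # left-pad so generation starts right after each prompt
prompts = [
    processor.apply_chat_template(conv, add_generation_prompt=True)
    for conv in [conversation, conversation_2]  # conversation_2 is a hypothetical second chat
]
inputs = processor(text=prompts, videos=[clip, clip_2], padding=True, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=50)
print(processor.batch_decode(out, skip_special_tokens=True))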

@zhengrongz yes, the config was missing the rope scaling factor. It should work now; if you need an even longer context / more frames, consider increasing the rope scaling factor.
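If you really need to push past that, one way to raise the factor is to tweak the config before loading (a sketch; the exact rope_scaling layout can differ between transformers versions, and the factor value here is only an example):

from transformers import AutoConfig, LlavaNextVideoForConditionalGeneration

config = AutoConfig.from_pretrained("llava-hf/LLaVA-NeXT-Video-7B-hf")
config.text_config.rope_scaling = {"type": "linear", "factor": 4.0}  # example value, not a recommendation
model = LlavaNextVideoForConditionalGeneration.from_pretrained(
    "llava-hf/LLaVA-NeXT-Video-7B-hf", config=config, device_map="auto"
)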

@CS123n no, currently it's not available in transformers

zucchini-nlp avatar Jul 08 '24 08:07 zucchini-nlp

That's OK! Thank you! I have another question, maybe a little stupid: if I only want to get the answer after "ASSISTANT:", what should I change? (That is, so the final output does not carry the USER part and my question.) I'd rather not filter the output with regular expressions again if I can avoid it.

zhengrongz avatar Jul 08 '24 08:07 zhengrongz

@zhengrongz you can strip off the prompt text based on input length similar to below:

inputs = processor(text=prompt, videos=clip, return_tensors="pt")
input_length = inputs.input_ids.shape[-1]
output = model.generate(**inputs)
output_stripped = output[:, input_length:]  # strip off 'input_length' prompt tokens from the beginning
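To turn the stripped IDs back into text you can decode them the same way (a small follow-up sketch reusing the objects above):

answer = processor.batch_decode(output_stripped, skip_special_tokens=True)[0]
print(answer)  # only the text generated after "ASSISTANT:"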

zucchini-nlp avatar Jul 08 '24 08:07 zucchini-nlp

I know! Thanks!

zhengrongz avatar Jul 08 '24 08:07 zhengrongz

@zucchini-nlp Thank you for your awesome work! I want to ask if the finetuning code works with image input as well? I have made some attempts, but it seems there is a mismatch in the sizes of the tensors. It would be great if you could point out some hints. Thanks!

Namzakku avatar Jul 17 '24 09:07 Namzakku

@Namzakku hey! yes, it should work with any inputs actually. Can you show the error you encountered and the minimal code to reproduce the error?

zucchini-nlp avatar Jul 17 '24 09:07 zucchini-nlp

@zucchini-nlp Thanks for the quick reply!

As the read_video_xxx function does not work with images, I switched to using PIL as below in collate_fn():


def collate_fn(example, path):
    image_file = example['image_path']
    image = Image.open(os.path.join(path, image_file))

    conversation = example['conversations']

    prompt = processor.apply_chat_template(conversation, add_generation_prompt=False)

    batch = processor(
        text=prompt,
        images=image,
        truncation=True,
        max_length=MAX_LENGTH,
        return_tensors="pt"
    )
    print(example['image_path'])
    return batch

Then I changed "pixel_values_video" in the LlavaNextVideoDataCollatorWithPadding class to "pixel_values", kept everything else the same, ran the trainer, and got the error below:


RuntimeError                              Traceback (most recent call last)
Cell In[21], line 1
----> 1 trainer.train()

File /usr/local/lib/python3.10/dist-packages/transformers/trainer.py:1932, in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
   1930         hf_hub_utils.enable_progress_bars()
   1931 else:
-> 1932     return inner_training_loop(
   1933         args=args,
   1934         resume_from_checkpoint=resume_from_checkpoint,
   1935         trial=trial,
   1936         ignore_keys_for_eval=ignore_keys_for_eval,
   1937     )

File /usr/local/lib/python3.10/dist-packages/transformers/trainer.py:2230, in Trainer._inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
   2227     rng_to_sync = True
   2229 step = -1
-> 2230 for step, inputs in enumerate(epoch_iterator):
   2231     total_batched_samples += 1
   2233     if self.args.include_num_input_tokens_seen:

File /usr/local/lib/python3.10/dist-packages/accelerate/data_loader.py:454, in DataLoaderShard.__iter__(self)
    452 # We iterate one batch ahead to check when we are at the end
    453 try:
--> 454     current_batch = next(dataloader_iter)
    455 except StopIteration:
    456     yield

File /usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py:630, in _BaseDataLoaderIter.__next__(self)
    627 if self._sampler_iter is None:
    628     # TODO(https://github.com/pytorch/pytorch/issues/76750)
    629     self._reset()  # type: ignore[call-arg]
--> 630 data = self._next_data()
    631 self._num_yielded += 1
    632 if self._dataset_kind == _DatasetKind.Iterable and \
    633         self._IterableDataset_len_called is not None and \
    634         self._num_yielded > self._IterableDataset_len_called:

File /usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py:1345, in _MultiProcessingDataLoaderIter._next_data(self)
   1343 else:
   1344     del self._task_info[idx]
-> 1345     return self._process_data(data)

File /usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py:1371, in _MultiProcessingDataLoaderIter._process_data(self, data)
   1369 self._try_put_index()
   1370 if isinstance(data, ExceptionWrapper):
-> 1371     data.reraise()
   1372 return data

File /usr/local/lib/python3.10/dist-packages/torch/_utils.py:694, in ExceptionWrapper.reraise(self)
    690 except TypeError:
    691     # If the exception takes multiple arguments, don't try to
    692     # instantiate since we don't know how to
    693     raise RuntimeError(msg) from None
--> 694 raise exception

RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/fetch.py", line 54, in fetch
    return self.collate_fn(data)
  File "/tmp/ipykernel_1223/1722779256.py", line 18, in __call__
    padded_inputs["pixel_values"] = torch.cat([feat['pixel_values'] for feat in features], dim=0)
RuntimeError: Sizes of tensors must match except in dimension 0. Expected size 4 but got size 5 for tensor number 2 in the list.

Namzakku avatar Jul 17 '24 10:07 Namzakku

Ah I see, I forgot that llava-next processes images in patches, and each image can contain a different number of patches, unlike videos where the number of frames is fixed.

A solution would be to get rid of the PaddingCollator and instead use something like a CollateForModel that takes the image path and text and processes them into tensors on the fly while training. The reason I opted for two collators is that saving videos and then loading them on the fly is slow, since they require more memory than images.

Also take a look at this notebook on LLaVA (https://github.com/NielsRogge/Transformers-Tutorials/blob/master/LLaVa/Fine_tune_LLaVa_on_a_custom_dataset_(with_PyTorch_Lightning).ipynb). It can easily be adapted for llava-next (in the image-text case) by changing a few input args.
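For reference, a rough sketch of such an on-the-fly collator for the image-text case (the class name and the "image_path"/"conversations" fields mirror the snippets above; treat it as a starting point, not a drop-in solution):

from PIL import Image

class CollateForModel:
    """Process raw (image_path, conversation) pairs into tensors on the fly while training."""

    def __init__(self, processor, max_length):
        self.processor = processor
        self.max_length = max_length

    def __call__(self, examples):
        images, texts = [], []
        for example in examples:
            # adjust the path handling to your dataset layout
            images.append(Image.open(example["image_path"]))
            texts.append(self.processor.apply_chat_template(example["conversations"]))
        batch = self.processor(
            text=texts,
            images=images,
            padding=True,
            truncation=True,
            max_length=self.max_length,
            return_tensors="pt",
        )
        # mask out padding so it does not contribute to the loss
        labels = batch["input_ids"].clone()
        labels[labels == self.processor.tokenizer.pad_token_id] = -100
        batch["labels"] = labels
        return batch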

zucchini-nlp avatar Jul 17 '24 12:07 zucchini-nlp

Thanks for the advice! I will look further into the above notebook!

Namzakku avatar Jul 17 '24 13:07 Namzakku

@Namzakku Hello, in case there is still interest, I've put up a codebase https://github.com/zjysteven/lmms-finetune that supports finetuning of various types of LMMs, including llava-next-interleave and llava-next-video. It took inspiration from the finetuning notebook above by the Hugging Face staff, but differs from it in a few ways, including that 1) it uses the Hugging Face Trainer and 2) it sets labels to only the model responses rather than both user prompts + model responses (as the notebook does). Feel free to try it out; I would appreciate any comments/feedback.

zjysteven avatar Jul 20 '24 20:07 zjysteven

Hey, can anyone point me to somewhere that specifies the recommended infrastructure to deploy/run this model?

E.g. how much VRAM, how many CPU cores are needed, etc.

RukshanJS avatar Aug 02 '24 04:08 RukshanJS

Hey, running this code:

import av
import torch
import numpy as np
from transformers import LlavaNextVideoForConditionalGeneration, LlavaNextVideoProcessor


def read_video_pyav(container, indices):
    """
    Decode the video with PyAV decoder.
    Args:
        container (`av.container.input.InputContainer`): PyAV container.
        indices (`List[int]`): List of frame indices to decode.
    Returns:
        result (np.ndarray): np array of decoded frames of shape (num_frames, height, width, 3).
    """
    frames = []
    container.seek(0)
    start_index = indices[0]
    end_index = indices[-1]
    for i, frame in enumerate(container.decode(video=0)):
        if i > end_index:
            break
        if i >= start_index and i in indices:
            frames.append(frame)
    return np.stack([x.to_ndarray(format="rgb24") for x in frames])


# Load the model in half-precision
model = LlavaNextVideoForConditionalGeneration.from_pretrained(
    "llava-hf/LLaVA-NeXT-Video-7B-hf",
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",
).to(0)

processor = LlavaNextVideoProcessor.from_pretrained("llava-hf/LLaVA-NeXT-Video-7B-hf")

# Load the video as an np.array, sampling uniformly 8 frames (can sample more for longer videos)
video_path = "/mnt/nfs/decart/amir/tarsier/429916.mp4"
container = av.open(video_path)
total_frames = container.streams.video[0].frames
indices = np.arange(0, total_frames, total_frames / 8).astype(int)
video = read_video_pyav(container, indices)

conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Please provide a detailed description of the video, focusing on the main subjects, their actions, the background scenes.",
            },
            {"type": "video"},
        ],
    },
]

prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(text=prompt, videos=video, return_tensors="pt").to(0)

out = model.generate(**inputs, max_new_tokens=60)
processor.batch_decode(out, skip_special_tokens=True, clean_up_tokenization_spaces=True)

print(out)

I get this error:

Unrecognized keys in `rope_scaling` for 'rope_type'='linear': {'type'}
Loading checkpoint shards: 100%|██████████| 3/3 [00:06<00:00,  2.01s/it]
Chat templates should be in a 'chat_template.json' file but found key='chat_template' in the processor's config. Make sure to move your template to its own file.
/mnt/nfs/miniconda3/envs/share4video/lib/python3.10/site-packages/transformers/feature_extraction_utils.py:142: UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow. Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor. (Triggered internally at ../torch/csrc/utils/tensor_new.cpp:261.)
  return torch.tensor(value)
The `seen_tokens` attribute is deprecated and will be removed in v4.41. Use the `cache_position` model input instead.
Traceback (most recent call last):
  File "/mnt/nfs/LLaVA-NeXT/llavanext.py", line 60, in <module>
    out = model.generate(**inputs, max_new_tokens=60)
  File "/mnt/nfs/miniconda3/envs/share4video/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/mnt/nfs/miniconda3/envs/share4video/lib/python3.10/site-packages/transformers/generation/utils.py", line 2024, in generate
    result = self._sample(
  File "/mnt/nfs/miniconda3/envs/share4video/lib/python3.10/site-packages/transformers/generation/utils.py", line 2982, in _sample
    outputs = self(**model_inputs, return_dict=True)
  File "/mnt/nfs/decart/miniconda3/envs/share4video-amir/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/mnt/nfs/miniconda3/envs/share4video/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/mnt/nfs/miniconda3/envs/share4video/lib/python3.10/site-packages/transformers/models/llava_next_video/modeling_llava_next_video.py", line 915, in forward
    first_layer_past_key_value = past_key_values[0][0][:, :, :, 0]
  File "/mnt/nfs/miniconda3/envs/share4video/lib/python3.10/site-packages/transformers/cache_utils.py", line 334, in __getitem__
    raise KeyError(f"Cache only has {len(self)} layers, attempted to access layer with index {layer_idx}")
KeyError: 'Cache only has 0 layers, attempted to access layer with index 0'

ameeramer avatar Aug 08 '24 09:08 ameeramer

@ameeramer right, in the last release we made some changes in the backbone LLM which caused errors for LLaVA-NeXT-Video. I made a PR to fix it and will probably do a patch release today; after that you can install transformers from source :)
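Until the patch release is out, installing from source is the same command as earlier in this thread:

!pip install --upgrade git+https://github.com/huggingface/transformers.git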

zucchini-nlp avatar Aug 08 '24 12:08 zucchini-nlp

Hi @zucchini-nlp, I tried to train on multiple GPUs but keep receiving the error below. Do you have any idea how to solve it?

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1!

This is what I got from model.hf_device_map.

{'image_newline': 0,
 'vision_tower': 0,
 'multi_modal_projector': 0,
 'language_model.model.embed_tokens': 0,
 'language_model.model.layers.0': 0,
 'language_model.model.layers.1': 0,
 'language_model.model.layers.2': 0,
 'language_model.model.layers.3': 0,
 'language_model.model.layers.4': 0,
 'language_model.model.layers.5': 0,
 'language_model.model.layers.6': 0,
 'language_model.model.layers.7': 0,
 'language_model.model.layers.8': 0,
 'language_model.model.layers.9': 0,
 'language_model.model.layers.10': 0,
 'language_model.model.layers.11': 0,
 'language_model.model.layers.12': 1,
 'language_model.model.layers.13': 1,
 'language_model.model.layers.14': 1,
 'language_model.model.layers.15': 1,
 'language_model.model.layers.16': 1,
 'language_model.model.layers.17': 1,
 'language_model.model.layers.18': 1,
 'language_model.model.layers.19': 1,
 'language_model.model.layers.20': 1,
 'language_model.model.layers.21': 1,
 'language_model.model.layers.22': 1,
 'language_model.model.layers.23': 1,
 'language_model.model.layers.24': 1,
 'language_model.model.layers.25': 1,
 'language_model.model.layers.26': 1,
 'language_model.model.layers.27': 1,
 'language_model.model.layers.28': 1,
 'language_model.model.layers.29': 1,
 'language_model.model.layers.30': 1,
 'language_model.model.layers.31': 1,
 'language_model.model.norm': 1,
 'language_model.lm_head': 1}

Namzakku avatar Aug 14 '24 02:08 Namzakku

@Namzakku Hmm, from the device map there doesn't seem to be anything that would cause device mismatch errors. Can you share the full traceback so I can see exactly where the tensors end up on different devices, along with a minimal reproducer?

zucchini-nlp avatar Aug 14 '24 05:08 zucchini-nlp

@zucchini-nlp Thanks for the response! This is the full traceback

/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:429: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:61: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
  warnings.warn(
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[32], line 1
----> 1 trainer.train()

File /usr/local/lib/python3.10/dist-packages/transformers/trainer.py:1932, in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
   1930         hf_hub_utils.enable_progress_bars()
   1931 else:
-> 1932     return inner_training_loop(
   1933         args=args,
   1934         resume_from_checkpoint=resume_from_checkpoint,
   1935         trial=trial,
   1936         ignore_keys_for_eval=ignore_keys_for_eval,
   1937     )

File /usr/local/lib/python3.10/dist-packages/transformers/trainer.py:2268, in Trainer._inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
   2265     self.control = self.callback_handler.on_step_begin(args, self.state, self.control)
   2267 with self.accelerator.accumulate(model):
-> 2268     tr_loss_step = self.training_step(model, inputs)
   2270 if (
   2271     args.logging_nan_inf_filter
   2272     and not is_torch_xla_available()
   2273     and (torch.isnan(tr_loss_step) or torch.isinf(tr_loss_step))
   2274 ):
   2275     # if loss is nan or inf simply add the average of previous logged losses
   2276     tr_loss += tr_loss / (1 + self.state.global_step - self._globalstep_last_logged)

File /usr/local/lib/python3.10/dist-packages/transformers/trainer.py:3307, in Trainer.training_step(self, model, inputs)
   3304     return loss_mb.reduce_mean().detach().to(self.args.device)
   3306 with self.compute_loss_context_manager():
-> 3307     loss = self.compute_loss(model, inputs)
   3309 del inputs
   3311 kwargs = {}

File /usr/local/lib/python3.10/dist-packages/transformers/trainer.py:3338, in Trainer.compute_loss(self, model, inputs, return_outputs)
   3336 else:
   3337     labels = None
-> 3338 outputs = model(**inputs)
   3339 # Save past state if it exists
   3340 # TODO: this needs to be fixed and made cleaner later.
   3341 if self.args.past_index >= 0:

File /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1518, in Module._wrapped_call_impl(self, *args, **kwargs)
   1516     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1517 else:
-> 1518     return self._call_impl(*args, **kwargs)

File /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1527, in Module._call_impl(self, *args, **kwargs)
   1522 # If we don't have any hooks, we want to skip the rest of the logic in
   1523 # this function, and just call forward.
   1524 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1525         or _global_backward_pre_hooks or _global_backward_hooks
   1526         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1527     return forward_call(*args, **kwargs)
   1529 try:
   1530     result = None

File /usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py:825, in convert_outputs_to_fp32.<locals>.forward(*args, **kwargs)
    824 def forward(*args, **kwargs):
--> 825     return model_forward(*args, **kwargs)

File /usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py:813, in ConvertOutputsToFp32.__call__(self, *args, **kwargs)
    812 def __call__(self, *args, **kwargs):
--> 813     return convert_to_fp32(self.model_forward(*args, **kwargs))

File /usr/local/lib/python3.10/dist-packages/torch/amp/autocast_mode.py:16, in autocast_decorator.<locals>.decorate_autocast(*args, **kwargs)
     13 @functools.wraps(func)
     14 def decorate_autocast(*args, **kwargs):
     15     with autocast_instance:
---> 16         return func(*args, **kwargs)

File /usr/local/lib/python3.10/dist-packages/peft/peft_model.py:762, in PeftModel.forward(self, *args, **kwargs)
    760 with self._enable_peft_forward_hooks(*args, **kwargs):
    761     kwargs = {k: v for k, v in kwargs.items() if k not in self.special_peft_forward_args}
--> 762     return self.get_base_model()(*args, **kwargs)

File /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1518, in Module._wrapped_call_impl(self, *args, **kwargs)
   1516     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1517 else:
-> 1518     return self._call_impl(*args, **kwargs)

File /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1527, in Module._call_impl(self, *args, **kwargs)
   1522 # If we don't have any hooks, we want to skip the rest of the logic in
   1523 # this function, and just call forward.
   1524 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1525         or _global_backward_pre_hooks or _global_backward_hooks
   1526         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1527     return forward_call(*args, **kwargs)
   1529 try:
   1530     result = None

File /usr/local/lib/python3.10/dist-packages/accelerate/hooks.py:166, in add_hook_to_module.<locals>.new_forward(module, *args, **kwargs)
    164         output = module._old_forward(*args, **kwargs)
    165 else:
--> 166     output = module._old_forward(*args, **kwargs)
    167 return module._hf_hook.post_forward(module, output)

File /usr/local/lib/python3.10/dist-packages/transformers/models/llava_next/modeling_llava_next.py:855, in LlavaNextForConditionalGeneration.forward(self, input_ids, pixel_values, image_sizes, attention_mask, position_ids, past_key_values, inputs_embeds, vision_feature_layer, vision_feature_select_strategy, labels, use_cache, output_attentions, output_hidden_states, return_dict)
    851         attention_mask = torch.cat((extended_attention_mask, attention_mask[:, -target_length:]), dim=1)
    853         position_ids = torch.sum(attention_mask, dim=1).unsqueeze(-1) - 1
--> 855 outputs = self.language_model(
    856     attention_mask=attention_mask,
    857     position_ids=position_ids,
    858     past_key_values=past_key_values,
    859     inputs_embeds=inputs_embeds,
    860     use_cache=use_cache,
    861     output_attentions=output_attentions,
    862     output_hidden_states=output_hidden_states,
    863     return_dict=return_dict,
    864 )
    866 logits = outputs[0]
    868 loss = None

File /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1518, in Module._wrapped_call_impl(self, *args, **kwargs)
   1516     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1517 else:
-> 1518     return self._call_impl(*args, **kwargs)

File /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1527, in Module._call_impl(self, *args, **kwargs)
   1522 # If we don't have any hooks, we want to skip the rest of the logic in
   1523 # this function, and just call forward.
   1524 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1525         or _global_backward_pre_hooks or _global_backward_hooks
   1526         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1527     return forward_call(*args, **kwargs)
   1529 try:
   1530     result = None

File /usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py:1174, in LlamaForCausalLM.forward(self, input_ids, attention_mask, position_ids, past_key_values, inputs_embeds, labels, use_cache, output_attentions, output_hidden_states, return_dict, cache_position)
   1171 return_dict = return_dict if return_dict is not None else self.config.use_return_dict
   1173 # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn)
-> 1174 outputs = self.model(
   1175     input_ids=input_ids,
   1176     attention_mask=attention_mask,
   1177     position_ids=position_ids,
   1178     past_key_values=past_key_values,
   1179     inputs_embeds=inputs_embeds,
   1180     use_cache=use_cache,
   1181     output_attentions=output_attentions,
   1182     output_hidden_states=output_hidden_states,
   1183     return_dict=return_dict,
   1184     cache_position=cache_position,
   1185 )
   1187 hidden_states = outputs[0]
   1188 if self.config.pretraining_tp > 1:

File /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1518, in Module._wrapped_call_impl(self, *args, **kwargs)
   1516     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1517 else:
-> 1518     return self._call_impl(*args, **kwargs)

File /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1527, in Module._call_impl(self, *args, **kwargs)
   1522 # If we don't have any hooks, we want to skip the rest of the logic in
   1523 # this function, and just call forward.
   1524 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1525         or _global_backward_pre_hooks or _global_backward_hooks
   1526         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1527     return forward_call(*args, **kwargs)
   1529 try:
   1530     result = None

File /usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py:967, in LlamaModel.forward(self, input_ids, attention_mask, position_ids, past_key_values, inputs_embeds, use_cache, output_attentions, output_hidden_states, return_dict, cache_position)
    964     all_hidden_states += (hidden_states,)
    966 if self.gradient_checkpointing and self.training:
--> 967     layer_outputs = self._gradient_checkpointing_func(
    968         decoder_layer.__call__,
    969         hidden_states,
    970         causal_mask,
    971         position_ids,
    972         past_key_values,
    973         output_attentions,
    974         use_cache,
    975         cache_position,
    976     )
    977 else:
    978     layer_outputs = decoder_layer(
    979         hidden_states,
    980         attention_mask=causal_mask,
   (...)
    985         cache_position=cache_position,
    986     )

File /usr/local/lib/python3.10/dist-packages/torch/_compile.py:24, in _disable_dynamo.<locals>.inner(*args, **kwargs)
     20 @functools.wraps(fn)
     21 def inner(*args, **kwargs):
     22     import torch._dynamo
---> 24     return torch._dynamo.disable(fn, recursive)(*args, **kwargs)

File /usr/local/lib/python3.10/dist-packages/torch/_dynamo/eval_frame.py:328, in _TorchDynamoContext.__call__.<locals>._fn(*args, **kwargs)
    326 dynamic_ctx.__enter__()
    327 try:
--> 328     return fn(*args, **kwargs)
    329 finally:
    330     set_eval_frame(prior)

File /usr/local/lib/python3.10/dist-packages/torch/_dynamo/external_utils.py:17, in wrap_inline.<locals>.inner(*args, **kwargs)
     15 @functools.wraps(fn)
     16 def inner(*args, **kwargs):
---> 17     return fn(*args, **kwargs)

File /usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:451, in checkpoint(function, use_reentrant, context_fn, determinism_check, debug, *args, **kwargs)
    446     if context_fn is not noop_context_fn or debug is not False:
    447         raise ValueError(
    448             "Passing `context_fn` or `debug` is only supported when "
    449             "use_reentrant=False."
    450         )
--> 451     return CheckpointFunction.apply(function, preserve, *args)
    452 else:
    453     gen = _checkpoint_without_reentrant_generator(
    454         function, preserve, context_fn, determinism_check, debug, *args, **kwargs
    455     )

File /usr/local/lib/python3.10/dist-packages/torch/autograd/function.py:539, in Function.apply(cls, *args, **kwargs)
    536 if not torch._C._are_functorch_transforms_active():
    537     # See NOTE: [functorch vjp and autograd interaction]
    538     args = _functorch.utils.unwrap_dead_wrappers(args)
--> 539     return super().apply(*args, **kwargs)  # type: ignore[misc]
    541 if cls.setup_context == _SingleLevelFunction.setup_context:
    542     raise RuntimeError(
    543         "In order to use an autograd.Function with functorch transforms "
    544         "(vmap, grad, jvp, jacrev, ...), it must override the setup_context "
    545         "staticmethod. For more details, please see "
    546         "https://pytorch.org/docs/master/notes/extending.func.html"
    547     )

File /usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:230, in CheckpointFunction.forward(ctx, run_function, preserve_rng_state, *args)
    227 ctx.save_for_backward(*tensor_inputs)
    229 with torch.no_grad():
--> 230     outputs = run_function(*args)
    231 return outputs

File /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1518, in Module._wrapped_call_impl(self, *args, **kwargs)
   1516     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1517 else:
-> 1518     return self._call_impl(*args, **kwargs)

File /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1527, in Module._call_impl(self, *args, **kwargs)
   1522 # If we don't have any hooks, we want to skip the rest of the logic in
   1523 # this function, and just call forward.
   1524 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1525         or _global_backward_pre_hooks or _global_backward_hooks
   1526         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1527     return forward_call(*args, **kwargs)
   1529 try:
   1530     result = None

File /usr/local/lib/python3.10/dist-packages/accelerate/hooks.py:166, in add_hook_to_module.<locals>.new_forward(module, *args, **kwargs)
    164         output = module._old_forward(*args, **kwargs)
    165 else:
--> 166     output = module._old_forward(*args, **kwargs)
    167 return module._hf_hook.post_forward(module, output)

File /usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py:715, in LlamaDecoderLayer.forward(self, hidden_states, attention_mask, position_ids, past_key_value, output_attentions, use_cache, cache_position, **kwargs)
    694 """
    695 Args:
    696     hidden_states (`torch.FloatTensor`): input to the layer of shape `(batch, seq_len, embed_dim)`
   (...)
    711         into the model
    712 """
    713 residual = hidden_states
--> 715 hidden_states = self.input_layernorm(hidden_states)
    717 # Self Attention
    718 hidden_states, self_attn_weights, present_key_value = self.self_attn(
    719     hidden_states=hidden_states,
    720     attention_mask=attention_mask,
   (...)
    725     cache_position=cache_position,
    726 )

File /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1518, in Module._wrapped_call_impl(self, *args, **kwargs)
   1516     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1517 else:
-> 1518     return self._call_impl(*args, **kwargs)

File /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1527, in Module._call_impl(self, *args, **kwargs)
   1522 # If we don't have any hooks, we want to skip the rest of the logic in
   1523 # this function, and just call forward.
   1524 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1525         or _global_backward_pre_hooks or _global_backward_hooks
   1526         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1527     return forward_call(*args, **kwargs)
   1529 try:
   1530     result = None

File /usr/local/lib/python3.10/dist-packages/accelerate/hooks.py:166, in add_hook_to_module.<locals>.new_forward(module, *args, **kwargs)
    164         output = module._old_forward(*args, **kwargs)
    165 else:
--> 166     output = module._old_forward(*args, **kwargs)
    167 return module._hf_hook.post_forward(module, output)

File /usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py:88, in LlamaRMSNorm.forward(self, hidden_states)
     86 variance = hidden_states.pow(2).mean(-1, keepdim=True)
     87 hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
---> 88 return self.weight * hidden_states.to(input_dtype)

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1!

The code I used is mainly from the link below: https://github.com/NielsRogge/Transformers-Tutorials/blob/master/LLaVa/Fine_tune_LLaVa_on_a_custom_dataset_(with_PyTorch_Lightning).ipynb

but I modified it to use the Hugging Face Trainer instead of PyTorch Lightning.

The modification is mainly in LlavaNextDataset, and I also customized a train collator in order to support multi-turn conversations, like below.

class LlavaNextDataset(Dataset):
    """
    PyTorch Dataset for LLaVa-NeXT. This class takes a HuggingFace Dataset as input.

    Each row, consists of image path(png/jpg/jpeg) and ground truth data (json/jsonl/txt).
    """

    def __init__(
        self,
        dataset,
        split: str = "train",
    ):
        super().__init__()

        self.split = split

        self.dataset = dataset
        self.dataset_length = len(self.dataset)

    def __len__(self) -> int:
        return self.dataset_length

    def __getitem__(self, idx: int) -> Dict:
        """
        Returns one item of the dataset.

        Returns:
            image : the loaded input image
            conversations : the multi-turn conversation for this sample
        """
        sample = self.dataset[idx]

        # inputs
        image = Image.open(os.path.join(DATASET_PATH, sample["image_path"]))
        conversations = sample["conversations"]

        return image, conversations


class TrainCollate:
    def __init__(self, processor, max_length):
        self.processor = processor
        self.max_length = max_length

    def __call__(self, examples):
        images = []
        texts = []

        for example in examples:
            image, conversations = example
            images.append(image)

            text_prompt = self.processor.apply_chat_template(conversations)
            texts.append(text_prompt)

        batch = self.processor(
            text=texts, 
            images=images, 
            padding=True, 
            truncation=True, 
            max_length=self.max_length, 
            return_tensors="pt"
        )

        labels = batch["input_ids"].clone()
        labels[labels == self.processor.tokenizer.pad_token_id] = -100
        batch["labels"] = labels

        return batch

Other than that, how the model is loaded and how PEFT is applied are the same as in the notebook. Below are the training parameters.

args = TrainingArguments(

    # args related to training
    output_dir = OUTPUT_DIR,
    eval_strategy = 'steps',
    eval_steps=100,
    per_device_train_batch_size = BATCH_SIZE,
    per_device_eval_batch_size = BATCH_SIZE,
    gradient_accumulation_steps = 16,
    learning_rate = 2e-04,
    # max_steps = 100, # adjust this depending on your dataset size
    lr_scheduler_type = 'cosine',
    warmup_ratio = 0.1,
    num_train_epochs=1,

    # args related to eval/save
    logging_strategy="steps",
    logging_steps=1,   
    save_strategy = 'steps',
    save_steps=100,
    save_total_limit = 1,
    fp16 = True, # we have the model train and eval with fp16 precision
    fp16_full_eval = True,
    optim = 'adamw_bnb_8bit', # adam in lower-bits to save memory, consider changing to 'adamw_torch' if model is not converging
    report_to = "wandb", # install wand to use this
    run_name = WANDB_NAME,
    # hub_model_id = REPO_ID,
    # push_to_hub = True, # wel'll push the model to hub after each epoch
    # model that was wrapped for QLORA training with peft will not have arguments listed in its signature
    # so we need to pass lable names explicitly to calculate val loss
    label_names=["labels"],
    dataloader_num_workers=16, # let's get more workers since iterating on video datasets might be slower in general
)

trainer = Trainer(
    model = model,
    tokenizer = processor,
    data_collator = TrainCollate(processor, MAX_LENGTH),
    train_dataset = train_dataset,
    eval_dataset = val_dataset,
    args=args,
)

Namzakku avatar Aug 14 '24 06:08 Namzakku

@zucchini-nlp Hi again! Sorry to keep asking questions, but could you confirm whether the current huggingface model supports multi-turn chat? I tried to go through the code in this folder and looked at processing_llava_next_video.py; it mentions preparing the model for several sequences, but I wasn't clear on how the 2nd and 3rd ground-truth assistant turns are handled in the collate below.

def train_collate_fn(examples):
    images = []
    texts = []
    for example in examples:
        image, ground_truth = example
        images.append(image)
        # TODO: in the future we can replace this by processor.apply_chat_template
        prompt = f"USER: <image>\nExtract JSON.\nASSISTANT: {ground_truth}"
        texts.append(prompt)

    batch = processor(text=texts, images=images, padding=True, truncation=True, max_length=MAX_LENGTH, return_tensors="pt")

    labels = batch["input_ids"].clone()
    labels[labels == processor.tokenizer.pad_token_id] = -100
    batch["labels"] = labels

    input_ids = batch["input_ids"]
    attention_mask = batch["attention_mask"]
    pixel_values = batch["pixel_values"]
    labels = batch["labels"]

    return input_ids, attention_mask, pixel_values, labels


def eval_collate_fn(examples):
    # we only feed the prompt to the model
    images = []
    texts = []
    answers = []
    for example in examples:
        image, ground_truth = example
        images.append(image)
        # TODO: in the future we can replace this by processor.apply_chat_template
        prompt = f"USER: <image>\nExtract JSON.\nASSISTANT:"
        texts.append(prompt)
        answers.append(ground_truth)

    batch = processor(text=texts, images=images, return_tensors="pt", padding=True)

    input_ids = batch["input_ids"]
    attention_mask = batch["attention_mask"]
    pixel_values = batch["pixel_values"]

    return input_ids, attention_mask, pixel_values, answers

given that the data examples look like the one below:

[
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "prompt1"},
        ],
    },
    {
        "role": "assistant",
        "content": [
            {"type": "text", "text": ground_truth1},
        ],
    },
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "prompt2"},
        ],
    },
    {
        "role": "assistant",
        "content": [
            {"type": "text", "text": ground_truth2},
        ],
    }
]

Namzakku avatar Aug 15 '24 02:08 Namzakku

@Namzakku Yes, it supports multi-turn conversations just the way you have them in the example. You just need to pass the conversation to processor.apply_chat_template() and you'll get a correct prompt.

Regarding training, I guess you want to train on all of the assistant's turns, not only the last one. In that case you can call processor.apply_chat_template(args, return_assistant_tokens_mask=True), which will return a mask marking the user's and assistant's turns. I am not sure right now whether the llava-next-video templates support assistant masking, so let me know if it fails for any of the checkpoints and I'll update the template :)

In any case, you have to prepare labels and mask out unattended parts with -100 in the collate function.
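A minimal sketch of that labeling step for a single conversation (assuming the checkpoint's chat template defines assistant generation markers):

out = processor.apply_chat_template(
    conversation,
    tokenize=True,
    return_dict=True,
    return_assistant_tokens_mask=True,
)
input_ids = out["input_ids"]
assistant_mask = out["assistant_masks"]  # 1 for tokens that belong to assistant turns
# keep assistant tokens as labels and ignore everything else with -100
labels = [tok if m == 1 else -100 for tok, m in zip(input_ids, assistant_mask)]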

zucchini-nlp avatar Aug 15 '24 04:08 zucchini-nlp

@zucchini-nlp Thanks for the hint! I have managed to get the labels based on the assistant mask! I found that the chat_template in this processor config is an older version, which is different from the chat_template.json that you updated a few days ago. After changing the chat template, I managed to get the labels tensor below:

tensor([[    1,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,
           -100,  -100,  -100,  -100,  -100,  5962, 29918,   509,  2806, 29896,
           3148,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,
           5962, 29918,   509,  2806, 29906, 29871]])

with the below train collate function

def train_collate_fn(examples):
    images = []
    texts = []
    assistant_masks_batch = []

    for example in examples:
        image, conversations = example
        images.append(image)
        text_prompt = processor.apply_chat_template(conversations)
        texts.append(text_prompt)

        # assistant masks
        assistant_masks_output = processor.apply_chat_template(conversations, return_assistant_tokens_mask=True, return_dict=True, tokenize=True)
        assistant_masks_batch.append(assistant_masks_output["assistant_masks"])

    batch = processor(text=texts, images=images, padding=True, truncation=True, max_length=MAX_LENGTH, return_tensors="pt")

    labels = batch["input_ids"].clone()
    for label, assistant_masks in zip(labels, assistant_masks_batch):
        for index, mask in enumerate(assistant_masks):
            if mask != 1:
                label[index + 1] = -100
    labels[labels == processor.tokenizer.pad_token_id] = -100
    batch["labels"] = labels

    input_ids = batch["input_ids"]
    attention_mask = batch["attention_mask"]
    pixel_values = batch["pixel_values"]
    image_sizes = batch["image_sizes"]
    labels = batch["labels"]

    return input_ids, attention_mask, pixel_values, image_sizes, labels

I'm also trying to figure out what to do for the evaluation collate function. Can you provide some hints?

Namzakku avatar Aug 15 '24 08:08 Namzakku

@Namzakku oh yeah, that processor config is actually deprecated and I'll delete it soon (totally forgot it's still there). You need a new transformers version so that the chat template is loaded from its own file.

For eval you should be able to do the same thing if you want to calculate val-loss. Unfortunately the HF Trainer doesn't support generation while evaluating yet, so I guess val-loss is all you can get.

zucchini-nlp avatar Aug 15 '24 09:08 zucchini-nlp

@zucchini-nlp oh yeah, I also noticed that the HF Trainer doesn't support generation, so I am now switching back to PyTorch Lightning. I was wondering how to apply the assistant masks in the generation code, as the labels are not used in the validation_step in the notebook.

Namzakku avatar Aug 15 '24 10:08 Namzakku

@Namzakku I see; if you have your own validation loop you can still apply the assistant masks and prepare inputs for the loss (in case you want to keep track of val-loss).

And if you want to run generation in the validation loop, you need to crop the assistant's last turn and set add_generation_prompt=True when applying the chat template. Feeding that to the model will give an assistant answer on which you can calculate other metrics, like exact match or BLEU/ROUGE scores.
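A rough sketch of such a generation-based validation step (the variable names are illustrative):

eval_conversation = conversation[:-1]  # drop the final assistant turn
prompt = processor.apply_chat_template(eval_conversation, add_generation_prompt=True)
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
generated = model.generate(**inputs, max_new_tokens=128)
prediction = processor.batch_decode(
    generated[:, inputs.input_ids.shape[-1]:], skip_special_tokens=True
)[0]
# compare `prediction` against the reference answer with exact match, BLEU/ROUGE, etc.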

Oh, also, regarding the device mismatch: I couldn't reproduce it when running inference. Can you make sure you have the latest accelerate version?

zucchini-nlp avatar Aug 15 '24 12:08 zucchini-nlp

@zucchini-nlp yeah, I have the latest version of accelerate, but multi-GPU training from the notebook still doesn't work. When I switch to using scripts, it works well. I just wonder why the PyTorch Lightning version uses so much more GPU memory compared to the HF Trainer. I can barely fit batch = 1 on an 80GB GPU with PyTorch Lightning, but can get a bigger batch on a 48GB GPU with the HF Trainer. The outcome from PyTorch Lightning seems better, though.

And if you want to run generation in the validation loop, you need to crop the assistant's last turn and set add_generation_prompt=True when applying the chat template. Feeding that to the model will give an assistant answer on which you can calculate other metrics, like exact match or BLEU/ROUGE scores.

Is there any way I can get the assistant's answers for the in-between turns?

Namzakku avatar Aug 20 '24 08:08 Namzakku

Is there any way I can get the assistant's answers for the in-between turns?

No, the model can only generate the next several tokens. For that you'd have to write your own generation loop that gets the assistant's answer, appends the user's next turn to the conversation, and in that way generates the whole conversation turn by turn.
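A rough sketch of what such a loop could look like (user_turns and image are hypothetical placeholders; add {"type": "image"} to the first user turn if an image is part of the chat):

conversation = []
for user_turn in user_turns:  # hypothetical list of user messages
    conversation.append({"role": "user", "content": [{"type": "text", "text": user_turn}]})
    prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
    inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=128)
    answer = processor.batch_decode(
        out[:, inputs.input_ids.shape[-1]:], skip_special_tokens=True
    )[0]
    conversation.append({"role": "assistant", "content": [{"type": "text", "text": answer}]})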

I just wonder why the PyTorch Lightning version uses so much more GPU memory compared to the HF Trainer

Noted, I will look into that and into the flaky device mismatch errors. Several people have already reported it, so probably composite models aren't being placed on devices correctly with accelerate. I'll add these to my TODO list; I'll need more time to dig into them.

zucchini-nlp avatar Aug 30 '24 15:08 zucchini-nlp

@ameeramer right, in the last release we made some changes in the backbone LLM which caused errors for LLaVA-NeXT-Video. I made a PR to fix it and will probably do a patch release today; after that you can install transformers from source :)

Dear authors, I have encountered the same issue when running the inference code (error screenshots attached). Do you have any ideas about how to fix it? Thanks.


Holmes-GU avatar Nov 19 '24 11:11 Holmes-GU

@Holmes-GU which version of transformers are you using? Can you update to the latest release and check once more?

zucchini-nlp avatar Nov 20 '24 06:11 zucchini-nlp