LLaVA-NeXT
LLaVa-NeXT-Video is added to 🤗 Transformers!
Hey all!
The video models are all supported in Transformers now and will be part of the v4.42 release. Feel free to check out the model checkpoints here.
To get the model, update transformers by running: !pip install --upgrade git+https://github.com/huggingface/transformers.git. Inference with videos can be done as follows:
import av
import numpy as np
from transformers import LlavaNextVideoForConditionalGeneration, LlavaNextVideoProcessor

def read_video_pyav(container, indices):
    # Decode the frames at the given indices with PyAV and return them as an
    # array of shape (num_frames, height, width, 3).
    frames = []
    container.seek(0)
    start_index = indices[0]
    end_index = indices[-1]
    for i, frame in enumerate(container.decode(video=0)):
        if i > end_index:
            break
        if i >= start_index and i in indices:
            frames.append(frame)
    return np.stack([x.to_ndarray(format="rgb24") for x in frames])
processor = LlavaNextVideoProcessor.from_pretrained("llava-hf/LLaVA-NeXT-Video-7B-hf")
model = LlavaNextVideoForConditionalGeneration.from_pretrained("llava-hf/LLaVA-NeXT-Video-7B-hf", device_map='auto')
video_path = "YOUR-LOCAL-VIDEO-PATH
container = av.open(video_path)
# sample uniformly 8 or more frames, depending on length of video
total_frames = container.streams.video[0].frames
indices = np.arange(0, total_frames, total_frames / 8).astype(int)
clip = read_video_pyav(container, indices)
# Prepare a chat formatted input
# Each "content" is a list of dicts and you can add image/video/text modalities
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What do you see in this video?"},
            {"type": "video"},
        ],
    },
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(text=prompt, videos=clip, return_tensors="pt")
# Generate
generate_ids = model.generate(**inputs, max_new_tokens=50, do_sample=True, temperature=0.9)
print(processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0])
Useful links: Colab for inference, Colab for fine-tuning, Transformers docs
Cool! Thanks a lot!!!
Does this also support the latest interleaved version?
How to implement batch inference?
Hi, thanks for your excellent work! I wonder why, when I use 32 frames per video, the model begins to output random, irrelevant answers, but when I use fewer frames (e.g. 28, 24, 16, or 8) I get normal answers. This model should handle up to 32 frames, right?
@HarperGG please take a look at the colab notebook for inference; there are code snippets for batch inference near the end.
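For reference, here is a rough sketch of batched generation (clip_1 and clip_2 are hypothetical clips loaded with read_video_pyav as above; the notebook has the complete version):
processor.tokenizer.padding_side = "left"  # decoder-only models are usually left-padded for batched generation
prompts = [
    processor.apply_chat_template(conversation, add_generation_prompt=True),
    processor.apply_chat_template(conversation, add_generation_prompt=True),
]
inputs = processor(text=prompts, videos=[clip_1, clip_2], padding=True, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=50)
print(processor.batch_decode(output, skip_special_tokens=True))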
@zhengrongz yes, the config was missing the rope scaling factor. It should work now; if you need an even longer context / more frames, consider increasing the rope scaling factor.
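One possible way to do that (a sketch only; the scaling factor below is purely illustrative) is to override the text backbone's rope scaling in the config before loading the weights:
from transformers import AutoConfig, LlavaNextVideoForConditionalGeneration
config = AutoConfig.from_pretrained("llava-hf/LLaVA-NeXT-Video-7B-hf")
config.text_config.rope_scaling = {"type": "linear", "factor": 4.0}  # illustrative value
model = LlavaNextVideoForConditionalGeneration.from_pretrained(
    "llava-hf/LLaVA-NeXT-Video-7B-hf",
    config=config,
    device_map="auto",
)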
@CS123n no, currently it's not available in transformers
That's OK! Thank you! I have another question, maybe a little stupid: if I only want to get the answer after "ASSISTANT:", what should I change? (That is, the final output should not include the USER turn and my questions.) I'd rather not filter the output with regular expressions if I can avoid it.
@zhengrongz you can strip off the prompt text based on input length similar to below:
inputs = processor(prompt, videos=clip, return_tensors="pt")
input_length = inputs.input_ids.shape[-1]
output = model.generate(**inputs)
output_stripped = output[:, input_length:]  # strip off 'input_length' tokens from the beginning
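and then decode just the generated part, for example:
answer = processor.batch_decode(output_stripped, skip_special_tokens=True)[0]
print(answer)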
I know! Thanks!
@zucchini-nlp Thank you for your awesome work! I want to ask whether the fine-tuning code works with image input as well? I made some attempts, but it seems there is a mismatch in the tensor sizes. It would be great if you could give some hints. Thanks!
@Namzakku hey! yes, it should work with any inputs actually. Can you show the error you encountered and the minimal code to reproduce the error?
@zucchini-nlp Thanks for quick reply!
As the read_video_xxx function does not work with images, I switched to using PIL in collate_fn() as below
def collate_fn(example, path):
    image_file = example['image_path']
    image = Image.open(os.path.join(path, image_file))
    conversation = example['conversations']
    prompt = processor.apply_chat_template(conversation, add_generation_prompt=False)
    batch = processor(
        text=prompt,
        images=image,
        truncation=True,
        max_length=MAX_LENGTH,
        return_tensors="pt"
    )
    print(example['image_path'])
    return batch
Then I changed "pixel_values_video" in the LlavaNextVideoDataCollatorWithPadding class to "pixel_values".
With everything else the same, I ran the trainer and got the error below:
RuntimeError Traceback (most recent call last)
Cell In[21], line 1
----> 1 trainer.train()
File /usr/local/lib/python3.10/dist-packages/transformers/trainer.py:1932, in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
1930 hf_hub_utils.enable_progress_bars()
1931 else:
-> 1932 return inner_training_loop(
1933 args=args,
1934 resume_from_checkpoint=resume_from_checkpoint,
1935 trial=trial,
1936 ignore_keys_for_eval=ignore_keys_for_eval,
1937 )
File /usr/local/lib/python3.10/dist-packages/transformers/trainer.py:2230, in Trainer._inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
2227 rng_to_sync = True
2229 step = -1
-> 2230 for step, inputs in enumerate(epoch_iterator):
2231 total_batched_samples += 1
2233 if self.args.include_num_input_tokens_seen:
File /usr/local/lib/python3.10/dist-packages/accelerate/data_loader.py:454, in DataLoaderShard.__iter__(self)
452 # We iterate one batch ahead to check when we are at the end
453 try:
--> 454 current_batch = next(dataloader_iter)
455 except StopIteration:
456 yield
File /usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py:630, in _BaseDataLoaderIter.__next__(self)
627 if self._sampler_iter is None:
628 # TODO(https://github.com/pytorch/pytorch/issues/76750)
629 self._reset() # type: ignore[call-arg]
--> 630 data = self._next_data()
631 self._num_yielded += 1
632 if self._dataset_kind == _DatasetKind.Iterable and \
633 self._IterableDataset_len_called is not None and \
634 self._num_yielded > self._IterableDataset_len_called:
File /usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py:1345, in _MultiProcessingDataLoaderIter._next_data(self)
1343 else:
1344 del self._task_info[idx]
-> 1345 return self._process_data(data)
File /usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py:1371, in _MultiProcessingDataLoaderIter._process_data(self, data)
1369 self._try_put_index()
1370 if isinstance(data, ExceptionWrapper):
-> 1371 data.reraise()
1372 return data
File /usr/local/lib/python3.10/dist-packages/torch/_utils.py:694, in ExceptionWrapper.reraise(self)
690 except TypeError:
691 # If the exception takes multiple arguments, don't try to
692 # instantiate since we don't know how to
693 raise RuntimeError(msg) from None
--> 694 raise exception
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
data = fetcher.fetch(index)
File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/fetch.py", line 54, in fetch
return self.collate_fn(data)
File "/tmp/ipykernel_1223/1722779256.py", line 18, in __call__
padded_inputs["pixel_values"] = torch.cat([feat['pixel_values'] for feat in features], dim=0)
RuntimeError: Sizes of tensors must match except in dimension 0. Expected size 4 but got size 5 for tensor number 2 in the list.
Ah I see, I forgot that llava-next processes images in patches, and each image can contain a different number of patches, unlike videos where the number of frames is fixed.
A solution could be to get rid of the PaddingCollator and instead use something like a CollateForModel that takes the image path and text and processes them into tensors on the fly while training (sketched below). The reason I opted for two collators is that saving videos and then loading them on the fly is slow, because they require more memory than images.
Also take a look at this notebook on LLaVa (https://github.com/NielsRogge/Transformers-Tutorials/blob/master/LLaVa/Fine_tune_LLaVa_on_a_custom_dataset_(with_PyTorch_Lightning).ipynb). It is easily customizable for llava-next (in the image-text case) by changing a few input args.
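For the image-text case, such an on-the-fly collator could look roughly like this (a sketch only; the class and argument names are illustrative, and it mirrors the collate_fn above):
import os
from PIL import Image

class CollateForModel:
    # Hypothetical sketch: process raw image paths + conversations into tensors on the fly.
    def __init__(self, processor, dataset_path, max_length):
        self.processor = processor
        self.dataset_path = dataset_path
        self.max_length = max_length

    def __call__(self, examples):
        images, texts = [], []
        for example in examples:
            images.append(Image.open(os.path.join(self.dataset_path, example["image_path"])))
            texts.append(self.processor.apply_chat_template(example["conversations"]))
        batch = self.processor(
            text=texts,
            images=images,
            padding=True,
            truncation=True,
            max_length=self.max_length,
            return_tensors="pt",
        )
        labels = batch["input_ids"].clone()
        labels[labels == self.processor.tokenizer.pad_token_id] = -100  # ignore padding in the loss
        batch["labels"] = labels
        return batch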
Thanks for the advice! I will look further into the above notebook!
@Namzakku Hello, in case there is still interest: I've put up a codebase https://github.com/zjysteven/lmms-finetune that supports fine-tuning of various types of LMMs, including llava-next-interleave and llava-next-video. It took inspiration from the above fine-tuning notebook by the Hugging Face staff, but differs from it in some ways, including: 1) it uses the Hugging Face Trainer, and 2) it sets labels to only the model responses rather than both user prompts + model responses (as the notebook did). Feel free to try it out; I would appreciate any comments/feedback.
Hey, can anyone point me to somewhere that specifies the recommended infrastructure to deploy/run this model?
E.g. how much VRAM, how many CPU cores, etc.
hey, running this code:
import av
import torch
import numpy as np
from transformers import LlavaNextVideoForConditionalGeneration, LlavaNextVideoProcessor

def read_video_pyav(container, indices):
    """
    Decode the video with PyAV decoder.
    Args:
        container (`av.container.input.InputContainer`): PyAV container.
        indices (`List[int]`): List of frame indices to decode.
    Returns:
        result (np.ndarray): np array of decoded frames of shape (num_frames, height, width, 3).
    """
    frames = []
    container.seek(0)
    start_index = indices[0]
    end_index = indices[-1]
    for i, frame in enumerate(container.decode(video=0)):
        if i > end_index:
            break
        if i >= start_index and i in indices:
            frames.append(frame)
    return np.stack([x.to_ndarray(format="rgb24") for x in frames])

# Load the model in half-precision
model = LlavaNextVideoForConditionalGeneration.from_pretrained(
    "llava-hf/LLaVA-NeXT-Video-7B-hf",
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",
).to(0)
processor = LlavaNextVideoProcessor.from_pretrained("llava-hf/LLaVA-NeXT-Video-7B-hf")

# Load the video as an np.array, sampling uniformly 8 frames (can sample more for longer videos)
video_path = "/mnt/nfs/decart/amir/tarsier/429916.mp4"
container = av.open(video_path)
total_frames = container.streams.video[0].frames
indices = np.arange(0, total_frames, total_frames / 8).astype(int)
video = read_video_pyav(container, indices)

conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Please provide a detailed description of the video, focusing on the main subjects, their actions, the background scenes.",
            },
            {"type": "video"},
        ],
    },
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(text=prompt, videos=video, return_tensors="pt").to(0)

out = model.generate(**inputs, max_new_tokens=60)
print(processor.batch_decode(out, skip_special_tokens=True, clean_up_tokenization_spaces=True))
I get this error:
Unrecognized keys in `rope_scaling` for 'rope_type'='linear': {'type'}
Loading checkpoint shards: 100%|██████████| 3/3 [00:06<00:00, 2.01s/it]
Chat templates should be in a 'chat_template.json' file but found key='chat_template' in the processor's config. Make sure to move your template to its own file.
/mnt/nfs/miniconda3/envs/share4video/lib/python3.10/site-packages/transformers/feature_extraction_utils.py:142: UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow. Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor. (Triggered internally at ../torch/csrc/utils/tensor_new.cpp:261.)
return torch.tensor(value)
The `seen_tokens` attribute is deprecated and will be removed in v4.41. Use the `cache_position` model input instead.
Traceback (most recent call last):
File "/mnt/nfs/LLaVA-NeXT/llavanext.py", line 60, in <module>
out = model.generate(**inputs, max_new_tokens=60)
File "/mnt/nfs/miniconda3/envs/share4video/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/mnt/nfs/miniconda3/envs/share4video/lib/python3.10/site-packages/transformers/generation/utils.py", line 2024, in generate
result = self._sample(
File "/mnt/nfs/miniconda3/envs/share4video/lib/python3.10/site-packages/transformers/generation/utils.py", line 2982, in _sample
outputs = self(**model_inputs, return_dict=True)
File "/mnt/nfs/decart/miniconda3/envs/share4video-amir/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/mnt/nfs/miniconda3/envs/share4video/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/mnt/nfs/miniconda3/envs/share4video/lib/python3.10/site-packages/transformers/models/llava_next_video/modeling_llava_next_video.py", line 915, in forward
first_layer_past_key_value = past_key_values[0][0][:, :, :, 0]
File "/mnt/nfs/miniconda3/envs/share4video/lib/python3.10/site-packages/transformers/cache_utils.py", line 334, in __getitem__
raise KeyError(f"Cache only has {len(self)} layers, attempted to access layer with index {layer_idx}")
KeyError: 'Cache only has 0 layers, attempted to access layer with index 0'
@ameeramer right, in the last release we made some changes in the backbone LLM which caused errors in LLaVA-NeXT-Video. I made a PR to fix it and will probably do a patch release today; after that you can install transformers from source :)
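Installing from source is the same command as at the top of the thread:
pip install --upgrade git+https://github.com/huggingface/transformers.git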
Hi @zucchini-nlp, I tried to train on multiple GPUs but keep receiving the error below. Do you have any idea how to solve it?
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1!
This is what I got from model.hf_device_map.
{'image_newline': 0,
'vision_tower': 0,
'multi_modal_projector': 0,
'language_model.model.embed_tokens': 0,
'language_model.model.layers.0': 0,
'language_model.model.layers.1': 0,
'language_model.model.layers.2': 0,
'language_model.model.layers.3': 0,
'language_model.model.layers.4': 0,
'language_model.model.layers.5': 0,
'language_model.model.layers.6': 0,
'language_model.model.layers.7': 0,
'language_model.model.layers.8': 0,
'language_model.model.layers.9': 0,
'language_model.model.layers.10': 0,
'language_model.model.layers.11': 0,
'language_model.model.layers.12': 1,
'language_model.model.layers.13': 1,
'language_model.model.layers.14': 1,
'language_model.model.layers.15': 1,
'language_model.model.layers.16': 1,
'language_model.model.layers.17': 1,
'language_model.model.layers.18': 1,
'language_model.model.layers.19': 1,
'language_model.model.layers.20': 1,
'language_model.model.layers.21': 1,
'language_model.model.layers.22': 1,
'language_model.model.layers.23': 1,
'language_model.model.layers.24': 1,
'language_model.model.layers.25': 1,
'language_model.model.layers.26': 1,
'language_model.model.layers.27': 1,
'language_model.model.layers.28': 1,
'language_model.model.layers.29': 1,
'language_model.model.layers.30': 1,
'language_model.model.layers.31': 1,
'language_model.model.norm': 1,
'language_model.lm_head': 1}
@Namzakku Hmm, from the device map there doesn't seem to be anything that would cause device mismatch errors. Can you share the full traceback, so we can see exactly where the tensors end up on different devices, and a minimal reproducer?
@zucchini-nlp Thanks for the response! This is the full traceback
/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:429: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
warnings.warn(
/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:61: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
warnings.warn(
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
Cell In[32], line 1
----> 1 trainer.train()
File /usr/local/lib/python3.10/dist-packages/transformers/trainer.py:1932, in Trainer.train(self, resume_from_checkpoint, trial, ignore_keys_for_eval, **kwargs)
1930 hf_hub_utils.enable_progress_bars()
1931 else:
-> 1932 return inner_training_loop(
1933 args=args,
1934 resume_from_checkpoint=resume_from_checkpoint,
1935 trial=trial,
1936 ignore_keys_for_eval=ignore_keys_for_eval,
1937 )
File /usr/local/lib/python3.10/dist-packages/transformers/trainer.py:2268, in Trainer._inner_training_loop(self, batch_size, args, resume_from_checkpoint, trial, ignore_keys_for_eval)
2265 self.control = self.callback_handler.on_step_begin(args, self.state, self.control)
2267 with self.accelerator.accumulate(model):
-> 2268 tr_loss_step = self.training_step(model, inputs)
2270 if (
2271 args.logging_nan_inf_filter
2272 and not is_torch_xla_available()
2273 and (torch.isnan(tr_loss_step) or torch.isinf(tr_loss_step))
2274 ):
2275 # if loss is nan or inf simply add the average of previous logged losses
2276 tr_loss += tr_loss / (1 + self.state.global_step - self._globalstep_last_logged)
File /usr/local/lib/python3.10/dist-packages/transformers/trainer.py:3307, in Trainer.training_step(self, model, inputs)
3304 return loss_mb.reduce_mean().detach().to(self.args.device)
3306 with self.compute_loss_context_manager():
-> 3307 loss = self.compute_loss(model, inputs)
3309 del inputs
3311 kwargs = {}
File /usr/local/lib/python3.10/dist-packages/transformers/trainer.py:3338, in Trainer.compute_loss(self, model, inputs, return_outputs)
3336 else:
3337 labels = None
-> 3338 outputs = model(**inputs)
3339 # Save past state if it exists
3340 # TODO: this needs to be fixed and made cleaner later.
3341 if self.args.past_index >= 0:
File /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1518, in Module._wrapped_call_impl(self, *args, **kwargs)
1516 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1517 else:
-> 1518 return self._call_impl(*args, **kwargs)
File /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1527, in Module._call_impl(self, *args, **kwargs)
1522 # If we don't have any hooks, we want to skip the rest of the logic in
1523 # this function, and just call forward.
1524 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1525 or _global_backward_pre_hooks or _global_backward_hooks
1526 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1527 return forward_call(*args, **kwargs)
1529 try:
1530 result = None
File /usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py:825, in convert_outputs_to_fp32.<locals>.forward(*args, **kwargs)
824 def forward(*args, **kwargs):
--> 825 return model_forward(*args, **kwargs)
File /usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py:813, in ConvertOutputsToFp32.__call__(self, *args, **kwargs)
812 def __call__(self, *args, **kwargs):
--> 813 return convert_to_fp32(self.model_forward(*args, **kwargs))
File /usr/local/lib/python3.10/dist-packages/torch/amp/autocast_mode.py:16, in autocast_decorator.<locals>.decorate_autocast(*args, **kwargs)
13 @functools.wraps(func)
14 def decorate_autocast(*args, **kwargs):
15 with autocast_instance:
---> 16 return func(*args, **kwargs)
File /usr/local/lib/python3.10/dist-packages/peft/peft_model.py:762, in PeftModel.forward(self, *args, **kwargs)
760 with self._enable_peft_forward_hooks(*args, **kwargs):
761 kwargs = {k: v for k, v in kwargs.items() if k not in self.special_peft_forward_args}
--> 762 return self.get_base_model()(*args, **kwargs)
File /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1518, in Module._wrapped_call_impl(self, *args, **kwargs)
1516 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1517 else:
-> 1518 return self._call_impl(*args, **kwargs)
File /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1527, in Module._call_impl(self, *args, **kwargs)
1522 # If we don't have any hooks, we want to skip the rest of the logic in
1523 # this function, and just call forward.
1524 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1525 or _global_backward_pre_hooks or _global_backward_hooks
1526 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1527 return forward_call(*args, **kwargs)
1529 try:
1530 result = None
File /usr/local/lib/python3.10/dist-packages/accelerate/hooks.py:166, in add_hook_to_module.<locals>.new_forward(module, *args, **kwargs)
164 output = module._old_forward(*args, **kwargs)
165 else:
--> 166 output = module._old_forward(*args, **kwargs)
167 return module._hf_hook.post_forward(module, output)
File /usr/local/lib/python3.10/dist-packages/transformers/models/llava_next/modeling_llava_next.py:855, in LlavaNextForConditionalGeneration.forward(self, input_ids, pixel_values, image_sizes, attention_mask, position_ids, past_key_values, inputs_embeds, vision_feature_layer, vision_feature_select_strategy, labels, use_cache, output_attentions, output_hidden_states, return_dict)
851 attention_mask = torch.cat((extended_attention_mask, attention_mask[:, -target_length:]), dim=1)
853 position_ids = torch.sum(attention_mask, dim=1).unsqueeze(-1) - 1
--> 855 outputs = self.language_model(
856 attention_mask=attention_mask,
857 position_ids=position_ids,
858 past_key_values=past_key_values,
859 inputs_embeds=inputs_embeds,
860 use_cache=use_cache,
861 output_attentions=output_attentions,
862 output_hidden_states=output_hidden_states,
863 return_dict=return_dict,
864 )
866 logits = outputs[0]
868 loss = None
File /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1518, in Module._wrapped_call_impl(self, *args, **kwargs)
1516 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1517 else:
-> 1518 return self._call_impl(*args, **kwargs)
File /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1527, in Module._call_impl(self, *args, **kwargs)
1522 # If we don't have any hooks, we want to skip the rest of the logic in
1523 # this function, and just call forward.
1524 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1525 or _global_backward_pre_hooks or _global_backward_hooks
1526 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1527 return forward_call(*args, **kwargs)
1529 try:
1530 result = None
File /usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py:1174, in LlamaForCausalLM.forward(self, input_ids, attention_mask, position_ids, past_key_values, inputs_embeds, labels, use_cache, output_attentions, output_hidden_states, return_dict, cache_position)
1171 return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1173 # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn)
-> 1174 outputs = self.model(
1175 input_ids=input_ids,
1176 attention_mask=attention_mask,
1177 position_ids=position_ids,
1178 past_key_values=past_key_values,
1179 inputs_embeds=inputs_embeds,
1180 use_cache=use_cache,
1181 output_attentions=output_attentions,
1182 output_hidden_states=output_hidden_states,
1183 return_dict=return_dict,
1184 cache_position=cache_position,
1185 )
1187 hidden_states = outputs[0]
1188 if self.config.pretraining_tp > 1:
File /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1518, in Module._wrapped_call_impl(self, *args, **kwargs)
1516 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1517 else:
-> 1518 return self._call_impl(*args, **kwargs)
File /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1527, in Module._call_impl(self, *args, **kwargs)
1522 # If we don't have any hooks, we want to skip the rest of the logic in
1523 # this function, and just call forward.
1524 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1525 or _global_backward_pre_hooks or _global_backward_hooks
1526 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1527 return forward_call(*args, **kwargs)
1529 try:
1530 result = None
File /usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py:967, in LlamaModel.forward(self, input_ids, attention_mask, position_ids, past_key_values, inputs_embeds, use_cache, output_attentions, output_hidden_states, return_dict, cache_position)
964 all_hidden_states += (hidden_states,)
966 if self.gradient_checkpointing and self.training:
--> 967 layer_outputs = self._gradient_checkpointing_func(
968 decoder_layer.__call__,
969 hidden_states,
970 causal_mask,
971 position_ids,
972 past_key_values,
973 output_attentions,
974 use_cache,
975 cache_position,
976 )
977 else:
978 layer_outputs = decoder_layer(
979 hidden_states,
980 attention_mask=causal_mask,
(...)
985 cache_position=cache_position,
986 )
File /usr/local/lib/python3.10/dist-packages/torch/_compile.py:24, in _disable_dynamo.<locals>.inner(*args, **kwargs)
20 @functools.wraps(fn)
21 def inner(*args, **kwargs):
22 import torch._dynamo
---> 24 return torch._dynamo.disable(fn, recursive)(*args, **kwargs)
File /usr/local/lib/python3.10/dist-packages/torch/_dynamo/eval_frame.py:328, in _TorchDynamoContext.__call__.<locals>._fn(*args, **kwargs)
326 dynamic_ctx.__enter__()
327 try:
--> 328 return fn(*args, **kwargs)
329 finally:
330 set_eval_frame(prior)
File /usr/local/lib/python3.10/dist-packages/torch/_dynamo/external_utils.py:17, in wrap_inline.<locals>.inner(*args, **kwargs)
15 @functools.wraps(fn)
16 def inner(*args, **kwargs):
---> 17 return fn(*args, **kwargs)
File /usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:451, in checkpoint(function, use_reentrant, context_fn, determinism_check, debug, *args, **kwargs)
446 if context_fn is not noop_context_fn or debug is not False:
447 raise ValueError(
448 "Passing `context_fn` or `debug` is only supported when "
449 "use_reentrant=False."
450 )
--> 451 return CheckpointFunction.apply(function, preserve, *args)
452 else:
453 gen = _checkpoint_without_reentrant_generator(
454 function, preserve, context_fn, determinism_check, debug, *args, **kwargs
455 )
File /usr/local/lib/python3.10/dist-packages/torch/autograd/function.py:539, in Function.apply(cls, *args, **kwargs)
536 if not torch._C._are_functorch_transforms_active():
537 # See NOTE: [functorch vjp and autograd interaction]
538 args = _functorch.utils.unwrap_dead_wrappers(args)
--> 539 return super().apply(*args, **kwargs) # type: ignore[misc]
541 if cls.setup_context == _SingleLevelFunction.setup_context:
542 raise RuntimeError(
543 "In order to use an autograd.Function with functorch transforms "
544 "(vmap, grad, jvp, jacrev, ...), it must override the setup_context "
545 "staticmethod. For more details, please see "
546 "https://pytorch.org/docs/master/notes/extending.func.html"
547 )
File /usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:230, in CheckpointFunction.forward(ctx, run_function, preserve_rng_state, *args)
227 ctx.save_for_backward(*tensor_inputs)
229 with torch.no_grad():
--> 230 outputs = run_function(*args)
231 return outputs
File /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1518, in Module._wrapped_call_impl(self, *args, **kwargs)
1516 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1517 else:
-> 1518 return self._call_impl(*args, **kwargs)
File /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1527, in Module._call_impl(self, *args, **kwargs)
1522 # If we don't have any hooks, we want to skip the rest of the logic in
1523 # this function, and just call forward.
1524 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1525 or _global_backward_pre_hooks or _global_backward_hooks
1526 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1527 return forward_call(*args, **kwargs)
1529 try:
1530 result = None
File /usr/local/lib/python3.10/dist-packages/accelerate/hooks.py:166, in add_hook_to_module.<locals>.new_forward(module, *args, **kwargs)
164 output = module._old_forward(*args, **kwargs)
165 else:
--> 166 output = module._old_forward(*args, **kwargs)
167 return module._hf_hook.post_forward(module, output)
File /usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py:715, in LlamaDecoderLayer.forward(self, hidden_states, attention_mask, position_ids, past_key_value, output_attentions, use_cache, cache_position, **kwargs)
694 """
695 Args:
696 hidden_states (`torch.FloatTensor`): input to the layer of shape `(batch, seq_len, embed_dim)`
(...)
711 into the model
712 """
713 residual = hidden_states
--> 715 hidden_states = self.input_layernorm(hidden_states)
717 # Self Attention
718 hidden_states, self_attn_weights, present_key_value = self.self_attn(
719 hidden_states=hidden_states,
720 attention_mask=attention_mask,
(...)
725 cache_position=cache_position,
726 )
File /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1518, in Module._wrapped_call_impl(self, *args, **kwargs)
1516 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1517 else:
-> 1518 return self._call_impl(*args, **kwargs)
File /usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1527, in Module._call_impl(self, *args, **kwargs)
1522 # If we don't have any hooks, we want to skip the rest of the logic in
1523 # this function, and just call forward.
1524 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1525 or _global_backward_pre_hooks or _global_backward_hooks
1526 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1527 return forward_call(*args, **kwargs)
1529 try:
1530 result = None
File /usr/local/lib/python3.10/dist-packages/accelerate/hooks.py:166, in add_hook_to_module.<locals>.new_forward(module, *args, **kwargs)
164 output = module._old_forward(*args, **kwargs)
165 else:
--> 166 output = module._old_forward(*args, **kwargs)
167 return module._hf_hook.post_forward(module, output)
File /usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py:88, in LlamaRMSNorm.forward(self, hidden_states)
86 variance = hidden_states.pow(2).mean(-1, keepdim=True)
87 hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
---> 88 return self.weight * hidden_states.to(input_dtype)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1!
The code I used is mainly from the link below: https://github.com/NielsRogge/Transformers-Tutorials/blob/master/LLaVa/Fine_tune_LLaVa_on_a_custom_dataset_(with_PyTorch_Lightning).ipynb
but I modified it to use the Hugging Face Trainer instead of PyTorch Lightning.
The modification is mainly in the LlavaNextDataset, and I also customized a train collator to support multi-turn conversations, as below.
class LlavaNextDataset(Dataset):
    """
    PyTorch Dataset for LLaVa-NeXT. This class takes a HuggingFace Dataset as input.
    Each row consists of an image path (png/jpg/jpeg) and ground truth data (json/jsonl/txt).
    """

    def __init__(
        self,
        dataset,
        split: str = "train",
    ):
        super().__init__()
        self.split = split
        self.dataset = dataset
        self.dataset_length = len(self.dataset)

    def __len__(self) -> int:
        return self.dataset_length

    def __getitem__(self, idx: int) -> Dict:
        """
        Returns one item of the dataset.
        Returns:
            image : the original Receipt image
            target_sequence : tokenized ground truth sequence
        """
        sample = self.dataset[idx]
        # inputs
        image = Image.open(os.path.join(DATASET_PATH, sample["image_path"]))
        conversations = sample["conversations"]
        return image, conversations
class TrainCollate:
    def __init__(self, processor, max_length):
        self.processor = processor
        self.max_length = max_length

    def __call__(self, examples):
        images = []
        texts = []
        for example in examples:
            image, conversations = example
            images.append(image)
            text_prompt = self.processor.apply_chat_template(conversations)
            texts.append(text_prompt)
        batch = self.processor(
            text=texts,
            images=images,
            padding=True,
            truncation=True,
            max_length=self.max_length,
            return_tensors="pt"
        )
        labels = batch["input_ids"].clone()
        labels[labels == self.processor.tokenizer.pad_token_id] = -100
        batch["labels"] = labels
        return batch
Other than that, how the model is loaded and how PEFT is applied are the same as in the notebook. Below are the training parameters.
args = TrainingArguments(
    # args related to training
    output_dir=OUTPUT_DIR,
    eval_strategy='steps',
    eval_steps=100,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    gradient_accumulation_steps=16,
    learning_rate=2e-04,
    # max_steps=100,  # adjust this depending on your dataset size
    lr_scheduler_type='cosine',
    warmup_ratio=0.1,
    num_train_epochs=1,
    # args related to eval/save
    logging_strategy="steps",
    logging_steps=1,
    save_strategy='steps',
    save_steps=100,
    save_total_limit=1,
    fp16=True,  # we have the model train and eval with fp16 precision
    fp16_full_eval=True,
    optim='adamw_bnb_8bit',  # adam in lower-bits to save memory, consider changing to 'adamw_torch' if model is not converging
    report_to="wandb",  # install wandb to use this
    run_name=WANDB_NAME,
    # hub_model_id=REPO_ID,
    # push_to_hub=True,  # we'll push the model to the hub after each epoch
    # a model that was wrapped for QLoRA training with peft will not have arguments listed in its signature,
    # so we need to pass label names explicitly to calculate val loss
    label_names=["labels"],
    dataloader_num_workers=16,  # let's get more workers since iterating on video datasets might be slower in general
)
trainer = Trainer(
    model=model,
    tokenizer=processor,
    data_collator=TrainCollate(processor, MAX_LENGTH),
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    args=args,
)
@zucchini-nlp Hi again! Sorry for the repeated questions, but could you verify whether the current Hugging Face model supports multi-turn chat? I tried to go through the code in this folder and looked at processing_llava_next_video.py; it mentions preparing the model for several sequences, but it wasn't clear to me how the 2nd and 3rd ground-truth (assistant) turns are handled in the collate below.
def train_collate_fn(examples):
    images = []
    texts = []
    for example in examples:
        image, ground_truth = example
        images.append(image)
        # TODO: in the future we can replace this by processor.apply_chat_template
        prompt = f"USER: <image>\nExtract JSON.\nASSISTANT: {ground_truth}"
        texts.append(prompt)

    batch = processor(text=texts, images=images, padding=True, truncation=True, max_length=MAX_LENGTH, return_tensors="pt")

    labels = batch["input_ids"].clone()
    labels[labels == processor.tokenizer.pad_token_id] = -100
    batch["labels"] = labels

    input_ids = batch["input_ids"]
    attention_mask = batch["attention_mask"]
    pixel_values = batch["pixel_values"]
    labels = batch["labels"]

    return input_ids, attention_mask, pixel_values, labels
def eval_collate_fn(examples):
    # we only feed the prompt to the model
    images = []
    texts = []
    answers = []
    for example in examples:
        image, ground_truth = example
        images.append(image)
        # TODO: in the future we can replace this by processor.apply_chat_template
        prompt = f"USER: <image>\nExtract JSON.\nASSISTANT:"
        texts.append(prompt)
        answers.append(ground_truth)

    batch = processor(text=texts, images=images, return_tensors="pt", padding=True)

    input_ids = batch["input_ids"]
    attention_mask = batch["attention_mask"]
    pixel_values = batch["pixel_values"]

    return input_ids, attention_mask, pixel_values, answers
given that a data example looks like the one below:
[
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "prompt1"},
        ],
    },
    {
        "role": "assistant",
        "content": [
            {"type": "text", "text": ground_truth1},
        ],
    },
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "prompt2"},
        ],
    },
    {
        "role": "assistant",
        "content": [
            {"type": "text", "text": ground_truth2},
        ],
    },
]
@Namzakku Yes, it supports multi-turn conversations just the way you have it in the example. You just need to pass in the convo to processor.apply_chat_template() and you'll get a correct prompt.
Regarding training, I guess you want to train on all of the assistant's turns, not only the last one. In that case you can call processor.apply_chat_template(args, return_assistant_tokens_mask=True), which will return a mask for the user's and assistant's turns. I am not sure right now if we have assistant masking in the llava-next-video templates, so let me know if it fails for any of the checkpoints and I'll update the template :)
In any case, you have to prepare labels and mask out the parts you don't want to compute loss on with -100 in the collate function.
@zucchini-nlp Thanks for the hint! I have managed to get the labels based on the assistant mask! I found that the chat_template in this processor config is an older version, different from the chat_template.json that you updated a few days ago. After changing the chat template, I managed to get the labels tensor below,
tensor([[ 1, -100, -100, -100, -100, -100, -100, -100, -100, -100,
-100, -100, -100, -100, -100, 5962, 29918, 509, 2806, 29896,
3148, -100, -100, -100, -100, -100, -100, -100, -100, -100,
5962, 29918, 509, 2806, 29906, 29871]]))
with the below train collate function
def train_collate_fn(examples):
    images = []
    texts = []
    assistant_masks_batch = []
    for example in examples:
        image, conversations = example
        images.append(image)
        text_prompt = processor.apply_chat_template(conversations)
        texts.append(text_prompt)
        # assistant masks
        assistant_masks_output = processor.apply_chat_template(conversations, return_assistant_tokens_mask=True, return_dict=True, tokenize=True)
        assistant_masks_batch.append(assistant_masks_output["assistant_masks"])

    batch = processor(text=texts, images=images, padding=True, truncation=True, max_length=MAX_LENGTH, return_tensors="pt")

    labels = batch["input_ids"].clone()
    for label, assistant_masks in zip(labels, assistant_masks_batch):
        for index, mask in enumerate(assistant_masks):
            if mask != 1:
                label[index + 1] = -100
    labels[labels == processor.tokenizer.pad_token_id] = -100
    batch["labels"] = labels

    input_ids = batch["input_ids"]
    attention_mask = batch["attention_mask"]
    pixel_values = batch["pixel_values"]
    image_sizes = batch["image_sizes"]
    labels = batch["labels"]

    return input_ids, attention_mask, pixel_values, image_sizes, labels
I'm also trying to figure out how to handle the evaluation collate function. Can you provide some hints?
@Namzakku oh yeah, the template in the processor config is actually deprecated and I'll delete it soon (totally forgot it's still there). You need a newer transformers version so that the chat template is loaded from its own file.
For eval you should be able to do the same thing, if you want to calculate val-loss. Unfortunately HFTrainer doesn't support generation while evaluating yet, so I guess you need only val-loss
@zucchini-nlp oh yeah, I also noticed that HF Trainer doesn't support generation, so I am now switching back to using PyTorch Lightning. I was wondering how to apply the assistant masks in the generation code, as the labels are not used in the validation_step in the notebook.
@Namzakku I see, if you have your own validation loop you can still apply attention masks and prepare inputs for the loss (in case you want to keep track of val-loss).
And if you want to run generation in the validation loop, you need to crop the assistant's last turn and set add_generation_prompt=True when applying the chat template. Feeding the output to the model will give the assistant's answer, on which you can calculate other metrics like exact match or BLEU/ROUGE scores.
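A rough sketch of such a validation step (assuming conversations holds the full multi-turn example and image the corresponding PIL image, as in the collators above; max_new_tokens is illustrative):
# drop the assistant's last turn and keep it as the reference answer
eval_conversation = conversations[:-1]
reference = conversations[-1]["content"][0]["text"]

prompt = processor.apply_chat_template(eval_conversation, add_generation_prompt=True)
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=128)
prediction = processor.batch_decode(output[:, inputs.input_ids.shape[-1]:], skip_special_tokens=True)[0]
# compare prediction against reference with exact match / BLEU / ROUGE etc.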
Oh, also, regarding the device mismatch: I couldn't reproduce it when running inference. Can you make sure you have the latest accelerate version?
@zucchini-nlp yeah, I have the latest version of accelerate but multi-GPU training from the notebook still doesn't work. When I switch to using scripts, it works well. I just wonder why the PyTorch Lightning version takes a lot more GPU memory compared to the HF Trainer; I can barely fit batch size 1 on an 80GB GPU with PyTorch Lightning, but can use a bigger batch on a 48GB GPU with the HF Trainer. The outcome from PyTorch Lightning seems better, though.
And if you want to run generation in the validation loop, you need to crop the assistant's last turn and set add_generation_prompt=True when applying the chat template. Feeding the output to the model will give the assistant's answer, on which you can calculate other metrics like exact match or BLEU/ROUGE scores.
Is there any way I can get the assistant's answer in the in-between turns?
Is there any way I can get the assistant's answer in the in-between turns?
No, the model can only generate the next several tokens. For that you'd have to write your own generation loop which gets the assistant's answer and appends the user's next turn to the conversation, and in that way generates the whole conversation turn by turn.
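Something along these lines (a sketch only, reusing the names from the collators above):
# replay the user turns one by one and let the model fill in each assistant turn
user_turns = [turn for turn in conversations if turn["role"] == "user"]
history = []
for user_turn in user_turns:
    history.append(user_turn)
    prompt = processor.apply_chat_template(history, add_generation_prompt=True)
    inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=128)
    answer = processor.batch_decode(output[:, inputs.input_ids.shape[-1]:], skip_special_tokens=True)[0]
    history.append({"role": "assistant", "content": [{"type": "text", "text": answer}]})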
I just wonder why the PyTorch Lightning version takes a lot more GPU memory compared to the HF Trainer
Noted, will look into that and the flaky device mismatch errors. Several people have already reported it, so probably composite models aren't correctly placed on devices with accelerate. I'll add these to my TODO list; it will need more time to dig into.
Dear authors, I have encountered the same issue when running the inference code (as shown in the attached images). Do you have any ideas about how to fix it? Thanks.
@Holmes-GU which version of transformers are you using? Can you update to the latest release and check once more?