Raushan Turganbay

117 comments by Raushan Turganbay

@Luodian Hey again! Just wanted to check in and see if you had any updates on this. Thanks!

@Luodian I see, thanks. Implementing the current state of "llava-next-video" sounds good to me. The model shows very good performance on videos, and we can give it more visibility on...

The model is added to Transformers and will be part of the next 4.42 release! Please find all checkpoints here: https://huggingface.co/collections/llava-hf/llava-next-video-6666a9173a64c7052930f153 🤗
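
For anyone landing here later, a minimal single-video inference sketch (the checkpoint name, prompt wording, and frame-sampling helper are assumptions on my side; see the collection above for all available variants):

```python
import av
import numpy as np
import torch
from transformers import LlavaNextVideoProcessor, LlavaNextVideoForConditionalGeneration

# One of the checkpoints from the collection (assumed here for illustration)
model_id = "llava-hf/LLaVA-NeXT-Video-7B-hf"
processor = LlavaNextVideoProcessor.from_pretrained(model_id)
model = LlavaNextVideoForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

def read_video_frames(path, num_frames=8):
    # Uniformly sample `num_frames` frames from the video with PyAV.
    # Note: stream.frames can be 0 for some containers; this sketch assumes it is set.
    container = av.open(path)
    stream = container.streams.video[0]
    total = stream.frames
    indices = set(np.linspace(0, total - 1, num_frames).astype(int).tolist())
    frames = [
        f.to_ndarray(format="rgb24")
        for i, f in enumerate(container.decode(stream))
        if i in indices
    ]
    return np.stack(frames)

clip = read_video_frames("my_video.mp4")  # hypothetical local file
prompt = "USER: <video>\nWhat is happening in this video? ASSISTANT:"
inputs = processor(text=prompt, videos=clip, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))
```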

@HarperGG please take a look at the colab notebook for inference; there are code snippets for batch inference near the end. @zhengrongz yes, the config was missing the rope scaling factor....
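
For reference, batched inference roughly follows this pattern (a sketch reusing the `processor` and `model` from the snippet above; `clips` is assumed to be a list of frame arrays, one per prompt, and `padding=True` aligns the different prompt lengths):

```python
prompts = [
    "USER: <video>\nDescribe this video. ASSISTANT:",
    "USER: <video>\nWhat is the main object here? ASSISTANT:",
]
# clips: list of np.ndarray frame stacks, one per prompt (assumed already loaded)
inputs = processor(text=prompts, videos=clips, padding=True, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=100)
print(processor.batch_decode(output, skip_special_tokens=True))
```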

@zhengrongz you can strip off the prompt text based on input length, similar to below:

```python
inputs = processor(prompt, videos=clip, return_tensors="pt")
input_length = inputs.input_ids.shape[-1]
output = model.generate(**inputs)
output_stripped = output[:, input_length:]
```
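
From there, decoding the stripped ids gives just the generated answer (a small follow-up sketch):

```python
answer = processor.batch_decode(output_stripped, skip_special_tokens=True)[0]
print(answer)
```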

@Namzakku hey! yes, it should work with any inputs actually. Can you show the error you encountered and the minimal code to reproduce the error?

Ah I see, forgot that llava-next processes images in patches, and each image can contain a different number of patches, unlike videos where the number of frames is fixed. A solution can...

@ameeramer right, in the last release we made some changes in the backbone LLM which caused errors in LLaVA-NeXT-Video. Made [a PR](https://github.com/huggingface/transformers/pull/32527) to fix it and will probably do a patch...

@Namzakku Hmm, from the device map there doesn't seem to be anything that would cause device mismatch errors. Can you share the full traceback so we can see exactly where the tensors end up on different...

@Namzakku Yes, it supports multi-turn conversations just the way you have it in the example. You just need to pass in the convo to `processor.apply_chat_template()` and you'll get a correct...
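
To illustrate, a multi-turn conversation would look roughly like this (the role/content structure follows the llava-hf chat templates; the conversation text itself is made up, and `processor`, `model`, and `clip` come from the earlier snippets):

```python
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "video"},
            {"type": "text", "text": "What is happening in this video?"},
        ],
    },
    {
        "role": "assistant",
        "content": [{"type": "text", "text": "A dog is chasing a ball in a park."}],
    },
    {
        "role": "user",
        "content": [{"type": "text", "text": "What breed is the dog?"}],
    },
]
# Build the full prompt from the conversation, then run generation as usual
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(text=prompt, videos=clip, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=60)
```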