
LLaVa-NeXT needs tweaking for v4.47

Open jrp2014 opened this issue 1 year ago • 3 comments

I know that you think that Llava has been superseded but I think that it's still pretty good for captioning.

When I use your example script on mlx-community/llava-v1.6-34b-8bit, it warns that:

Expanding inputs for image tokens in LLaVa-NeXT should be done in processing. Please add `patch_size` and `vision_feature_select_strategy` to the model's processing config or set directly with `processor.patch_size = {{patch_size}}` and `processor.vision_feature_select_strategy = {{vision_feature_select_strategy}}`. Using processors without these attributes in the config is deprecated and will throw an error in v4.47.

I have no idea what this means, but it'd be great if mlx-vlm could be tweaked to give the model what it wants.

jrp2014 avatar Oct 25 '24 22:10 jrp2014

Hey,

This is not an issue yet.

HF is changing their processor setup in the upcoming transformers v4.47.

All you need to do is set `patch_size` and `vision_feature_select_strategy` on the processor config.
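For example, a small helper like this should silence the warning (the helper name is made up here, and the defaults `14` and `"default"` are assumptions for a CLIP ViT-L/14 vision tower — read the real values from the model's vision config if in doubt):

```python
def patch_processor(processor, patch_size=14, strategy="default"):
    """Set the attributes that transformers v4.47 will require on
    LLaVa-NeXT processors, if they are missing.

    NOTE: patch_size=14 and strategy="default" are assumed values for
    a CLIP ViT-L/14 vision tower; check the model's vision config.
    """
    if getattr(processor, "patch_size", None) is None:
        processor.patch_size = patch_size
    if getattr(processor, "vision_feature_select_strategy", None) is None:
        processor.vision_feature_select_strategy = strategy
    return processor
```

Then, right after loading the model and processor and before generating, call `patch_processor(processor)`.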

Blaizzy avatar Nov 16 '24 02:11 Blaizzy

Is there some documentation for what the default / full vision_feature_select_strategy settings do?

What is the strategy for determining patch_size?

jrp2014 avatar Nov 23 '24 19:11 jrp2014

Here you go: https://huggingface.co/docs/transformers/en/model_doc/llava#usage-tips

What is the strategy for determining patch_size?

It's fixed and defined during model pretraining. If you change it, the model might perform poorly or fail to run, because it affects the number of image tokens.
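To see why: the number of image tokens per tile is roughly `(image_size / patch_size)**2`, so changing `patch_size` changes the sequence length the model was trained to expect. A quick sketch, assuming the usual CLIP ViT-L/14 numbers (336 px tiles, 14 px patches) and the transformers convention that `"full"` keeps the CLS feature while `"default"` drops it:

```python
def image_tokens_per_tile(image_size=336, patch_size=14, strategy="default"):
    # Each tile is split into a grid of patch_size x patch_size patches.
    n = (image_size // patch_size) ** 2
    # "full" keeps the CLS token as an extra feature; "default" drops it.
    return n + 1 if strategy == "full" else n

print(image_tokens_per_tile())               # 576 with the assumed defaults
print(image_tokens_per_tile(patch_size=16))  # 441 -> a shape mismatch at runtime
```

So a different `patch_size` silently changes how many placeholder tokens the processor expands per image, which no longer matches what the vision tower produces.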

Blaizzy avatar Dec 21 '24 12:12 Blaizzy