LLaVa-NeXT needs tweaking for v4.47
I know you think Llava has been superseded, but I think it's still pretty good for captioning.
When I use your example script on mlx-community/llava-v1.6-34b-8bit, it warns that:
Expanding inputs for image tokens in LLaVa-NeXT should be done in processing. Please add `patch_size` and `vision_feature_select_strategy` to the model's processing config or set directly with `processor.patch_size = {{patch_size}}` and `processor.vision_feature_select_strategy = {{vision_feature_select_strategy}}`. Using processors without these attributes in the config is deprecated and will throw an error in v4.47.
I have no idea what this means, but it'd be great if mlx-vlm could be tweaked to give the model what it wants.
Hey,
This is not an issue yet. HF is changing their processor setup in the upcoming v4.47.
All you need to do is set `patch_size` and `vision_feature_select_strategy` on the processor.
Is there some documentation for what the default / full vision_feature_select_strategy settings do?
What is the strategy for determining patch_size?
Here you go: https://huggingface.co/docs/transformers/en/model_doc/llava#usage-tips
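As the docs describe it, `"default"` drops the vision tower's leading CLS token before the features are projected into the language model, while `"full"` keeps every token. A toy sketch of that selection step (plain lists stand in for the feature tensor; this is an illustration, not the library's actual implementation):

```python
def select_vision_features(hidden_states, strategy="default"):
    """Pick vision-tower tokens to feed the projector.
    hidden_states: one image's token features, index 0 being the CLS token."""
    if strategy == "default":
        return hidden_states[1:]  # drop the CLS token
    if strategy == "full":
        return hidden_states      # keep everything, CLS included
    raise ValueError(f"unknown strategy: {strategy!r}")
```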
> What is the strategy for determining patch_size?
It's fixed and defined during model pretraining. If you change it, the model might perform poorly or fail to run at all, because it determines the number of image tokens.
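To make the token-count dependence concrete: for a ViT-style tower, the image is cut into `patch_size`-pixel squares, and each square becomes one token. The numbers below assume a 336px input with 14px patches (the CLIP ViT-L/14-336 tower commonly used by LLaVA 1.6; treat the specific figures as illustrative):

```python
# How patch_size determines the number of image tokens per tile.
image_size = 336   # input resolution of the vision tower (assumed)
patch_size = 14    # side length of one patch (assumed)

patches_per_side = image_size // patch_size   # 336 / 14 = 24
num_image_tokens = patches_per_side ** 2      # 24 * 24 = 576
```

Change `patch_size` and the grid no longer matches what the model saw in pretraining, so the projected sequence length is wrong.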