LLaVa-NeXT needs tweaking for v4.47
I know you think Llava has been superseded, but I think it's still pretty good for captioning.
When I use your example script on mlx-community/llava-v1.6-34b-8bit, it warns that:
Expanding inputs for image tokens in LLaVa-NeXT should be done in processing. Please add `patch_size` and `vision_feature_select_strategy` to the model's processing config or set directly with `processor.patch_size = {{patch_size}}` and `processor.vision_feature_select_strategy = {{vision_feature_select_strategy}}`. Using processors without these attributes in the config is deprecated and will throw an error in v4.47.
I have no idea what this means, but it'd be great if mlx-vlm could be tweaked to give the model what it wants.
Hey,
This is not an issue yet. HF is changing their processor setup in the upcoming v4.47.
All you need to do is set `patch_size` and `vision_feature_select_strategy` on the processor.
Is there some documentation for what the default / full vision_feature_select_strategy settings do?
What is the strategy for determining patch_size?
Here you go: https://huggingface.co/docs/transformers/en/model_doc/llava#usage-tips
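As the docs describe it, `"default"` drops the vision tower's leading CLS token before the features are projected into the language model, while `"full"` keeps every token. A toy sketch of that selection step (plain lists stand in for the feature tensor; this is an illustration, not the library's actual implementation):

```python
def select_vision_features(hidden_states, strategy="default"):
    """Pick vision-tower tokens to feed the projector.
    hidden_states: one image's token features, index 0 being the CLS token."""
    if strategy == "default":
        return hidden_states[1:]  # drop the CLS token
    if strategy == "full":
        return hidden_states      # keep everything, CLS included
    raise ValueError(f"unknown strategy: {strategy!r}")
```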
> What is the strategy for determining patch_size?
It's fixed and defined during model pretraining. If you change it, the model might perform poorly or fail to run at all, because it determines the number of image tokens.
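To make the token-count dependence concrete: for a ViT-style tower, the image is cut into `patch_size`-pixel squares, and each square becomes one token. The numbers below assume a 336px input with 14px patches (the CLIP ViT-L/14-336 tower commonly used by LLaVA 1.6; treat the specific figures as illustrative):

```python
# How patch_size determines the number of image tokens per tile.
image_size = 336   # input resolution of the vision tower (assumed)
patch_size = 14    # side length of one patch (assumed)

patches_per_side = image_size // patch_size   # 336 / 14 = 24
num_image_tokens = patches_per_side ** 2      # 24 * 24 = 576
```

Change `patch_size` and the grid no longer matches what the model saw in pretraining, so the projected sequence length is wrong.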