> > LLaVA-1.5 uses 336px image resolution, so you should change the clip model and control max context length. Also, the image token length is set to 256 by default, but when the resolution is changed to 336, the image token length should be set to 576. Overall, some implementation details need further consideration to adapt to llava-1.5. You should check that in detail.

Open Amark-cheey opened this issue 1 year ago • 0 comments

The use of flash-attn should not affect the final performance.
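The 256 and 576 figures in the quoted advice follow from the vision encoder's patch grid. A minimal sketch of the arithmetic, assuming a CLIP ViT-L/14 encoder (patch size 14) and no extra class token in the count; `num_image_tokens` is a hypothetical helper, not a function from LISA or LLaVA:

```python
def num_image_tokens(image_size: int, patch_size: int = 14) -> int:
    """Number of patch tokens for a square image (assumes patch size 14)."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side


print(num_image_tokens(224))  # 16 * 16 = 256
print(num_image_tokens(336))  # 24 * 24 = 576
```

So moving from 224px to 336px adds 320 image tokens per sample, which is also why the max context length needs to be revisited.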

I used these settings with LLaVA-1.5, but some parts of the configuration still raise errors. May I ask for some guidance?

pred_embeddings = last_hidden_state[seg_token_mask]

[rank0]: IndexError: The shape of the mask [8, 348] at index 1 does not match the shape of the indexed tensor [8, 668, 336] at index 1

I tried changing 255 to 575, and it ran successfully.
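The numbers in the error are consistent with that fix: 668 - 348 = 320 = 575 - 255. A rough sketch of why, assuming the hard-coded 255 is the mask padding for the image tokens (one `<image>` placeholder expands to N embeddings, so the mask needs N - 1 extra positions) and that each prompt contains exactly one image; `pad_seg_mask` is a hypothetical stand-in written with plain lists, not the actual LISA code:

```python
def pad_seg_mask(seg_mask: list[bool], num_image_tokens: int) -> list[bool]:
    """Prepend False entries so the mask lines up with the hidden states.

    Assumption: a single image placeholder is replaced by
    num_image_tokens embeddings, adding num_image_tokens - 1 positions.
    """
    return [False] * (num_image_tokens - 1) + seg_mask


# Text-only mask length implied by the error message: 348 - 255 = 93.
text_mask = [False] * 92 + [True]

old_mask = pad_seg_mask(text_mask, 256)  # padding of 255, as for 224px
new_mask = pad_seg_mask(text_mask, 576)  # padding of 575, as for 336px

print(len(old_mask))  # 348, the mask length in the error
print(len(new_mask))  # 668, the hidden-state length in the error
```

Deriving the padding from the image token count instead of hard-coding 255 or 575 would avoid the same mismatch if the resolution changes again.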

Originally posted by @bxhsort in https://github.com/dvlab-research/LISA/issues/82#issuecomment-2490718764

Nov 21 '24 11:11 Amark-cheey