
Support NVILA 15B

Open anhnhust opened this issue 11 months ago • 7 comments

Hi, Do you have plans to support NVILA 15B in this repository? If yes, how soon do you expect it to be ready?

Thank you!

anhnhust avatar Dec 24 '24 09:12 anhnhust

Thank you for reaching out. As far as I know, the NVILA-15B model is very similar to the 8B model, so you can likely just try it with TinyChat 2.0.

Louym avatar Jan 07 '25 12:01 Louym

@Louym

Does TinyChat not support multi-GPU inference?

YoungjaeDev avatar Feb 04 '25 09:02 YoungjaeDev

> Thank you for reaching out. As far as I know, the NVILA-15B model is very similar to the 8B model, so you can likely just try it with TinyChat 2.0.

Thank you for your question. Since TinyChat mainly targets edge AI, we do not support multi-GPU inference for the time being.

Louym avatar Feb 04 '25 10:02 Louym

@Louym

When running the 8B video model on a 4090 GPU, I get an OOM (out-of-memory) error. Would it be okay to proceed with AWQ quantization after reducing num_image_tokens?

YoungjaeDev avatar Feb 04 '25 10:02 YoungjaeDev

Thank you for reaching out. In my experience, you can run the W4A16-quantized NVILA-8B model with no more than 128 frames on a single 4090 GPU.
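As a rough sanity check on that 128-frame budget, here is a back-of-the-envelope memory estimate for a 24 GB GPU. Every per-layer and per-frame number below is an illustrative assumption, not a measurement from TinyChat or the NVILA checkpoint:

```python
# Rough memory estimate for a W4A16-quantized 8B model.
# All layer/head/token counts below are assumptions for illustration.

def w4_weight_gib(n_params: float) -> float:
    """4-bit weights take 0.5 bytes per parameter."""
    return n_params * 0.5 / 1024**3

def kv_cache_gib(n_tokens: int, n_layers: int = 32, n_kv_heads: int = 8,
                 head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    """FP16 KV cache: K and V entries for every layer and every token."""
    return n_tokens * 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem / 1024**3

weights = w4_weight_gib(8e9)            # 8B params at 4 bits -> ~3.7 GiB
tokens_per_frame = 196                  # assumed visual tokens per video frame
kv = kv_cache_gib(128 * tokens_per_frame)
print(f"weights ~ {weights:.1f} GiB, KV cache for 128 frames ~ {kv:.1f} GiB")
```

Even with these rough numbers, weights plus the KV cache for 128 frames stay well under 24 GB, which is consistent with the 4090 fitting the quantized 8B model at that frame count.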

Louym avatar Feb 04 '25 11:02 Louym

> Thank you for reaching out. In my experience, you can run the W4A16-quantized NVILA-8B model with no more than 128 frames on a single 4090 GPU.

Are you referring to the 8B video model? I'll set the frame count to 128 in the configuration and try applying AWQ.

YoungjaeDev avatar Feb 04 '25 13:02 YoungjaeDev

@YoungjaeDev Yes, you can have a try!
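For anyone following along, a minimal sketch of capping the frame count in a local checkpoint before quantizing might look like this. The config field name `num_video_frames` is an assumption here, so check the actual keys in your checkpoint's `config.json`:

```python
# Hypothetical sketch: cap the video frame count in a local checkpoint's
# config.json before running AWQ quantization. The key "num_video_frames"
# is an assumed name -- verify it against your checkpoint.
import json

def cap_video_frames(config_path: str, max_frames: int = 128) -> dict:
    """Lower the frame count in config.json if it exceeds max_frames."""
    with open(config_path) as f:
        cfg = json.load(f)
    if cfg.get("num_video_frames", 0) > max_frames:
        cfg["num_video_frames"] = max_frames
        with open(config_path, "w") as f:
            json.dump(cfg, f, indent=2)
    return cfg
```

After editing the config, rerun the quantization step as usual; the smaller frame budget should reduce peak activation and KV-cache memory during inference.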

Louym avatar Feb 06 '25 03:02 Louym