VILA
Question about the inference efficiency experiment
Thanks for your nice work! In Figure 5 of the paper NVILA: Efficient Frontier Visual Language Models, token compression seems to provide no benefit to either TTFT or throughput. Why is that?
@zhijian-liu can you help clarify?
Hi @cokeshao, why do you say that? In the figure it is clear that token compression improves the LLM backbone's performance in both TTFT and throughput:
Thanks for your reply @danigarciaoca.
For image input, I see that your result is different. By the way, I would like to know how you measured TTFT and throughput. Did you release the code? Thanks so much.
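For reference, a common way to measure these two metrics with a Hugging Face causal LM looks roughly like the sketch below. This is only an illustrative measurement recipe under my own assumptions (placeholder model name, GPU available), not the authors' benchmarking code, and the NVILA repo may measure things differently:

```python
# Hypothetical TTFT / throughput measurement sketch (not the authors' code).
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2-1.5B"  # placeholder backbone, not an NVILA checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16
).cuda().eval()

prompt = "Describe the image in detail."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

with torch.inference_mode():
    # TTFT: time until the first new token is produced (prefill + one decode step).
    torch.cuda.synchronize()
    start = time.perf_counter()
    model.generate(**inputs, max_new_tokens=1, do_sample=False)
    torch.cuda.synchronize()
    ttft = time.perf_counter() - start

    # Throughput: new tokens per second over a longer decode
    # (this includes prefill time; subtract TTFT to isolate pure decode speed).
    num_new = 128
    torch.cuda.synchronize()
    start = time.perf_counter()
    out = model.generate(
        **inputs, max_new_tokens=num_new, min_new_tokens=num_new, do_sample=False
    )
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    generated = out.shape[1] - inputs["input_ids"].shape[1]

print(f"TTFT: {ttft * 1000:.1f} ms, throughput: {generated / elapsed:.1f} tok/s")
```

Whether the figure's numbers were produced this way is exactly what I'd like the authors to confirm.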
Oh, I see, my screenshot was regarding video input.
So maybe the figure was obtained by considering only temporal token compression, or it is an erratum and for image input they should have considered spatial token compression. Anyway, the authors will be able to clarify it.
Same question here https://github.com/NVlabs/VILA/issues/233
@zhijian-liu and @ys-2020 know more about the details and can comment.