VILA
Question about the inference efficiency experiment
Thanks for your nice work! In Figure 5 of the paper NVILA: Efficient Frontier Visual Language Models, token compression seems to provide no benefit to either TTFT or throughput. Why is that?
@zhijian-liu can you help clarify?
Hi @cokeshao, why do you say that? In the figure it is clear that token compression improves the LLM backbone's performance in both TTFT and throughput:
Thanks for your reply @danigarciaoca.
For image input, I see that your result is different. By the way, I would like to know how you measured TTFT and throughput. Did you release the code? Thanks so much.
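For reference, a common way to measure these two metrics with a Hugging Face causal LM looks roughly like the sketch below. This is only an illustrative measurement recipe under my own assumptions (placeholder model name, GPU available), not the authors' benchmarking code, and the NVILA repo may measure things differently:

```python
# Hypothetical TTFT / throughput measurement sketch (not the authors' code).
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2-1.5B"  # placeholder backbone, not an NVILA checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16
).cuda().eval()

prompt = "Describe the image in detail."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

with torch.inference_mode():
    # TTFT: time until the first new token is produced (prefill + one decode step).
    torch.cuda.synchronize()
    start = time.perf_counter()
    model.generate(**inputs, max_new_tokens=1, do_sample=False)
    torch.cuda.synchronize()
    ttft = time.perf_counter() - start

    # Throughput: new tokens per second over a longer decode
    # (this includes prefill time; subtract TTFT to isolate pure decode speed).
    num_new = 128
    torch.cuda.synchronize()
    start = time.perf_counter()
    out = model.generate(
        **inputs, max_new_tokens=num_new, min_new_tokens=num_new, do_sample=False
    )
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    generated = out.shape[1] - inputs["input_ids"].shape[1]

print(f"TTFT: {ttft * 1000:.1f} ms, throughput: {generated / elapsed:.1f} tok/s")
```

Whether the figure's numbers were produced this way is exactly what I'd like the authors to confirm.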
Oh, I see, my screenshot was regarding video input.
So maybe the figure was obtained by considering only temporal token compression, or it is an erratum and for image input they should have considered spatial token compression. Anyway, the authors will be able to clarify it.
Same question here https://github.com/NVlabs/VILA/issues/233
@zhijian-liu and @ys-2020 know more about the details and can comment.