
Question About the NVILA TTFT Ablation in the Paper

Open lxt160980 opened this issue 11 months ago • 0 comments

Great job! I have a few questions about your NVILA TTFT ablation results, and I am hoping to get some help here.

Image

In this figure, the TTFT breakdown shows that NVILA-FP16 and NVILA-FP16 + Token Compression have the same VisionTower time and LLM time, which is quite confusing. In the report you describe the approach as "scale up, then compress": more tiles are first passed independently to the encoder, and STC compression is applied afterwards.

If NVILA-FP16 itself already includes the "scale up" part, then adding token compression should shorten the LLM time, since the LLM sees fewer tokens.

If Token Compression is what includes the "scale up" part, wouldn't the vision tower need more computation and therefore produce a longer TTFT?

Or is there some form of parallelism (threads, for example) not mentioned in the paper that lets the tiles be processed at the same time?

I hope my question isn't a bother. Have a great day 😊
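The reasoning above can be made concrete with a toy cost model. This is only a sketch under stated assumptions: it assumes vision tower time grows linearly with the number of tiles and LLM prefill time grows linearly with the number of visual tokens. The function name `ttft_ms`, the per-tile and per-token costs, the tile counts, and the 4x compression ratio are all hypothetical placeholders, not measurements or settings from the paper.

```python
# Toy TTFT cost model. Every constant here is a hypothetical placeholder,
# NOT a number taken from the NVILA paper or code base.

def ttft_ms(num_tiles, tokens_per_tile, compression_ratio,
            vit_ms_per_tile=5.0, llm_ms_per_token=0.05):
    """Return (vision_ms, llm_ms, total_ms) under a linear cost assumption."""
    # Assumption: tiles are encoded independently, so cost is linear in tiles.
    vision_ms = num_tiles * vit_ms_per_tile
    # Token compression reduces the number of visual tokens sent to the LLM.
    visual_tokens = num_tiles * tokens_per_tile // compression_ratio
    # Assumption: LLM prefill cost is linear in the number of visual tokens.
    llm_ms = visual_tokens * llm_ms_per_token
    return vision_ms, llm_ms, vision_ms + llm_ms


# Case A: baseline, no compression.
baseline = ttft_ms(num_tiles=4, tokens_per_tile=256, compression_ratio=1)

# Case B: same tiles, compression added (baseline already "scaled up").
compress_only = ttft_ms(num_tiles=4, tokens_per_tile=256, compression_ratio=4)

# Case C: more tiles first, then compression ("scale up" is part of compression).
scale_then_compress = ttft_ms(num_tiles=16, tokens_per_tile=256, compression_ratio=4)

for name, result in [("baseline", baseline),
                     ("compress only", compress_only),
                     ("scale then compress", scale_then_compress)]:
    vision, llm, total = result
    print(f"{name}: vision={vision:.1f} ms, llm={llm:.1f} ms, total={total:.1f} ms")
```

Under these assumptions, case B lowers only the LLM time and case C raises only the vision tower time, so neither case leaves both components unchanged. That is the gap the question is asking about.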

lxt160980 · Apr 28 '25 09:04