Question About the NVILA TTFT Ablation in the Paper
Great job! I have a few questions about your NVILA TTFT ablation results, and I am hoping to get some help here.
In this figure, the TTFT breakdown shows that NVILA-FP16 and NVILA-FP16 + Token Compression have the same VisionTower time and LLM time, which is quite confusing.
In the report you describe "scale up then compress": more tiles are first passed independently to the encoder, and STC compression is applied afterwards.
If NVILA-FP16 itself already includes the "scale up" part, then adding token compression should shorten the LLM time, since the LLM receives fewer tokens.
If Token Compression is what includes the "scale up" part, wouldn't the vision tower need more computation to encode the extra tiles, and therefore give a longer TTFT? Or is there some form of parallelism (threads, for example) not mentioned in the paper that lets the tiles be processed at the same time?
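To make the two readings concrete, here is a toy cost model. Everything in it is my own assumption for illustration, not something taken from the paper: the function name `ttft_breakdown_ms`, the tile and token counts, the per-tile and per-token costs, and the idea that both stages scale linearly (real prefill cost grows faster than linearly with token count).

```python
def ttft_breakdown_ms(num_tiles, tokens_per_tile, compression_ratio,
                      vit_ms_per_tile, llm_ms_per_token):
    """Toy model: returns (vision_ms, llm_ms). All costs are hypothetical."""
    # Vision tower cost grows with the number of tiles encoded one after another.
    vision_ms = num_tiles * vit_ms_per_tile
    # Compression reduces the number of visual tokens handed to the LLM.
    llm_tokens = num_tiles * tokens_per_tile / compression_ratio
    # Simplification: prefill cost taken as linear in the token count.
    llm_ms = llm_tokens * llm_ms_per_token
    return vision_ms, llm_ms


# Made-up settings, chosen only to show the direction of each effect.
COSTS = dict(vit_ms_per_tile=10.0, llm_ms_per_token=0.05)

# Baseline: 4 tiles, no compression.
baseline = ttft_breakdown_ms(4, 256, 1, **COSTS)

# Reading A: the baseline is already scaled up, compression is the only change.
reading_a = ttft_breakdown_ms(4, 256, 4, **COSTS)

# Reading B: the compression setting also scales up to 4x more tiles.
reading_b = ttft_breakdown_ms(16, 256, 4, **COSTS)

for name, (vision_ms, llm_ms) in [("baseline", baseline),
                                  ("reading A", reading_a),
                                  ("reading B", reading_b)]:
    print(f"{name}: vision={vision_ms:.1f} ms, llm={llm_ms:.1f} ms")
```

Under these assumptions, reading A keeps the vision time and cuts the LLM time, while reading B keeps the LLM time and raises the vision time. Neither reading gives both bars unchanged, which is why the figure confuses me.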
I hope my question isn't a bother.
Have a great day 😊