
About the inference times reported in Figure 4 and Table 3

Open Osilly opened this issue 1 year ago • 3 comments

Hello, I would like to know whether the inference times reported in Figure 4 are measured without KV cache, while the "TPS" results in Table 3 are the prefill time (first-token inference time)?

Osilly avatar Sep 17 '24 14:09 Osilly

I'm also wondering how the TPS is calculated; can you provide a more detailed description of it? When I evaluate the official LLaVA-1.5 model on a single A100 40GB GPU, the average TPS I get (generated_token_number / generated_time) is about 29, while the paper reports 4.9. Is there something wrong?

AAbathur avatar Sep 19 '24 04:09 AAbathur

@Osilly @AAbathur Hello, we use the code below to compute tokens per second (TPS):

import time

import torch
from tqdm import tqdm

total_time = 0
total_tokens = 0
for index, item in tqdm(questions.iterrows(), ...):
    ...
    start = time.time()
    with torch.inference_mode():
        output_ids = model.generate(input_ids, images, ..., use_cache=True)
    ...
    end = time.time()
    ...
    # Count only the newly generated tokens for this sample.
    tokens = output_ids.shape[1] - input_token_len
    total_tokens += tokens
    total_time += end - start

# Average generation throughput over the whole evaluation set.
token_per_second = total_tokens / total_time

KV cache is used with a single A100 (80GB) GPU.

LiWentomng avatar Sep 20 '24 07:09 LiWentomng

@LiWentomng Thanks for your response! I would like to understand why there is such a significant gap in the inference times in Figure 4 between LLaVA-TokenPacker and the official LLaVA-1.5. In our experiments, reducing the number of image tokens typically only accelerates the prefill stage (first-token generation) and has almost no impact on the subsequent decoding steps when the KV cache is used (most of the overhead is in the linear layers). This issue is also discussed in pkunlp-icler/FastV#22. Could you provide more details?
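For example, one rough way to time the two stages separately (just a sketch: it assumes the same positional model.generate(input_ids, images, ...) call and output layout as the snippet above, and time_prefill_and_decode is a hypothetical helper, not from the TokenPacker code):

import time

import torch

def time_prefill_and_decode(model, input_ids, images, max_new_tokens=128):
    with torch.inference_mode():
        # Prefill-only pass: generating a single token runs the full
        # (image + text) context through the model once.
        start = time.time()
        model.generate(input_ids, images, max_new_tokens=1, use_cache=True)
        prefill_time = time.time() - start

        # Full pass: prefill plus autoregressive decoding with KV cache,
        # where each step processes only one new token.
        start = time.time()
        output_ids = model.generate(input_ids, images,
                                    max_new_tokens=max_new_tokens, use_cache=True)
        total_time = time.time() - start

    # As in the snippet above, output_ids is assumed to include the prompt tokens.
    new_tokens = output_ids.shape[1] - input_ids.shape[1]
    decode_tps = (new_tokens - 1) / max(total_time - prefill_time, 1e-6)
    return prefill_time, decode_tps

With the KV cache enabled, I would expect image-token reduction to mainly shrink prefill_time, while decode_tps stays roughly the same.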

Osilly avatar Sep 20 '24 12:09 Osilly