llm-analysis
latency [BUG]
The latency reported by the tool and the actual time I observe when inferencing are not the same, and the difference between the two is huge. What could be the problem?
Can you share how you run the tool and the actual time you saw in your benchmarking?
I ran the tool through a Slurm job. The actual time I observed was measured by loading the model and timing it with the time module, from when the prompt is given until decoding finishes. All of this was done on a CPU, not a GPU.
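For reference, a minimal sketch of the wall-clock timing described above (the `generate` function here is a hypothetical placeholder for the real model's prompt-to-decode step; an actual pipeline call would slot in the same way):

```python
import time

def generate(prompt: str) -> str:
    # Hypothetical stand-in for loading-free inference: prompt in, decoded text out.
    return prompt[::-1]

# perf_counter is a monotonic clock, preferred over time.time() for intervals.
start = time.perf_counter()
output = generate("Hello, world")
elapsed = time.perf_counter() - start

print(f"end-to-end latency: {elapsed * 1000:.3f} ms")
```

Note that if model loading is included inside the timed region, the measured number will be far larger than a pure inference-latency estimate.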
What happened is that the tool's inference time does not match the inference time I measure in real time. I ran it on a V100 GPU card with 16 GB.