
[Feature]: Compute and log the serving FLOPs

zhuohan123 opened this issue Mar 19, 2024 · 10 comments

🚀 The feature, motivation and pitch

vLLM should compute and log the serving FLOPs. This would be helpful for debugging performance and checking GPU utilization.
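As background on the utilization check: once serving FLOPs are logged, dividing the achieved FLOP rate by the hardware's theoretical peak gives model FLOPs utilization (MFU). A minimal sketch of that arithmetic, assuming an A100's dense BF16 peak of 312 TFLOP/s and a hypothetical measured serving rate:

```python
# Assumed numbers for illustration only: A100 dense BF16 peak,
# and a hypothetical measured serving rate of 40 TFLOP/s.
peak_flops_per_s = 312e12
achieved_flops_per_s = 40e12

mfu = achieved_flops_per_s / peak_flops_per_s
print(f"MFU: {mfu:.1%}")  # -> MFU: 12.8%
```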

Alternatives

No response

Additional context

No response

zhuohan123 avatar Mar 19 '24 07:03 zhuohan123

Happy to take this up!

ayusher avatar Apr 02 '24 03:04 ayusher

@ayusher Just out of curiosity, how would you go about doing this? Let's take the simplest case of a single non-sharded model running on one machine. I assume we want to log the actual FLOPs / MACs (multiply-accumulates) that the hardware is doing, not an estimate derived from the model's modules (e.g. via profiling or theoretical per-module counting), since that doesn't account for the kernel implementation of each module (different CUDA implementations of attention can require different amounts of FLOPs, right?). What would be the right way of doing this without introducing too much overhead?
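For reference, PyTorch's built-in profiler can attach per-operator FLOP estimates to whatever kernels actually get dispatched. The counts are derived from operator shapes (matmuls, convolutions, and a few other ops), so this is still an estimate rather than a hardware counter, but it does reflect the ops a given implementation runs. A minimal sketch; the linear layer is a placeholder standing in for one model forward step:

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Placeholder workload; in vLLM this would be one prefill/decode step.
model = torch.nn.Linear(4096, 4096).cuda()
x = torch.randn(8, 4096, device="cuda")

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    with_flops=True,  # attach shape-based FLOP estimates to supported ops
) as prof:
    model(x)

# Sum per-operator estimates; ops without a FLOP formula contribute 0.
total_flops = sum(evt.flops or 0 for evt in prof.key_averages())
print(f"~{total_flops / 1e9:.2f} GFLOPs")
```

Profiling every step would add noticeable overhead, so presumably something like this could only run on occasional sampled steps rather than per request.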

gardberg avatar Apr 04 '24 16:04 gardberg

Yeah, after further research this looks like a more involved problem than I initially anticipated. I'm not sure of the best approach or whether this is worth pursuing right now.

ayusher avatar Apr 26 '24 18:04 ayusher

Yeah, I agree. I know that some NVIDIA GPU libraries let you get FLOP counts, but it doesn't seem like a trivial problem (not a good first issue, I guess, lol). I'd love to help if we find a manageable way, though!

gardberg avatar Apr 26 '24 18:04 gardberg

Thank you all!

Do the FLOPs need to be exact, or is a close estimate okay?

As others in this thread have discussed, it's not straightforward to calculate FLOPs without digging into internals, and doing so may add overhead during inference. If a close estimate is acceptable, I can try to implement a basic FLOPs calculator.
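For what it's worth, the standard closed-form estimate for a decoder-only transformer is roughly 2 FLOPs per parameter per token for the weight matmuls, plus an attention term that grows with context length. A sketch of a calculator along those lines; `ModelDims` and the function name are hypothetical, not existing vLLM API, and the constants follow the usual 2N approximation:

```python
from dataclasses import dataclass

@dataclass
class ModelDims:
    # Hypothetical container for the shapes the estimate needs.
    n_params: int    # total parameter count
    n_layers: int
    hidden_size: int

def estimate_flops_per_token(dims: ModelDims, context_len: int) -> float:
    """Analytical estimate, not a measurement.

    Weight matmuls: ~2 FLOPs per parameter per token
    (one multiply plus one add per weight).
    Attention score/value matmuls: ~4 * context_len * hidden_size
    FLOPs per layer per token (2 for QK^T, 2 for weighting V).
    """
    dense = 2 * dims.n_params
    attention = 4 * dims.n_layers * dims.hidden_size * context_len
    return dense + attention

# Example: a 7B-parameter, 32-layer, 4096-hidden model at 2k context.
dims = ModelDims(n_params=7_000_000_000, n_layers=32, hidden_size=4096)
print(f"~{estimate_flops_per_token(dims, 2048) / 1e9:.1f} GFLOPs/token")
```

Since the engine already tracks per-sequence context lengths in the scheduler, an estimator like this could presumably run once per engine step with negligible overhead.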

rakshithvasudev avatar Aug 24 '24 03:08 rakshithvasudev

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

github-actions[bot] avatar Nov 23 '24 02:11 github-actions[bot]

This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant. Thank you!

github-actions[bot] avatar Dec 24 '24 02:12 github-actions[bot]

Can I take this up?

krtkvrm avatar Jan 29 '25 09:01 krtkvrm

Hi @zhuohan123, is this issue still open to work on?

gangula-karthik avatar Apr 17 '25 07:04 gangula-karthik

@zhuohan123 @robertgshaw2-redhat @WoosukKwon Hi, is this issue still open? I'd like to take this if it is still needed.

duhaode520 avatar Apr 30 '25 16:04 duhaode520

@duhaode520 Do you have any ideas as to how to approach this?

plops655 avatar May 19 '25 08:05 plops655

I have tried addressing this in my PR. It would be nice if someone could take a look at it.

sysradium avatar Jun 06 '25 17:06 sysradium

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

github-actions[bot] avatar Sep 05 '25 02:09 github-actions[bot]

This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant. Thank you!

github-actions[bot] avatar Oct 06 '25 02:10 github-actions[bot]