nm-vllm
[Timings] Add the ability to log times for async and sync calls
Summary
- Add the ability to time function calls
- Will be enabled unless the `--disable-log-stats` CLI arg is used for the server, as the timer's init and average calculations are now all done within the `StatLogger`
- Once enabled, all functions decorated with `@log_time` and `@log_async_time` will be timed, with their measurements added to a list for every server request made
- Average time values are computed and printed to the CLI after a time interval has passed (controlled by the `StatLogger`)
- Measurements are cleared after the average is calculated
Remaining Questions:
- Currently using the python logger to log the times to the cli; do we want to print instead?
Testing:
The following can now be used to enable time logging while the server is running:
```shell
python -m vllm.entrypoints.api_server --model neuralmagic/OpenHermes-2.5-Mistral-7B-pruned50 --port 5000
```
For any case where we want to time arbitrary blocks of code, without the use of decorators, the following is an example of how the code can be updated:
```python
import numpy

from timings.utils import get_singleton_manager

with get_singleton_manager().time("some_name_to_track"):
    x = numpy.sum(...)
```
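A self-contained sketch of what such a `time()` context manager could look like, and how the recorded values might be read back afterwards. The `_Timings` class below is illustrative only and stands in for the real manager returned by `get_singleton_manager()`:

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class _Timings:
    """Minimal stand-in for the timings manager, for illustration only."""

    def __init__(self):
        # One list of elapsed durations (seconds) per tracked name.
        self.measurements = defaultdict(list)

    @contextmanager
    def time(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.measurements[name].append(time.perf_counter() - start)


manager = _Timings()

with manager.time("some_name_to_track"):
    total = sum(range(1000))

# Each entry under the tracked name is one elapsed duration in seconds.
durations = manager.measurements["some_name_to_track"]
```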
Can you include a simple test or some starter code, and an example of how to access the timings, please? I think setting the max_tokens arg in vLLM might (maybe) fix the number of calls.