Ability to calculate token-per-second speeds, including recording time to first token
Currently we record duration_ms, input_tokens and output_tokens, but that's not quite enough. I'd like to start recording the time that the first token arrived too, so that we can produce separate tokens-per-second figures for the prompt-reading phase and the token-output phase.
Maybe a migration like this:
@migration
def m017_first_token_ms(db):
    db["responses"].add_column("first_token_ms", int)
Then record that value as response.first_token_ms at this point in the code, when the first chunk is yielded:
https://github.com/simonw/llm/blob/fa34d7d45279f176bde19eeb78d20135227bbc52/llm/models.py#L547-L561
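Something like this, as a minimal sketch of the idea (not the actual models.py code - the helper name is made up, and it assumes response.first_token_ms starts out as None):

import time

def record_first_token(chunks, response, start):
    # Hypothetical wrapper around the chunk iterator: note when the first
    # chunk arrives, as milliseconds since the request started.
    # `start` is a time.monotonic() value captured when the request began.
    for chunk in chunks:
        if response.first_token_ms is None:
            response.first_token_ms = int((time.monotonic() - start) * 1000)
        yield chunk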
There's one edge-case where this could break: the llm-anthropic plugin has an option to output prefill text like this:
https://github.com/simonw/llm-anthropic/blob/6c23b89c20dccf138c3e37d31853a796b5551106/llm_anthropic.py#L391-L403
if stream:
    with messages_client.stream(**kwargs) as stream:
        if prefill_text:
            yield prefill_text
        for chunk in stream:
            if hasattr(chunk, "delta"):
                delta = chunk.delta
                if hasattr(delta, "text"):
                    yield delta.text
                elif hasattr(delta, "partial_json"):
                    yield delta.partial_json
        # This records usage and other data:
        response.response_json = stream.get_final_message().model_dump()
I can address that by modifying the plugin so it only yields the prefill text once the first token comes back. It would still skew the statistics a bit though - ideally the plugin would have a mechanism for saying "don't include these output tokens in your calculations of tokens-per-second", but I'm not sure I can come up with a clean abstraction for that which doesn't cause more confusion than it's worth.
... actually those tokens won't be included in the official count that the model reports and records in output_tokens, so having a hack where the prefill is returned only at the moment the first token comes back would be OK.
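Something along these lines for the plugin (an untested sketch - same structure as the code above, just deferring the prefill until the first real delta arrives):

if stream:
    with messages_client.stream(**kwargs) as stream:
        prefill_sent = False
        for chunk in stream:
            if hasattr(chunk, "delta"):
                delta = chunk.delta
                if hasattr(delta, "text") or hasattr(delta, "partial_json"):
                    if prefill_text and not prefill_sent:
                        # Only emit the prefill once the first real token has
                        # arrived, so it doesn't skew time-to-first-token
                        yield prefill_text
                        prefill_sent = True
                    if hasattr(delta, "text"):
                        yield delta.text
                    else:
                        yield delta.partial_json
        # This records usage and other data:
        response.response_json = stream.get_final_message().model_dump()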
Also worth noting that in --no-stream situations the time to first token measurement may not actually make sense. Maybe I should record null for those?
I'm a little nervous that there's some industry-standard way of calculating this that I'm unaware of. I'm going to take a risk and implement it anyway, documenting exactly how I'm calculating the numbers - if I turn out to get that wrong I can change how it works and apologize later.
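For the record, here's the calculation I have in mind (a sketch - the helper name is just for illustration, and it returns None for --no-stream responses where first_token_ms wasn't recorded):

def token_speeds(input_tokens, output_tokens, first_token_ms, duration_ms):
    # Input phase: from the start of the request to the first token.
    # Output phase: from the first token to the end of the response.
    if first_token_ms is None:
        return None, None
    input_tps = input_tokens / (first_token_ms / 1000)
    output_tps = output_tokens / ((duration_ms - first_token_ms) / 1000)
    return input_tps, output_tps

Sanity check against the second run shown below: 41 / 1.217 gives 33.69 input tokens/second and 242 / (2.826 - 1.217) gives 150.40 output tokens/second.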
For plugins like llm-mlx it would be nice if the time spent loading the model from disk could be excluded from these calculations.
Tricky to design that - would need a mechanism whereby the plugin can say "Actually set the start point at this time X".
Maybe add a new load_time_ms thing which is optional but, if populated, tracks when the model finished loading.
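A sketch of what that could look like on the plugin side (hypothetical - the _load_model() helper and generate() call are made up, but execute() is the standard plugin hook):

import time

def execute(self, prompt, stream, response, conversation):
    # Hypothetical llm-mlx-style sketch: record the point at which the
    # model finished loading from disk, as milliseconds from the start
    # of execute(), so that time can be excluded from the calculations
    start = time.monotonic()
    model = self._load_model()  # made-up helper that loads weights from disk
    response.load_time_ms = int((time.monotonic() - start) * 1000)
    for token in model.generate(prompt.prompt):  # made-up generation API
        yield token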
I built a prototype. Here's a fun thing where I ran it against mlx-community/Llama-3.2-3B-Instruct-4bit via llm-mlx:
llm -m mlx-community/Llama-3.2-3B-Instruct-4bit 'a poem about a badger' -u
In twilight woods, where shadows play,
A badger wanders, in her secret way.
...
Token usage: 41 input, 235 output
41 input tokens, first token after 794619 ms, estimated 0.05 input tokens/second, 235 output tokens between 794619 and 796273, estimated 142.08 output tokens/second
It took AGES to load the model the first time for some reason. Subsequent runs of the same prompt were much faster:
Token usage: 41 input, 242 output
41 input tokens, first token after 1217 ms, estimated 33.69 input tokens/second, 242 output tokens between 1217 and 2826, estimated 150.40 output tokens/second
And:
Token usage: 41 input, 235 output
41 input tokens, first token after 979 ms, estimated 41.88 input tokens/second, 235 output tokens between 979 and 2557, estimated 148.92 output tokens/second
Strong argument for implementing that model loading time idea.
I think this is a sound approach (came here looking for exactly this so I can gauge model performance on low-end machines)