
Ability to calculate token-per-second speeds, including recording time to first token

Open simonw opened this issue 7 months ago • 7 comments

Currently we record duration_ms, input_tokens and output_tokens, but that's not quite enough. I'd like to start recording the time the first token arrived too, so that we can produce separate tokens-per-second figures for the prompt-reading phase and the token-output phase.
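
A rough sketch of the arithmetic this would enable, assuming the new value (call it first_token_ms here) is measured from the start of the request, like duration_ms:

def tokens_per_second(input_tokens, output_tokens, duration_ms, first_token_ms):
    # Input phase: from sending the prompt until the first token arrives
    input_tps = input_tokens / (first_token_ms / 1000)
    # Output phase: from the first token until the response completes
    output_tps = output_tokens / ((duration_ms - first_token_ms) / 1000)
    return input_tps, output_tps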

simonw avatar Apr 20 '25 19:04 simonw

Maybe a migration like this:

@migration
def m017_first_token_ms(db):
    db["responses"].add_column("first_token_ms", int)

Then record that value as response.first_token_ms at this point in the code, when the first chunk is yielded:

https://github.com/simonw/llm/blob/fa34d7d45279f176bde19eeb78d20135227bbc52/llm/models.py#L547-L561
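
A minimal sketch of the shape of that change - the names here (wrap_stream, start) are illustrative, not the actual models.py internals, and start would be a time.monotonic() captured when the request was made:

import time

def wrap_stream(response, chunks, start):
    # Hypothetical helper: passes chunks through and stamps
    # response.first_token_ms when the first one arrives
    first_seen = False
    for chunk in chunks:
        if not first_seen:
            response.first_token_ms = int((time.monotonic() - start) * 1000)
            first_seen = True
        yield chunk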

simonw avatar Apr 20 '25 19:04 simonw

There's one edge case where this could break: the llm-anthropic plugin has an option to output prefill text like this:

https://github.com/simonw/llm-anthropic/blob/6c23b89c20dccf138c3e37d31853a796b5551106/llm_anthropic.py#L391-L403

        if stream:
            with messages_client.stream(**kwargs) as stream:
                if prefill_text:
                    yield prefill_text
                for chunk in stream:
                    if hasattr(chunk, "delta"):
                        delta = chunk.delta
                        if hasattr(delta, "text"):
                            yield delta.text
                        elif hasattr(delta, "partial_json"):
                            yield delta.partial_json
                # This records usage and other data:
                response.response_json = stream.get_final_message().model_dump()

I can address that by modifying the plugin so it only yields the prefill once the first token comes back. That would still skew the statistics a bit though - ideally the plugin would have a mechanism for saying "don't include these output tokens in your tokens-per-second calculations", but I'm not sure I can come up with a clean abstraction for that without causing more confusion than it's worth.

... actually those tokens won't be included in the official count that the model reports and records in output_tokens, so a hack where the prefill is returned only at the moment the first token comes back would be OK.
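
One possible shape for that hack (not the actual llm-anthropic patch - it just holds the prefill yield back until the first real chunk arrives, so the first-token timestamp reflects the model rather than the prefill):

        if stream:
            with messages_client.stream(**kwargs) as stream:
                prefill_pending = bool(prefill_text)
                for chunk in stream:
                    if hasattr(chunk, "delta"):
                        delta = chunk.delta
                        if hasattr(delta, "text") or hasattr(delta, "partial_json"):
                            if prefill_pending:
                                # Only emit the prefill once the model has
                                # actually started responding
                                yield prefill_text
                                prefill_pending = False
                        if hasattr(delta, "text"):
                            yield delta.text
                        elif hasattr(delta, "partial_json"):
                            yield delta.partial_json
                # This records usage and other data:
                response.response_json = stream.get_final_message().model_dump()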

simonw avatar Apr 20 '25 19:04 simonw

Also worth noting that in --no-stream situations the time-to-first-token measurement may not actually make sense. Maybe I should record null for those?

simonw avatar Apr 20 '25 19:04 simonw

I'm a little nervous that there's some industry-standard way of calculating this that I'm unaware of. I'm going to take a risk and implement it with documentation explaining how I'm calculating it - if it turns out I got that wrong I can change how it works and apologize later.

simonw avatar Apr 20 '25 19:04 simonw

For plugins like llm-mlx it would be nice if the time spent loading the model from disk could be excluded from these calculations.

Tricky to design that - would need a mechanism whereby the plugin can say "Actually set the start point at this time X".

Maybe add a new optional load_time_ms value which, if populated, records when the model finished loading.
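
A sketch of how that could feed into the input-phase calculation, with load_time_ms as the hypothetical optional value:

def input_tokens_per_second(input_tokens, first_token_ms, load_time_ms=None):
    # If the plugin reported when the model finished loading, measure the
    # prompt-reading phase from that point instead of from the request start
    start_ms = load_time_ms or 0
    return input_tokens / ((first_token_ms - start_ms) / 1000)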

simonw avatar Apr 20 '25 20:04 simonw

I built a prototype. Here's a fun result from running it against mlx-community/Llama-3.2-3B-Instruct-4bit via llm-mlx:

llm -m mlx-community/Llama-3.2-3B-Instruct-4bit 'a poem about a badger' -u
In twilight woods, where shadows play,
A badger wanders, in her secret way.
...
Token usage: 41 input, 235 output
41 input tokens, first token after 794619 ms, estimated 0.05 input tokens/second
, 235 output tokens between 794619 and 796273, estimated 142.08 output tokens/second

It took AGES to load the model the first time for some reason. Subsequent runs of the same prompt were much faster:

Token usage: 41 input, 242 output
41 input tokens, first token after 1217 ms, estimated 33.69 input tokens/second
, 242 output tokens between 1217 and 2826, estimated 150.40 output tokens/second

And:

Token usage: 41 input, 235 output
41 input tokens, first token after 979 ms, estimated 41.88 input tokens/second
, 235 output tokens between 979 and 2557, estimated 148.92 output tokens/second
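
As a sanity check against that last run, the arithmetic sketched earlier reproduces the reported figures:

input_tps = 41 / (979 / 1000)             # ≈ 41.88 input tokens/second
output_tps = 235 / ((2557 - 979) / 1000)  # ≈ 148.92 output tokens/second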

Strong argument for implementing that model loading time idea.

simonw avatar Apr 20 '25 21:04 simonw

I think this is a sound approach (came here looking for exactly this so I can gauge model performance on low-end machines)

rcarmo avatar May 14 '25 08:05 rcarmo