[Proposal] `llm` Token Cost Tracking
Background
- Large language models provided as a service tend to be billed on a per-token basis. Prices vary by vendor, and there is often a different charge for "input" and "output" tokens.
- llm-prices.com is a website maintained by Simon Willison that tracks per-token LLM API pricing
- llm is an Open Source "CLI utility and Python library for interacting with Large Language Models, both via remote APIs and models that can be installed and run on your own machine" made by Simon Willison
Introduction
llm currently does not support tracking the costs associated with its use. This proposal aims to add that feature, along with some related ones. Implementing it is trickier than one might assume, as discussed below.
Complicating Factors
Models May Not Have One Single Cost
Many models are available from multiple providers, for instance the same model may be served by both Bedrock and Groq, often at different prices. We need a way to denote costs in a provider-centric manner.
Model Costs May Change With Time
Over time, providers are liable to change their per-token pricing. That means we cannot say definitively what a call cost at any given time, as our data source is potentially out of date.
My proposed solution to this issue, holistically, is to track historical pricing for LLMs in a ledger, and to calculate costs by comparing the timestamp of each API call against the ledger entries in effect at that date and time. This means that all cost calculations will be eventually accurate (once the ledger is updated).
It Is Often Not Possible to Predict Token Usage for Many Models and Binary Files
Many providers do not publish their tokenizers, and token counts for binary attachments such as images depend on provider-specific rules. This makes it impossible to predict token usage prior to sending the API call (and waiting for the response).
For our purposes, this is probably not an issue, but it is something to be aware of.
Proposal
Ledger Option 1: Use litellm's model_prices_and_context_window.json as the source of truth.
- pull this file down at some cadence into a cache to provide pricing info for virtually all LLM providers and models
- provide convenient hooks for plugins to pull this info in (a rough sketch of this flow follows below)
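For illustration, here is a minimal sketch of that flow, assuming the input_cost_per_token / output_cost_per_token field names the litellm file uses today; the cache path and the 24-hour refresh window are placeholder choices, not a committed design:

```python
import json
import time
import urllib.request
from pathlib import Path

LITELLM_URL = (
    "https://raw.githubusercontent.com/BerriAI/litellm/main/"
    "model_prices_and_context_window.json"
)
CACHE = Path.home() / ".cache" / "llm" / "model_prices.json"  # hypothetical location

def get_prices(max_age_seconds=24 * 60 * 60):
    """Return cached pricing data, refetching when the cache is stale."""
    if not CACHE.exists() or time.time() - CACHE.stat().st_mtime > max_age_seconds:
        CACHE.parent.mkdir(parents=True, exist_ok=True)
        with urllib.request.urlopen(LITELLM_URL) as response:
            CACHE.write_bytes(response.read())
    return json.loads(CACHE.read_text())

def cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    entry = get_prices()[model]  # a KeyError here means the model isn't priced
    return (
        input_tokens * entry["input_cost_per_token"]
        + output_tokens * entry["output_cost_per_token"]
    )
```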
Ledger Option 2: Maintain Our Own, With a Temporal Dimension
A git repo for the ledger shall be created, consisting of a list of providers, the models they provide, and a mapping of token costs as of a certain date. Each API provider will have a separate sub-ledger file, kept in a human-editable format such as YAML or TOML, so changes are easy for humans to track, append, and edit.
Example Ledger Format:
```toml
[openai."gpt-4"]
entries = [
  { effective_date = "2023-03-01T00:00:00Z", input_price_per_1k_tokens = 0.03, output_price_per_1k_tokens = 0.06 },
  { effective_date = "2024-04-15T00:00:00Z", input_price_per_1k_tokens = 0.03, output_price_per_1k_tokens = 0.06 }
]

[openai."gpt-3.5-turbo"]
entries = [
  { effective_date = "2023-01-01T00:00:00Z", input_price_per_1k_tokens = 0.0015, output_price_per_1k_tokens = 0.002 },
  { effective_date = "2023-11-06T00:00:00Z", input_price_per_1k_tokens = 0.001, output_price_per_1k_tokens = 0.002 },
  { effective_date = "2024-04-15T00:00:00Z", input_price_per_1k_tokens = 0.0005, output_price_per_1k_tokens = 0.0015 }
]

[openai."gpt-4o"]
entries = [
  { effective_date = "2024-04-15T00:00:00Z", input_price_per_1k_tokens = 0.005, output_price_per_1k_tokens = 0.015 }
]

[openrouter."gpt-4"]
entries = [
  { effective_date = "2023-06-15T00:00:00Z", input_price_per_1k_tokens = 0.035, output_price_per_1k_tokens = 0.07 },
  { effective_date = "2024-01-10T00:00:00Z", input_price_per_1k_tokens = 0.025, output_price_per_1k_tokens = 0.055 }
]

[openrouter."gpt-3.5-turbo"]
entries = [
  { effective_date = "2023-05-15T00:00:00Z", input_price_per_1k_tokens = 0.0018, output_price_per_1k_tokens = 0.0022 },
  { effective_date = "2024-02-01T00:00:00Z", input_price_per_1k_tokens = 0.0008, output_price_per_1k_tokens = 0.0017 }
]

[anthropic."claude-3-opus"]
entries = [
  { effective_date = "2024-03-01T00:00:00Z", input_price_per_1k_tokens = 0.015, output_price_per_1k_tokens = 0.075 }
]

[anthropic."claude-3-sonnet"]
entries = [
  { effective_date = "2024-03-01T00:00:00Z", input_price_per_1k_tokens = 0.003, output_price_per_1k_tokens = 0.015 }
]
```
This ledger will be pulled down in much the same way the litellm JSON would be.
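As a sketch of how cost calculation against a sub-ledger like the one above could work (function names and the file path are illustrative; tomllib needs Python 3.11+, and the model keys are quoted in the TOML above because a bare dotted key like openai.gpt-3.5-turbo would otherwise be split on every dot):

```python
import tomllib
from datetime import datetime, timezone

def applicable_entry(ledger: dict, provider: str, model: str, at: datetime) -> dict:
    """Pick the latest ledger entry whose effective_date is on or before `at`."""
    eligible = [
        e for e in ledger[provider][model]["entries"]
        if datetime.fromisoformat(e["effective_date"]) <= at
    ]
    return max(eligible, key=lambda e: datetime.fromisoformat(e["effective_date"]))

def cost_usd(ledger, provider, model, at, input_tokens, output_tokens):
    entry = applicable_entry(ledger, provider, model, at)
    return (
        input_tokens / 1000 * entry["input_price_per_1k_tokens"]
        + output_tokens / 1000 * entry["output_price_per_1k_tokens"]
    )

with open("openai.toml", "rb") as f:  # a per-provider sub-ledger file
    ledger = tomllib.load(f)

when = datetime(2024, 5, 1, tzinfo=timezone.utc)
print(cost_usd(ledger, "openai", "gpt-4o", when, 1200, 300))  # 0.0105
```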
Potential Related Features
- warnings for high-cost calls, with a user-configurable threshold for what counts as "high" (a tiny sketch of this check follows below)
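A minimal sketch of what that check could look like, with a hypothetical user-configured threshold (the setting name is made up):

```python
import sys

# Hypothetical user setting; could live in llm's existing config mechanism
HIGH_COST_THRESHOLD_USD = 0.50

def warn_if_expensive(estimated_cost_usd: float) -> None:
    """Emit the warning on stderr so piped stdout output stays clean."""
    if estimated_cost_usd >= HIGH_COST_THRESHOLD_USD:
        print(
            f"Warning: estimated cost ${estimated_cost_usd:.4f} exceeds "
            f"your configured threshold of ${HIGH_COST_THRESHOLD_USD:.2f}",
            file=sys.stderr,
        )
```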
See Also
- https://github.com/AgentOps-AI/tokencost
- https://github.com/BerriAI/litellm/blob/main/model_prices_and_context_window.json
- https://yourgpt.ai/tools/openai-and-other-llm-api-pricing-calculator
(For context to other readers, we talked about this briefly in person at PyCon US - this proposal is a continuation of that conversation.)
I created the https://github.com/simonw/llm-prices repo with the goal of building it out into this kind of API - the git history there already tracks some aspects of historical prices, but I'm on board with making that explicit in the JSON.
I'd worried about not knowing exactly what time and in what timezone the new prices went live. I'm inclined to store prices as UTC dates, assuming changes at UTC midnight, but allow entries with exact UTC times where that's known.
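A small sketch of parsing that dual format, treating a bare date as UTC midnight (uses Python 3.11+'s fromisoformat, which accepts the Z suffix; the helper name is made up):

```python
from datetime import datetime, timezone

def parse_effective_date(value: str) -> datetime:
    """Accept '2024-04-15' (assumed UTC midnight) or '2024-04-15T12:00:00Z'."""
    dt = datetime.fromisoformat(value)
    if dt.tzinfo is None:  # a bare date parses as a naive midnight
        dt = dt.replace(tzinfo=timezone.utc)
    return dt

assert parse_effective_date("2024-04-15") == parse_effective_date("2024-04-15T00:00:00Z")
```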
The one big challenge is that we need unique identifiers for the models that correspond to what's in the plugins. openai/o4-mini seems like a fine format to me, that's what I've been adopting for my most recent plugins like https://github.com/simonw/llm-openai-plugin and https://github.com/simonw/llm-anthropic
For price prediction, I'd really like LLM to grow a token counting feature. For both Claude and Gemini you can send a prompt (with attachments and system prompts and tools and such like) to a special endpoint to get back a count without spending the money first. I built a demo of one of those here: https://tools.simonwillison.net/claude-token-counter
I'm fine with only some plugins being able to provide token counting estimates, but these could calculate predicted prices too.
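For reference, Anthropic's counting endpoint looks roughly like this through their Python SDK (the model name and message here are just examples, and the exact SDK surface may vary by version):

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Counts tokens for the prompt without executing it, so no output tokens are spent
count = client.messages.count_tokens(
    model="claude-3-5-haiku-latest",
    messages=[{"role": "user", "content": "Describe the image"}],
)
print(count.input_tokens)
```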
One possible design for that:
```
llm -m anthropic/claude-3.5-haiku \
  -a image.jpg \
  "Describe the image" --count-tokens
```
This would return an integer count as opposed to executing the prompt. Maybe add --estimate to get the price estimate too.
Short versions could be -C and -E (because -c is taken already; -e isn't, so lowercase could work there, but -e is used by llm logs and I would want the same option there.)
> The one big challenge is that we need unique identifiers for the models that correspond to what's in the plugins. openai/o4-mini seems like a fine format to me, that's what I've been adopting for my most recent plugins like https://github.com/simonw/llm-openai-plugin and https://github.com/simonw/llm-anthropic
I think a schema that is compatible with yours but covers more ground is something like provider/family/model, but when the provider and the family are the same, they may be consolidated into a single field. For instance, GPT-4 from OpenAI directly is either openai/openai/gpt-4 or openai/gpt-4, but on Azure it is azure/openai/gpt-4 and on OpenRouter it is openrouter/openai/gpt-4.
Alternatively, we could go with what you suggest, which is also what https://github.com/BerriAI/litellm/blob/main/model_prices_and_context_window.json uses. So for the above, they use azure/gpt-4 or openrouter/gpt-4.
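To make the consolidation rule concrete, here's a sketch of parsing both spellings into a common triple (illustrative only, not a committed schema):

```python
from typing import NamedTuple

class ModelId(NamedTuple):
    provider: str
    family: str
    model: str

def parse_model_id(raw: str) -> ModelId:
    """Parse 'provider/family/model', treating 'provider/model' as the
    consolidated form where the provider is also the family."""
    parts = raw.split("/")
    if len(parts) == 3:
        return ModelId(*parts)
    if len(parts) == 2:
        provider, model = parts
        return ModelId(provider, provider, model)
    raise ValueError(f"Unrecognized model id: {raw!r}")

assert parse_model_id("openai/gpt-4") == ModelId("openai", "openai", "gpt-4")
assert parse_model_id("azure/openai/gpt-4") == ModelId("azure", "openai", "gpt-4")
```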
Some other random thoughts:
- should we provide an opt-in option for llm to reach out to a 3rd party (the LLM cost api)?
- what happens if the API is unreachable? Just work off the cache presumably, but do we alert the user somehow?
- How often do we hit the API? Once a day at midnight? Assume that models won't change more often than that?
- Do we store temporal/historical pricing locally, or just on the backend? If we check once a day for updates at, say, midnight, but we know that at noon tomorrow the pricing is going to change for a model, do we give the new or old prices? If we give temporal prices (i.e. both, with timestamps) then we don't have to worry about this, but does that mean that every client (llm) gets the entire history of pricing? Is there a reason they shouldn't be given that?
- Do we store the pricing in the database, or do we just store token usage and calculate on the fly? If we store calculated prices and the cache is out of date, then we could potentially store bad data forever
- could provide a command to recalculate some or all usage based on the ledger/temporal pricing
- This all may be overthinking - do we anticipate vendors changing prices with no notice?
- "Starting immediately we have reduced our pricing by XX%" seems very realistic.
- Is this a problem that is actually worth the effort to solve at a local level? If a few api calls are a few fractions of a penny off, who cares?
- Maybe we just do a best effort calculation at the time the call is made, and provide a mechanism to recalculate which will correct the values if need be.
> I think a schema that is compatible with yours but covers more ground is something like provider/family/model, but when the provider and the family are the same, they may be consolidated into a single field.
The other thing I've been thinking about recently is how, if you run a prompt against gemini-1.5-flash-8b-latest, it might execute against gemini-1.5-flash-8b-001 one week and -002 the next.
I'd like to record the actual model that was used somewhere.
> - should we provide an opt-in option for llm to reach out to a 3rd party (the LLM cost api)?
I'd be happy to NOT fetch the file unless you use the new -E/--estimate option, and mention that in the documentation.
> - what happens if the API is unreachable? Just work off the cache presumably, but do we alert the user somehow?
Work off the cached value and show a stderr warning I think.
> - How often do we hit the API? Once a day at midnight? Assume that models won't change more often than that?
I think we hit it on demand if the cached copy is more than 24 hours old.
> - Do we store temporal/historical pricing locally, or just on the backend?
I think cache everything - it's still going to be measured in the dozens of KBs.
> - If we check once a day for updates at, say, midnight, but we know that at noon tomorrow the pricing is going to change for a model, do we give the new or old prices?
Caching the full set of dates and prices solves this.
> - Do we store the pricing in the database, or do we just store token usage and calculate on the fly? If we store calculated prices and the cache is out of date, then we could potentially store bad data forever
This is tricky. Storing in the DB has major advantages in that you can use SQL to run sums against the column.
I think it's worth storing the calculated values in the database, and providing tools for recalculating against the latest pricing information. The price could even be in an estimated_cost column to hint that estimates can be updated with new estimates.
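A sketch of what that might look like against SQLite, with a hypothetical table and a stand-in for the ledger lookup (none of this reflects llm's actual schema):

```python
import sqlite3

def recalculated_cost(model, when, input_tokens, output_tokens):
    # Stand-in for a real lookup against the temporal ledger described above
    example_prices = {"openai/gpt-4o": (0.005, 0.015)}  # $/1K tokens, example data only
    inp, out = example_prices.get(model, (0.0, 0.0))
    return input_tokens / 1000 * inp + output_tokens / 1000 * out

db = sqlite3.connect("logs.db")  # hypothetical database file
db.execute(
    """CREATE TABLE IF NOT EXISTS responses (
        id INTEGER PRIMARY KEY, model TEXT, datetime_utc TEXT,
        input_tokens INTEGER, output_tokens INTEGER, estimated_cost REAL
    )"""
)

# With a real column, total spend becomes a plain SQL aggregate:
print(db.execute("SELECT sum(estimated_cost) FROM responses").fetchone()[0])

# ...and a recalculate command is just an UPDATE pass over stored usage:
rows = db.execute(
    "SELECT id, model, datetime_utc, input_tokens, output_tokens FROM responses"
).fetchall()
for row_id, model, when, in_tok, out_tok in rows:
    db.execute(
        "UPDATE responses SET estimated_cost = ? WHERE id = ?",
        (recalculated_cost(model, when, in_tok, out_tok), row_id),
    )
db.commit()
```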
> - could provide a command to recalculate some or all usage based on the ledger/temporal pricing
Yes, I like that.
> - This all may be overthinking - do we anticipate vendors changing prices with no notice?
I'm sure they will - I don't think I've seen many preannounce price changes so far.
> If a few api calls are a few fractions of a penny off, who cares?
I agree, as long as the documentation makes it clear that estimates may be inaccurate.
> - Maybe we just do a best effort calculation at the time the call is made, and provide a mechanism to recalculate which will correct the values if need be.
+1 to that.