Token count
Is there a method to get the number of tokens that the text is tokenized into?
Are you trying to get the final token usage from the whole flow? We display that in the Thought Process tab, and we get it from the "usage" details on the chat completion response. Or are you trying to get something else?
@pamelafox Thank you. I meant before the chat request is even made, when the RAG documents are split and tokenized for embeddings/vectorization. Basically I’m trying to see how many tokens all of my text chunks turn into at that ingestion/indexing stage, not just the final usage tokens in the chat response.
Let me know if more clarification is needed.
Ah I see, okay. We currently only show the number of tokens that we embed in a batch, with this output from the prepdocs script:
Computed embeddings in batch. Batch size: 16, Token count: 2334
If you want to see the size of each individual chunk, we don't currently log that out. I think the easiest place to add the log is in embeddings.py, where we measure the token length of the chunks before batching them up. Right after line 83:
text_token_length = self.calculate_token_length(text)
logger.info("Chunk token length: %d", text_token_length)
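For reference, that token counting is based on tiktoken (the same library the script at the bottom of this thread would use), so if you just want a quick standalone check of a chunk outside the prepdocs flow, a minimal sketch would be the following. The cl100k_base encoding here is an assumption; use whatever encoding matches your embedding model.

```python
# Minimal sketch: count tokens in a single chunk with tiktoken.
# cl100k_base is assumed here; match it to your embedding model.
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")

def chunk_token_length(text: str) -> int:
    # Encode the chunk text and count the resulting tokens
    return len(encoding.encode(text))

print(chunk_token_length("Example chunk text from one of your documents."))
```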
Here's a guide to our text splitting algorithm by the way, if you're wondering why the text was split the way it was:
https://github.com/Azure-Samples/azure-search-openai-demo/blob/main/docs/textsplitter.md
Let me know if additional guidance is needed.
Thanks @pamelafox! This is helpful. So for the already-indexed docs, I can’t retroactively log token counts without reprocessing them, right?
I’m going to provide a bit more context on why I'm asking about this. Feel free to share your opinion, as it would be highly valued. Basically I’m trying to do a cost analysis for our setup and see how each service scales: whether it’s more dependent on tokens/data size, on seats/users, or just on fixed baseline infra. My thought was to model cost roughly as something like:
Total Cost ≈ α × (Tokens) + β × (Seats) + γ (Fixed)
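To make that concrete, here's a rough sketch of how I'd compute it; the coefficients are placeholders I'd fit from actual invoices, not real prices:

```python
# Rough sketch of the cost model; alpha, beta, gamma are placeholder
# coefficients to be fit from real invoices, not actual Azure prices.
def estimate_monthly_cost(tokens: int, seats: int,
                          alpha: float = 2e-6,   # $ per token (placeholder)
                          beta: float = 5.0,     # $ per seat (placeholder)
                          gamma: float = 100.0   # fixed baseline $ (placeholder)
                          ) -> float:
    return alpha * tokens + beta * seats + gamma

# Example: 10M tokens processed and 50 users in a month
print(estimate_monthly_cost(tokens=10_000_000, seats=50))
```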
So I'd go through the services this app uses and categorize which bucket each one belongs to. I was thinking:
- Azure Cognitive Search: Tokens (index/query grows with tokenized data size)
- Microsoft Defender for Cloud: Seats (per resource monitored, like infra-seats)
- Container Registry: Fixed?
- Azure Container Apps: Fixed + Seats (baseline fixed, scales with load/users)
- Azure Storage: Tokens
- Log Analytics: probably Tokens (log/data volume), not sure
- Bandwidth: Seats
- Azure Applied AI Services: Tokens (per 1k tokens or per-page API usage)
The idea is then to combine total tokens processed per month with expected users to project cost per token.
Feel free to let me know if that breakdown seems reasonable, if I'm mischaracterizing some of these, or if I just don't make sense :)
The current cost estimation section is worth a look if you haven't started with that already: https://github.com/Azure-Samples/azure-search-openai-demo?tab=readme-ov-file#cost-estimation
For the non-multimodal deployment, I believe the primary per-token cost is Azure OpenAI Chat Completions and Embeddings. (Search storage and Blob storage are based on document file size, not token count.) For the Chat Completions, you would need to look at both the input tokens and the output tokens, but the output token count varies widely based on the question.
Given that variance, you could consider doing the estimation at the higher end of token usage. e.g. if you retrieve 3 results and each result is 500 tokens, that's roughly 1,500 input tokens, and a 1,000-token response adds 1,000 output tokens for that chat completion. You'd be over-estimating there, but otherwise you'll need to find a different way to account for token variance.
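As a rough illustration of that arithmetic (the per-1K-token prices below are placeholders, not actual Azure OpenAI rates; substitute the current pricing for your model and region):

```python
# Per-question cost sketch; prices are placeholders, not real Azure OpenAI rates.
INPUT_PRICE_PER_1K = 0.01   # $ per 1K input tokens (placeholder)
OUTPUT_PRICE_PER_1K = 0.03  # $ per 1K output tokens (placeholder)

def chat_completion_cost(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens / 1000) * INPUT_PRICE_PER_1K \
        + (output_tokens / 1000) * OUTPUT_PRICE_PER_1K

# High-end estimate: 3 retrieved chunks of ~500 tokens each, ~1000-token answer
print(chat_completion_cost(input_tokens=3 * 500, output_tokens=1000))
```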
As for measuring token counts once the chunks are in the index, you could write a script that iterates through the document chunks in the index and uses tiktoken to compute the token length, using the same function we use to count them during ingestion. (I'm just not sure it's worth it to get that level of granularity.)
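If you do want to go that route, here's a minimal sketch, assuming the chunk text lives in a field named "content", key-based auth via environment variables, and an embedding model that uses the cl100k_base encoding; adjust field names, credentials, and encoding to match your deployment:

```python
# Sketch: total the token counts of all chunks already in the search index.
# Assumes the chunk text is stored in a field named "content" and that the
# embedding model uses the cl100k_base encoding; adjust to your deployment.
import os

import tiktoken
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

search_client = SearchClient(
    endpoint=os.environ["AZURE_SEARCH_ENDPOINT"],
    index_name=os.environ["AZURE_SEARCH_INDEX"],
    credential=AzureKeyCredential(os.environ["AZURE_SEARCH_KEY"]),
)
encoding = tiktoken.get_encoding("cl100k_base")

total_tokens = 0
# search_text="*" matches every document; select only the chunk text field
for doc in search_client.search(search_text="*", select=["content"]):
    total_tokens += len(encoding.encode(doc["content"]))

print(f"Total tokens across indexed chunks: {total_tokens}")
```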