bug: EmbeddingBuilder::build() Exceeds OpenAI Token Limit Due to Lack of Token-Based Chunking
- [x] I have looked for existing issues (including closed) about this
Bug Report
When using EmbeddingBuilder::build() with OpenAI's text-embedding-3-small model, I encountered the following error:
```
Error: ProviderError: {
    "error": {
        "message": "Requested 510944 tokens, max 300000 tokens per request",
        "type": "max_tokens_per_request",
        "param": null,
        "code": "max_tokens_per_request"
    }
}
```
This occurred despite the implementation chunking requests based on `M::MAX_DOCUMENTS`. Upon investigation, it appears that while the method chunks documents by count, it doesn't account for the total token count per batch. Since OpenAI's API imposes a maximum token limit per request, the cumulative tokens of the documents in a batch can exceed the allowed limit and trigger this error.
Reproduction
```rust
let mut builder = EmbeddingsBuilder::new(model.clone());
for (i, chunk) in chunks.into_iter().enumerate() {
    builder = builder.document(Document {
        id: format!("chunk_{}", i),
        content: chunk,
    })?;
}
let embeddings = builder.build().await?;
```
where `chunks` contains strings of roughly 2,000 characters each, taken from a 500-page PDF.
Expected behavior
The builder should automatically group documents into batches that respect OpenAI's total token limit per request (e.g., 300,000 tokens), avoiding API errors due to oversized requests.
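For illustration, here is a minimal sketch of what such token-aware batching could look like, assuming a rough 4-characters-per-token estimate; the helper names are illustrative, not part of rig's API, and the 300,000 limit is the one from the error above:

```rust
/// Rough token estimate; assumes ~4 characters per token, which is only a
/// ballpark and varies by language and model.
fn estimate_tokens(text: &str) -> usize {
    text.len() / 4 + 1
}

/// Greedily group documents into batches whose estimated token total stays
/// under the provider's per-request limit (e.g. 300_000 for OpenAI).
fn batch_by_tokens(docs: Vec<String>, max_tokens_per_request: usize) -> Vec<Vec<String>> {
    let mut batches: Vec<Vec<String>> = Vec::new();
    let mut current: Vec<String> = Vec::new();
    let mut current_tokens = 0;

    for doc in docs {
        let tokens = estimate_tokens(&doc);
        // Start a new batch if adding this document would blow the budget.
        if !current.is_empty() && current_tokens + tokens > max_tokens_per_request {
            batches.push(std::mem::take(&mut current));
            current_tokens = 0;
        }
        current.push(doc);
        current_tokens += tokens;
    }
    if !current.is_empty() {
        batches.push(current);
    }
    batches
}
```

Each batch could additionally be capped at `M::MAX_DOCUMENTS`, so the existing count-based chunking is preserved.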
Additional context
In my case, the issue was resolved by manually splitting the documents into two batches and calling build() separately on each. However, this workaround defeats the purpose of automatic batching, and a proper fix would be to support token-aware batching logic inside the EmbeddingBuilder.
I'd be happy to try and fix the issue since I need to do it for my project anyway. Would you be interested in a (novice) contribution? If so, I'd appreciate some guidelines and directions for me to follow.
Hi @ProHaller, I believe we've talked on Discord already.
That would be great! The main thing is probably that you just need to make sure to estimate how many tokens are being used up (if we're not using tiktoken then probably `content.len() / 4`, as discussed) and then chunk it according to that.
Yes, we did. :)
`content.len() / 4` feels a bit hacky, so I'll try to go the Tiktoken route.
I am rather new to Rust and programming in general, and have never made a feature contribution until now, so I'll try to report my process and actions to allow for easier correction and guidance. This will make this thread a bit noisy as a consequence. Do tell me if I should stop.
Update
Here's what I've done so far:
Actions
- Forked the repo over here: ProHaller/Rig
- Tried to familiarize myself with the Rig code base structure, especially the implementation of `build()`
- Read about Tiktoken, which appears to be usable not only for OpenAI models but extendable through the `tiktoken_ext` extension mechanism
- There is already a library implementing Tiktoken in Rust by zurawiki: tiktoken-rs (see the sketch after this list)
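For reference, counting tokens with zurawiki's tiktoken-rs crate looks roughly like this (a sketch; it assumes the cl100k_base encoding used by text-embedding-3-small, and exact signatures may differ between crate versions):

```rust
use tiktoken_rs::cl100k_base;

/// Count tokens with the cl100k_base encoding (used by text-embedding-3-small).
/// Building the encoder is relatively expensive, so in real code it should be
/// constructed once and reused rather than rebuilt on every call.
fn count_tokens(text: &str) -> anyhow::Result<usize> {
    let bpe = cl100k_base()?;
    Ok(bpe.encode_with_special_tokens(text).len())
}
```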
Questions
1. TikToken's base implementation is actually written in Rust with Python bindings (MIT license). Should we reuse the tokenizer logic directly instead of going through external libraries?
2. We should probably add a `const CONTEXT_SIZE` on the `EmbeddingModel` trait to allow each model to set its own limit, falling back to a sensible default (like `4096` in tiktoken-rs). A rough sketch follows this list.
3. Passing everything through the tokenizer every time we build an embedding request would create significant delays. If, as I suspect, `max_tokens_per_request` and similar errors don't cost any credits, maybe a `split_and_retry` method based on the error response would make more sense?
4. Since there is already `MAX_DOCUMENTS` chunking, could we calculate a maximum token count per document and split the request accordingly?
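To make question 2 concrete, here's a trimmed-down sketch of what that could look like (not rig's actual trait definition; only `MAX_DOCUMENTS` is known to exist today, the rest is the proposal):

```rust
// Hypothetical, trimmed-down trait shown for illustration only.
pub trait EmbeddingModel {
    /// Already exists in rig: maximum number of documents per request.
    const MAX_DOCUMENTS: usize;

    /// Proposed: per-request token budget for the model, overridable per
    /// provider, with a conservative fallback (4096, as in tiktoken-rs).
    const CONTEXT_SIZE: usize = 4096;
}
```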
Anyway, that's all for now, I'd love to hear your thoughts. (@joshua-mo-143) I'll keep flooding the thread with my jumbled mumblings until I'm told to stfu.
1. I would look into how many dependencies `tiktoken-rs` would add to the project
2. It'd be good for this value to be configurable since the rate limit can change based on a bunch of factors
3. Yeah, this is why approximating might be a bit better; you could tokenize until you find the max and then use the length of the text at that point as the break point
4. See 3
Thanks for the answer
1. Good point. It doesn't look too bad, but some of the dependencies are OpenAI-specific, which is not ideal.
2. Yeah, that seems best indeed. A sensible default with easy modification would be good.
3. I'll do some tests with different texts and languages first, to see if I can approximate something useful.
After doing some quick tests with Tiktoken, I get roughly the following ratios for OpenAI:
- Alphabetic languages (English, French, Russian…): token count ≈ 24% of the character count
- Japanese and Chinese: ≈ 74%
- Korean: ≈ 60%
This introduces the complexity of determining the original language before applying the approximate limit; for example, a 2,000-character chunk comes out at roughly 480 tokens in English but roughly 1,480 in Japanese. And the ratios will differ slightly between models too.
To reduce the guesswork and limit the performance hit, we could run the first chunk of each document through Tiktoken and use that result as an estimate that accounts for the type of content being sent, provided each document is fairly uniform.
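For what it's worth, a rough sketch of that sampling idea (hypothetical helper; `count_tokens` stands in for an actual tokenizer call like the tiktoken-rs one above):

```rust
/// Estimate a document's token count by tokenizing only a prefix and
/// extrapolating the characters-per-token ratio to the whole text.
fn estimate_tokens_by_sampling(
    text: &str,
    sample_chars: usize,
    count_tokens: impl Fn(&str) -> usize,
) -> usize {
    let total_chars = text.chars().count();
    if total_chars == 0 {
        return 0;
    }
    // Tokenize only the first `sample_chars` characters of the document...
    let sample: String = text.chars().take(sample_chars).collect();
    let sampled_chars = sample.chars().count().max(1);
    let sampled_tokens = count_tokens(&sample).max(1);
    // ...and extrapolate that ratio, assuming the document is fairly uniform.
    let chars_per_token = sampled_chars as f64 / sampled_tokens as f64;
    (total_chars as f64 / chars_per_token).ceil() as usize
}
```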
But maybe that's premature optimization, I'll first integrate an optional tokenizer.
Would it be a good idea to have a `split_and_retry` on the build method?
I tried the following (with OpenAI) and it works rather well.
That would let a correctly sized request go through without the overhead of tokenizing, while providing a fallback for oversized ones. With a little work on per-model error parsing, that could be a quick-and-dirty solution.
We could also allow a recursive `retry_depth` argument on `.build()`, or wrap the method in an optional, more resilient `.build_with_retry(depth)` method?
```rust
// HACK: Try to embed the texts, and fall back on splitting the batch in half if
// max_tokens_per_request is exceeded.
let embeddings = match self.model.embed_texts(docs.clone()).await {
    Ok(results) => results,
    // HACK: This will not adapt to all API error messages; it would be better to
    // have proper error parsing per model.
    Err(e) if e.to_string().contains("max_tokens_per_request") => {
        let mut first_docs = docs;
        // HACK: It would be better to allow for a controlled splitting strategy.
        let second_docs = first_docs.split_off(first_docs.len() / 2);
        let mut embeddings = Vec::new();
        embeddings.extend(self.model.embed_texts(first_docs).await?);
        embeddings.extend(self.model.embed_texts(second_docs).await?);
        embeddings
    }
    Err(e) => return Err(e),
};
```
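To illustrate the recursive `retry_depth` idea, here's a hedged, standalone sketch; the names, signatures, and the generic `embed` callback are all hypothetical stand-ins for something like `model.embed_texts`:

```rust
use std::fmt::Display;
use std::future::Future;
use std::pin::Pin;

/// Hypothetical recursive variant: try the whole batch, and on a token-limit
/// error split it in half and retry each half, up to `depth` levels deep.
fn split_and_retry<'a, T, E, F, Fut>(
    embed: &'a F,
    docs: Vec<String>,
    depth: usize,
) -> Pin<Box<dyn Future<Output = Result<Vec<T>, E>> + 'a>>
where
    F: Fn(Vec<String>) -> Fut,
    Fut: Future<Output = Result<Vec<T>, E>> + 'a,
    T: 'a,
    E: Display + 'a,
{
    Box::pin(async move {
        match embed(docs.clone()).await {
            Ok(embeddings) => Ok(embeddings),
            // Only split while there is retry budget left and more than one doc.
            Err(e)
                if depth > 0
                    && docs.len() > 1
                    && e.to_string().contains("max_tokens_per_request") =>
            {
                let mut first = docs;
                let second = first.split_off(first.len() / 2);
                let mut out = split_and_retry(embed, first, depth - 1).await?;
                out.extend(split_and_retry(embed, second, depth - 1).await?);
                Ok(out)
            }
            Err(e) => Err(e),
        }
    })
}
```

In rig's case the callback would be something like `|batch| self.model.embed_texts(batch)`, and the depth cap plus the single-document check keep the recursion from running away when one document alone exceeds the limit.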
Hey @ProHaller! Apologies for the late response, we've been busy internally so external communications have been a bit quiet.
Yes I think that makes sense. Perhaps later on we may refactor it to automatically attempt to split if we can figure out how to get the exact size beforehand so we don't have to fail the request first.