bug: EmbeddingBuilder::build() Exceeds OpenAI Token Limit Due to Lack of Token-Based Chunking
- [x] I have looked for existing issues (including closed) about this
Bug Report
When using EmbeddingBuilder::build() with OpenAI's text-embedding-3-small model, I encountered the following error:
```
Error: ProviderError: {
    "error": {
        "message": "Requested 510944 tokens, max 300000 tokens per request",
        "type": "max_tokens_per_request",
        "param": null,
        "code": "max_tokens_per_request"
    }
}
```
This occurred despite the implementation chunking requests based on `M::MAX_DOCUMENTS`. Upon investigation, it appears that while the method chunks documents by count, it doesn't account for the total token count per batch. Since OpenAI's API imposes a maximum token limit per request, the cumulative tokens of the documents in a batch can exceed the allowed limit and trigger this error.
Reproduction
```rust
let mut builder = EmbeddingsBuilder::new(model.clone());
for (i, chunk) in chunks.into_iter().enumerate() {
    builder = builder.document(Document {
        id: format!("chunk_{}", i),
        content: chunk,
    })?;
}
let embeddings = builder.build().await?;
```
where `chunks` contains strings of roughly 2,000 characters each, taken from a 500-page PDF.
Expected behavior
The builder should automatically group documents into batches that respect OpenAI's total token limit per request (e.g., 300,000 tokens), avoiding API errors due to oversized requests.
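For illustration, here is a minimal sketch of what such token-aware batching could look like, assuming a rough 4-characters-per-token estimate; the helper names are illustrative, not part of rig's API, and the 300,000 limit is the one from the error above:

```rust
/// Rough token estimate; assumes ~4 characters per token, which is only a
/// ballpark and varies by language and model.
fn estimate_tokens(text: &str) -> usize {
    text.len() / 4 + 1
}

/// Greedily group documents into batches whose estimated token total stays
/// under the provider's per-request limit (e.g. 300_000 for OpenAI).
fn batch_by_tokens(docs: Vec<String>, max_tokens_per_request: usize) -> Vec<Vec<String>> {
    let mut batches: Vec<Vec<String>> = Vec::new();
    let mut current: Vec<String> = Vec::new();
    let mut current_tokens = 0;

    for doc in docs {
        let tokens = estimate_tokens(&doc);
        // Start a new batch if adding this document would blow the budget.
        if !current.is_empty() && current_tokens + tokens > max_tokens_per_request {
            batches.push(std::mem::take(&mut current));
            current_tokens = 0;
        }
        current.push(doc);
        current_tokens += tokens;
    }
    if !current.is_empty() {
        batches.push(current);
    }
    batches
}
```

Each batch could additionally be capped at `M::MAX_DOCUMENTS`, so the existing count-based chunking is preserved.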
Additional context
In my case, the issue was resolved by manually splitting the documents into two batches and calling build() separately on each. However, this workaround defeats the purpose of automatic batching, and a proper fix would be to support token-aware batching logic inside the EmbeddingBuilder.
I'd be happy to try and fix the issue since I need to do it for my project anyway. Would you be interested in a (novice) contribution? If so, I'd appreciate some guidelines and directions for me to follow.
Hi @ProHaller, I believe we've talked on Discord already.
That would be great! The main thing is probably that you just need to make sure to estimate how many tokens are being used up (if we're not using tiktoken then probably `content.len() / 4`, as discussed) and then chunk it according to that.
Yes, we did. :)
`content.len() / 4` feels a bit hacky, so I'll try to go the Tiktoken route.
I am rather new to Rust and programming in general, and have never made a feature contribution until now, so I'll try to report my process and actions to allow for easier correction and guidance. This will make this thread a bit noisy as a consequence. Do tell me if I should stop.
Update
Here's what I've done so far:
Actions
- Forked the repo over here: ProHaller/Rig
- Tried to familiarize myself with the Rig code base structure, especially the implementation of `build()`
- Read about Tiktoken, which appears to be usable not only for OpenAI models but extendable through the `tiktoken_ext` extension mechanism
- There is already a library implementing Tiktoken in Rust by zurawiki: tiktoken-rs (see the sketch after this list)
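For reference, counting tokens with zurawiki's tiktoken-rs crate looks roughly like this (a sketch; it assumes the cl100k_base encoding used by text-embedding-3-small, and exact signatures may differ between crate versions):

```rust
use tiktoken_rs::cl100k_base;

/// Count tokens with the cl100k_base encoding (used by text-embedding-3-small).
/// Building the encoder is relatively expensive, so in real code it should be
/// constructed once and reused rather than rebuilt on every call.
fn count_tokens(text: &str) -> anyhow::Result<usize> {
    let bpe = cl100k_base()?;
    Ok(bpe.encode_with_special_tokens(text).len())
}
```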
Questions
1. TikToken's base implementation is actually written in Rust with Python bindings (MIT license). Should we reuse the tokenizer logic directly instead of going through external libraries?
2. We should probably add a `const CONTEXT_SIZE` on the `EmbeddingModel` trait to allow each model to set its own limit, falling back to a sensible default (like `4096` in tiktoken-rs). A rough sketch follows this list.
3. Passing everything through the tokenizer every time we build an embedding request would create significant delays. If, as I suspect, `max_tokens_per_request` and similar errors don't cost any credits, maybe a `split_and_retry` method based on the error response would make more sense?
4. Since there is already `MAX_DOCUMENTS` chunking, could we calculate a maximum token count per document and split the request accordingly?
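To make question 2 concrete, here's a trimmed-down sketch of what that could look like (not rig's actual trait definition; only `MAX_DOCUMENTS` is known to exist today, the rest is the proposal):

```rust
// Hypothetical, trimmed-down trait shown for illustration only.
pub trait EmbeddingModel {
    /// Already exists in rig: maximum number of documents per request.
    const MAX_DOCUMENTS: usize;

    /// Proposed: per-request token budget for the model, overridable per
    /// provider, with a conservative fallback (4096, as in tiktoken-rs).
    const CONTEXT_SIZE: usize = 4096;
}
```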
Anyway, that's all for now, I'd love to hear your thoughts. (@joshua-mo-143) I'll keep flooding the thread with my jumbled mumblings until I'm told to stfu.
1. I would look into how many dependencies `tiktoken-rs` would add to the project
2. It'd be good for this value to be configurable since the rate limit can change based on a bunch of factors
3. Yeah, this is why approximating might be a bit better; you could tokenize until you find the max and then use the length of the text at that point as the break point
4. See 3
Thanks for the answer
1. Good point. It doesn't look too bad, but some of the dependencies are OpenAI-specific, which is not ideal.
2. Yeah, that seems best indeed. A sensible default with easy modification would be good.
3. I'll do some tests with different texts and languages first, to see if I can approximate something useful.
After doing some quick tests with Tiktoken, I get roughly the following ratios for OpenAI:
- Alphabetic languages (English, French, Russian…): token count ≈ 24% of the character count
- Japanese and Chinese: ≈ 74%
- Korean: ≈ 60%
This introduces the complexity of determining the original language before applying the approximate limit; for example, a 2,000-character chunk comes out at roughly 480 tokens in English but roughly 1,480 in Japanese. And the ratios will differ slightly between models too.
To reduce the guesswork and limit the performance hit, we could run the first chunk of each document through Tiktoken and use that result as an estimate that accounts for the type of content being sent, provided each document is fairly uniform.
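For what it's worth, a rough sketch of that sampling idea (hypothetical helper; `count_tokens` stands in for an actual tokenizer call like the tiktoken-rs one above):

```rust
/// Estimate a document's token count by tokenizing only a prefix and
/// extrapolating the characters-per-token ratio to the whole text.
fn estimate_tokens_by_sampling(
    text: &str,
    sample_chars: usize,
    count_tokens: impl Fn(&str) -> usize,
) -> usize {
    let total_chars = text.chars().count();
    if total_chars == 0 {
        return 0;
    }
    // Tokenize only the first `sample_chars` characters of the document...
    let sample: String = text.chars().take(sample_chars).collect();
    let sampled_chars = sample.chars().count().max(1);
    let sampled_tokens = count_tokens(&sample).max(1);
    // ...and extrapolate that ratio, assuming the document is fairly uniform.
    let chars_per_token = sampled_chars as f64 / sampled_tokens as f64;
    (total_chars as f64 / chars_per_token).ceil() as usize
}
```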
But maybe that's premature optimization, I'll first integrate an optional tokenizer.
Would it be a good idea to have a `split_and_retry` on the build method?
I tried the following (with OpenAI) and it works rather well.
That would let a correctly sized request go through without the overhead of tokenizing, while providing a fallback for oversized ones. With a little work on per-model error parsing, that could be a quick-and-dirty solution.
We could also allow a recursive `retry_depth` argument on `.build()`, or wrap the method in an optional, more resilient `.build_with_retry(depth)` method?
```rust
// HACK: Try to embed the texts, and fall back on splitting the batch in half if
// max_tokens_per_request is exceeded.
let embeddings = match self.model.embed_texts(docs.clone()).await {
    Ok(results) => results,
    // HACK: This will not adapt to all API error messages; it would be better to
    // have proper error parsing per model.
    Err(e) if e.to_string().contains("max_tokens_per_request") => {
        let mut first_docs = docs;
        // HACK: It would be better to allow for a controlled splitting strategy.
        let second_docs = first_docs.split_off(first_docs.len() / 2);
        let mut embeddings = Vec::new();
        embeddings.extend(self.model.embed_texts(first_docs).await?);
        embeddings.extend(self.model.embed_texts(second_docs).await?);
        embeddings
    }
    Err(e) => return Err(e),
};
```
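To illustrate the recursive `retry_depth` idea, here's a hedged, standalone sketch; the names, signatures, and the generic `embed` callback are all hypothetical stand-ins for something like `model.embed_texts`:

```rust
use std::fmt::Display;
use std::future::Future;
use std::pin::Pin;

/// Hypothetical recursive variant: try the whole batch, and on a token-limit
/// error split it in half and retry each half, up to `depth` levels deep.
fn split_and_retry<'a, T, E, F, Fut>(
    embed: &'a F,
    docs: Vec<String>,
    depth: usize,
) -> Pin<Box<dyn Future<Output = Result<Vec<T>, E>> + 'a>>
where
    F: Fn(Vec<String>) -> Fut,
    Fut: Future<Output = Result<Vec<T>, E>> + 'a,
    T: 'a,
    E: Display + 'a,
{
    Box::pin(async move {
        match embed(docs.clone()).await {
            Ok(embeddings) => Ok(embeddings),
            // Only split while there is retry budget left and more than one doc.
            Err(e)
                if depth > 0
                    && docs.len() > 1
                    && e.to_string().contains("max_tokens_per_request") =>
            {
                let mut first = docs;
                let second = first.split_off(first.len() / 2);
                let mut out = split_and_retry(embed, first, depth - 1).await?;
                out.extend(split_and_retry(embed, second, depth - 1).await?);
                Ok(out)
            }
            Err(e) => Err(e),
        }
    })
}
```

In rig's case the callback would be something like `|batch| self.model.embed_texts(batch)`, and the depth cap plus the single-document check keep the recursion from running away when one document alone exceeds the limit.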
Hey @ProHaller! Apologies for the late response, we've been busy internally so external communications have been a bit quiet.
Yes I think that makes sense. Perhaps later on we may refactor it to automatically attempt to split if we can figure out how to get the exact size beforehand so we don't have to fail the request first.