bug: EmbeddingsBuilder::build() Exceeds OpenAI Token Limit Due to Lack of Token-Based Chunking

Open ProHaller opened this issue 6 months ago • 9 comments

  • [x] I have looked for existing issues (including closed) about this

Bug Report

When using EmbeddingsBuilder::build() with OpenAI's text-embedding-3-small model, I encountered the following error:

Error: ProviderError: {
    "error": {
        "message": "Requested 510944 tokens, max 300000 tokens per request",
        "type": "max_tokens_per_request",
        "param": null,
        "code": "max_tokens_per_request"
    }
}

This occurred even though the implementation already chunks requests based on M::MAX_DOCUMENTS. Upon investigation, it appears that while build() batches documents by count, it doesn't account for the total token count per batch. Since OpenAI's API imposes a maximum token limit per request, a batch whose cumulative token count exceeds that limit triggers this error.

Reproduction

let mut builder = EmbeddingsBuilder::new(model.clone());
for (i, chunk) in chunks.into_iter().enumerate() {
    builder = builder.document(Document {
        id: format!("chunk_{}", i),
        content: chunk,
    })?;
}
let embeddings = builder.build().await?;

Here, chunks is a list of strings of roughly 2,000 characters each, extracted from a 500-page PDF.

Expected behavior

The builder should automatically group documents into batches that respect OpenAI's total token limit per request (e.g., 300,000 tokens), avoiding API errors due to oversized requests.

Additional context

In my case, the issue was resolved by manually splitting the documents into two batches and calling build() separately on each. However, this workaround defeats the purpose of automatic batching; a proper fix would be to support token-aware batching inside EmbeddingsBuilder.
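
For anyone hitting the same error, the manual workaround looked roughly like this. It's a sketch rather than my exact code; it reuses the Document struct and variables from the reproduction above and just keeps the chunk ids globally unique:

// Workaround sketch: split the chunks in half and run two separate builds.
let mid = chunks.len() / 2;
let (first_half, second_half) = chunks.split_at(mid);

let mut all_embeddings = Vec::new();
for (offset, half) in [(0, first_half), (mid, second_half)] {
    let mut builder = EmbeddingsBuilder::new(model.clone());
    for (i, chunk) in half.iter().enumerate() {
        builder = builder.document(Document {
            id: format!("chunk_{}", offset + i),
            content: chunk.clone(),
        })?;
    }
    all_embeddings.extend(builder.build().await?);
}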

ProHaller commented May 25 '25 06:05

I'd be happy to try and fix the issue since I need to do it for my project anyway. Would you be interested in a (novice) contribution? If so, I'd appreciate some guidelines and directions for me to follow.

ProHaller commented May 26 '25 02:05

Hi @ProHaller, I believe we've talked on Discord already.

That would be great! The main thing is probably that you just need to make sure to estimate how many tokens are being used up (if we're not using tiktoken then probably content.len() / 4 as discussed) and then chunk it according to that.
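
Something along these lines, as a rough sketch: the helper name and the 300,000 budget are placeholders, and the ~4 characters per token figure is just the estimate mentioned above.

// Sketch of estimate-based batching: group documents so each batch stays under
// both the document-count limit and an estimated token budget.
const TOKEN_BUDGET: usize = 300_000; // placeholder for OpenAI's max tokens per request

fn estimate_tokens(text: &str) -> usize {
    // Rough heuristic discussed above: about one token per 4 characters.
    text.len() / 4 + 1
}

fn batch_by_estimated_tokens(docs: Vec<String>, max_docs: usize) -> Vec<Vec<String>> {
    let mut batches = Vec::new();
    let mut current: Vec<String> = Vec::new();
    let mut current_tokens = 0;

    for doc in docs {
        let tokens = estimate_tokens(&doc);
        // Close the current batch if adding this document would exceed either limit.
        if !current.is_empty()
            && (current_tokens + tokens > TOKEN_BUDGET || current.len() >= max_docs)
        {
            batches.push(std::mem::take(&mut current));
            current_tokens = 0;
        }
        current_tokens += tokens;
        current.push(doc);
    }
    if !current.is_empty() {
        batches.push(current);
    }
    batches
}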

joshua-mo-143 commented May 26 '25 11:05

Yes, we did. :)

content.len() / 4 feels a bit hacky, so I'll try to go the Tiktoken route.
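
For reference, counting tokens with the tiktoken-rs crate looks roughly like this (assuming cl100k_base is the right encoding for the text-embedding-3 models):

// Token counting sketch using the tiktoken-rs crate (anyhow for the error type).
// In practice the CoreBPE should be built once and reused; it is expensive to construct.
use tiktoken_rs::cl100k_base;

fn count_tokens(text: &str) -> anyhow::Result<usize> {
    let bpe = cl100k_base()?;
    Ok(bpe.encode_with_special_tokens(text).len())
}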

ProHaller commented May 26 '25 12:05

I am rather new to Rust and programming in general, and have never made a feature contribution until now, so I'll try to report my process and actions to allow for easier correction and guidance. This will make this thread a bit noisy as a consequence. Do tell me if I should stop.

Update

Here is what I've done so far:

Actions

  1. Forked the repo over here: ProHaller/Rig
  2. Tried to familiarize myself with Rig code base structure, especially the implementation of build()
  3. Read about Tiktoken, which appears to be usable not only for OpenAI models but also extendable to other models through tiktoken_ext
  4. Found an existing Rust implementation of Tiktoken by zurawiki (tiktoken-rs)

Questions

  1. The TikToken base implementation is actually written in Rust with Python bindings (MIT license). Should we reuse that tokenizer logic directly instead of going through external libraries?
  2. We should probably add a const CONTEXT_SIZE to the EmbeddingModel trait so each model can set its own limit, with a fallback to a sensible default (like 4096 in tiktoken-rs); see the sketch after this list
  3. Passing every document through the tokenizer when building an embedding request would add significant overhead. If, as I believe, max_tokens_per_request and similar errors don't cost any credits, maybe a split_and_retry method based on the error response would make more sense?
  4. Since chunking by MAX_DOCUMENTS already exists, could we calculate a max_tokens_per_document and split the request accordingly?
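
For point 2, this is roughly what I'm picturing; it's only a sketch, not rig's current code, and the default value is just the tiktoken-rs fallback mentioned above:

// Hypothetical sketch of question 2: a per-model context size next to the
// existing MAX_DOCUMENTS, with a default that each provider can override.
pub trait EmbeddingModel {
    const MAX_DOCUMENTS: usize;
    // New associated const; 4096 mirrors the tiktoken-rs default, and OpenAI's
    // embedding models would set their own (much larger) per-request limit.
    const CONTEXT_SIZE: usize = 4096;

    // ... the existing methods (embed_texts, etc.) stay unchanged ...
}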

Anyway, that's all for now; I'd love to hear your thoughts (@joshua-mo-143). I'll keep flooding the thread with my jumbled mumblings until I'm told to stfu.

ProHaller commented May 28 '25 04:05

  1. I would look into how many dependencies the tiktoken-rs crate would add to the project
  2. It'd be good for this value to be configurable since the rate-limit can change based on a bunch of factors
  3. Yeah, this is why approximating might be a bit better; you could tokenize until you find the max and then use the length of the text as the break point
  4. See 3

yavens commented May 28 '25 16:05

Thanks for the answer

  1. Good point. It doesn't look too bad, but some of the dependencies are OpenAI-specific, which is not ideal.
  2. Yeah, that seems best indeed. A sensible default with easy modification would be good.
  3. I'll run some tests on different texts and languages first, to see if I can approximate something useful.

ProHaller commented May 30 '25 22:05

After doing some quick tests on Tiktoken, I get the following for OpenAI:

  1. Alphabetical languages (English, French, Russian…): token count ≈ 24% of the character count
  2. Japanese and Chinese: token count ≈ 74% of the character count
  3. Korean: token count ≈ 60% of the character count

This introduces the added complexity of detecting the original language before applying an approximate limit, and the ratios will also differ slightly between models.

ProHaller commented May 30 '25 23:05

To reduce the guesswork and limit the performance hit, we could run the first chunk of each document through TikToken and use that result as an estimate that accounts for the type of content being sent, provided each document is uniform.

But maybe that's premature optimization; I'll first integrate an optional tokenizer.
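
Roughly what I mean, as a sketch (the function name is made up, and it assumes each document's chunks are stylistically uniform):

// Calibration sketch: tokenize only the first chunk of a document, derive a
// chars-per-token ratio from it, and extrapolate to the whole document.
use tiktoken_rs::cl100k_base;

fn estimate_document_tokens(chunks: &[String]) -> anyhow::Result<usize> {
    let Some(first) = chunks.first() else {
        return Ok(0);
    };
    let bpe = cl100k_base()?;
    let first_tokens = bpe.encode_with_special_tokens(first).len().max(1);
    // Characters per token, measured on the first chunk only.
    let chars_per_token = first.len().max(1) as f64 / first_tokens as f64;
    let total_chars: usize = chunks.iter().map(|c| c.len()).sum();
    Ok((total_chars as f64 / chars_per_token).ceil() as usize)
}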

ProHaller commented May 30 '25 23:05

Would it be a good idea to have a split_and_retry on the build method? I tried the following (with OpenAI) and it works rather well. That would let a correctly sized request go through without the overhead of tokenizing, while providing a fallback for oversized ones. With a little work on per-model error parsing, that could be a quick-and-dirty solution. We could also add a recursive retry_depth argument to .build(), or wrap the method in an optional, more resilient .build_with_retry(depth); a rough recursive sketch follows the snippet below.

                // HACK: Try to embed the texts, falling back to splitting the batch in half
                // if max_tokens_per_request is exceeded.
                let embeddings = match self.model.embed_texts(docs.clone()).await {
                    Ok(results) => results,
                    // HACK: This will not match every provider's error message; proper
                    // error parsing per model would be better.
                    Err(e) if e.to_string().contains("max_tokens_per_request") => {
                        let mut first_docs = docs;
                        // HACK: A configurable splitting strategy would be better here.
                        let second_docs = first_docs.split_off(first_docs.len() / 2);
                        let mut embeddings = Vec::new();
                        embeddings.extend(self.model.embed_texts(first_docs).await?);
                        embeddings.extend(self.model.embed_texts(second_docs).await?);
                        embeddings
                    }
                    Err(e) => return Err(e),
                };
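
And the recursive variant I mentioned could look something like this. It's a standalone sketch rather than a patch: the embed_texts call shape matches the snippet above, but the import paths and type names (Embedding, EmbeddingError) are from memory and may need adjusting, and embed_with_split_retry is just a name I made up.

// Recursive split-and-retry sketch: split the batch in half on a token-limit
// error, up to `depth` times. The future is boxed because async fns can't
// recurse directly.
use std::future::Future;
use std::pin::Pin;

use rig::embeddings::{Embedding, EmbeddingError, EmbeddingModel};

fn embed_with_split_retry<'a, M: EmbeddingModel>(
    model: &'a M,
    docs: Vec<String>,
    depth: usize,
) -> Pin<Box<dyn Future<Output = Result<Vec<Embedding>, EmbeddingError>> + 'a>> {
    Box::pin(async move {
        match model.embed_texts(docs.clone()).await {
            Ok(embeddings) => Ok(embeddings),
            // Only retry on the token-limit error, and only while depth allows it.
            Err(e) if depth > 0 && e.to_string().contains("max_tokens_per_request") => {
                let mut first = docs;
                let second = first.split_off(first.len() / 2);
                let mut out = embed_with_split_retry(model, first, depth - 1).await?;
                out.extend(embed_with_split_retry(model, second, depth - 1).await?);
                Ok(out)
            }
            Err(e) => Err(e),
        }
    })
}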

ProHaller commented Jun 01 '25 01:06

Hey @ProHaller! Apologies for the late response, we've been busy internally so external communications have been a bit quiet.

Yes, I think that makes sense. Perhaps later on we may refactor it to attempt the split automatically, if we can figure out how to get the exact size beforehand, so we don't have to fail the request first.

joshua-mo-143 commented Jun 20 '25 22:06