
[Feature Request] Auto-throttling the embedding generation speed thru the use of x-ratelimit-* headers

Open • 0x7c13 opened this issue 11 months ago • 1 comment

Context / Scenario

I was trying to ingest a large (26MB) PDF file using a serverless KM instance locally the other day and found that it took a really long time for the indexing/embedding to complete. I profiled the code and realized that the actual extraction process happens really quickly.

The reason it takes so long is that GenerateEmbeddingsHandler calls the ITextEmbeddingGenerator in a sequential foreach loop. We could theoretically convert the existing code to use Parallel.ForEach (or rather Parallel.ForEachAsync, since the body is async) to drastically improve the embedding speed, since the embeddings for the partition files are not logically coupled.

Example:

// Collect results from the concurrent iterations in a thread-safe dictionary.
ConcurrentDictionary<string, DataPipeline.GeneratedFileDetails> newFiles = new();

// Parallel.ForEachAsync awaits the async body; plain Parallel.ForEach with an
// async lambda would not, so the async overload is used here.
await Parallel.ForEachAsync(uploadedFile.GeneratedFiles,
    new ParallelOptions { MaxDegreeOfParallelism = ... },
    async (generatedFile, cancellationToken) =>
    {
        ...
        newFiles.TryAdd(embeddingFileName, embeddingFileNameDetails);
    });

However, although this works for me, it is still not an ideal solution, since both OpenAI and Azure OpenAI have built-in rate limiters that prevent clients from abusing the endpoint.

The point is that even without converting the code to Parallel.ForEach, we could still see 429 errors, because there is no guarantee we stay within the rate limit without knowing the context, especially if we run multiple KM instances at the same time, which would potentially call the embedding API concurrently.

The problem

We could implement our own GenerateEmbeddingsHandler, or even a better ITextEmbeddingGenerator implementation, that runs embeddings in parallel and handles 429 errors through exponential backoff retries. But this is still not an ideal solution, since we would need to carefully configure the KM instance (or multiple KM instances) with the maximum TPM we can use for the chosen model or embedding service provider at any given moment.
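For illustration, here is a minimal sketch of that blind retry approach; the callEmbeddingApi delegate and the backoff constants are assumptions, not KM code, and the point is precisely that it has to guess because it knows nothing about the actual rate-limit state:

using System;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

public static class BlindRetry
{
    // Retries an embedding call on HTTP 429 with exponential backoff.
    // Without the x-ratelimit-* context, the delays are pure guesswork.
    public static async Task<HttpResponseMessage> SendWithRetriesAsync(
        Func<Task<HttpResponseMessage>> callEmbeddingApi, int maxAttempts = 5)
    {
        for (int attempt = 1; ; attempt++)
        {
            HttpResponseMessage response = await callEmbeddingApi();
            if (response.StatusCode != HttpStatusCode.TooManyRequests || attempt == maxAttempts)
            {
                return response;
            }

            // Blind backoff: 1s, 2s, 4s, ...
            await Task.Delay(TimeSpan.FromSeconds(Math.Pow(2, attempt - 1)));
        }
    }
}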

Luckily, both the OpenAI service and the Azure OpenAI service return rate-limiting information as part of the response headers for the Chat and Embedding REST APIs:

| Field | Sample value | Description |
|---|---|---|
| x-ratelimit-limit-requests | 60 | The maximum number of requests that are permitted before exhausting the rate limit. |
| x-ratelimit-limit-tokens | 150000 | The maximum number of tokens that are permitted before exhausting the rate limit. |
| x-ratelimit-remaining-requests | 59 | The remaining number of requests that are permitted before exhausting the rate limit. |
| x-ratelimit-remaining-tokens | 149984 | The remaining number of tokens that are permitted before exhausting the rate limit. |
| x-ratelimit-reset-requests | 1s | The time until the rate limit (based on requests) resets to its initial state. |
| x-ratelimit-reset-tokens | 6m0s | The time until the rate limit (based on tokens) resets to its initial state. |
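For reference, these headers can be read straight off the raw HTTP response; the RateLimitInfo record and the parsing helper below are illustrative only, not existing KM or SK types:

using System;
using System.Linq;
using System.Net.Http;

// Illustrative container for the headers listed above; not an existing KM/SK type.
public sealed record RateLimitInfo(int? RemainingRequests, int? RemainingTokens, string? TokensReset);

public static class RateLimitHeaders
{
    public static RateLimitInfo Parse(HttpResponseMessage response) => new(
        ReadInt(response, "x-ratelimit-remaining-requests"),
        ReadInt(response, "x-ratelimit-remaining-tokens"),
        ReadRaw(response, "x-ratelimit-reset-tokens")); // e.g. "6m0s"; duration parsing omitted here

    private static int? ReadInt(HttpResponseMessage response, string name) =>
        response.Headers.TryGetValues(name, out var values)
        && int.TryParse(values.FirstOrDefault(), out var parsed)
            ? parsed
            : (int?)null;

    private static string? ReadRaw(HttpResponseMessage response, string name) =>
        response.Headers.TryGetValues(name, out var values) ? values.FirstOrDefault() : null;
}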

So in theory we could use this per-response information to decide when to scale the embedding speed up or down, making sure we use the service at maximum throughput without abusing it. And don't forget, it would be extremely useful when multiple KM instances run at the same time: each instance learns the current rate-limit state from its own responses, so we don't need to propagate that knowledge across the distributed KMs.
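As a rough illustration of that scale-up/scale-down idea, a shared throttle could delay the next request whenever the remaining-token budget reported by the last response runs low; the threshold, the fixed delay, and the RateLimitInfo type from the previous sketch are all assumptions:

using System;
using System.Threading;
using System.Threading.Tasks;

// Rough sketch: each worker consults the most recent rate-limit snapshot
// before issuing the next embedding request. Threshold and delay are made up.
public sealed class EmbeddingThrottle
{
    private RateLimitInfo _latest = new(null, null, null);

    public void Update(RateLimitInfo info) => Interlocked.Exchange(ref _latest, info);

    public async Task WaitIfNeededAsync(int estimatedTokens, CancellationToken ct)
    {
        RateLimitInfo snapshot = Volatile.Read(ref _latest);
        if (snapshot.RemainingTokens is int remaining && remaining < estimatedTokens * 2)
        {
            // Not enough budget left: back off until the token window recovers.
            // A fixed 1s wait stands in for parsing "x-ratelimit-reset-tokens".
            await Task.Delay(TimeSpan.FromSeconds(1), ct);
        }
    }
}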

We probably don't need to go this far for the Chat APIs or chat use cases, but it is very applicable and valuable for embedding scenarios.

Proposed solution

Here are the things that would need to be implemented to achieve what I described above, if we decide to do it:

  • Expose the x-ratelimit-* headers in the OpenAIClientCore class in Microsoft.SemanticKernel.Connectors.OpenAI for the Embedding APIs (nice to have for the Chat APIs as well) => This requires changes in the SK repo (but I know @dluc you are the architect for both SK and KM, so I am not going to create a new issue there :)).
  • Surface the above headers/info in OpenAITextEmbeddingGenerationService and AzureOpenAITextEmbeddingGenerationService.
  • From here we could take different approaches:
    • Plan A: Implement the rate-limiting logic in the GenerateEmbeddingsAsync API of the TextEmbeddingGenerationService itself, using the x-ratelimit-* information. Basically, it should wait for some time before actually invoking the OpenAI API if there aren't many tokens remaining or the current rate is too high. Then we could blindly convert the existing foreach loop in GenerateEmbeddingsHandler into a Parallel.ForEach loop (see the sketch after this list).
    • Plan B: Instead of implementing the logic inside the GenerateEmbeddingsAsync API, we could implement the rate-limiting logic inside GenerateEmbeddingsHandler so that GenerateEmbeddingsAsync stays lightweight. This approach requires a rewrite of the embedding logic, ideally converting the foreach loop into a queue-based ingestion loop whose flow speed is controlled by the x-ratelimit-* information.
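For Plan A, one possible shape is a decorator that gates every call through such a throttle before delegating to the real generator. The interface below is a simplified stand-in, not KM's actual ITextEmbeddingGenerator signature, and it reuses the hypothetical EmbeddingThrottle from the earlier sketch:

using System.Threading;
using System.Threading.Tasks;

// Simplified stand-in for KM's embedding generator abstraction; the real
// ITextEmbeddingGenerator interface has a richer signature.
public interface ISimpleEmbeddingGenerator
{
    Task<float[]> GenerateEmbeddingAsync(string text, CancellationToken ct);
}

// Plan A sketch: wait on the shared throttle before each call, so that
// GenerateEmbeddingsHandler can run its loop in parallel without having to
// track rate limits itself.
public sealed class ThrottledEmbeddingGenerator : ISimpleEmbeddingGenerator
{
    private readonly ISimpleEmbeddingGenerator _inner;
    private readonly EmbeddingThrottle _throttle;

    public ThrottledEmbeddingGenerator(ISimpleEmbeddingGenerator inner, EmbeddingThrottle throttle)
    {
        _inner = inner;
        _throttle = throttle;
    }

    public async Task<float[]> GenerateEmbeddingAsync(string text, CancellationToken ct)
    {
        int estimatedTokens = text.Length / 4; // crude estimate; KM's tokenizer would be used in practice
        await _throttle.WaitIfNeededAsync(estimatedTokens, ct);

        // After the call, the wrapped client would feed the fresh x-ratelimit-*
        // values back into the throttle via _throttle.Update(...).
        return await _inner.GenerateEmbeddingAsync(text, ct);
    }
}

Either way, the two plans differ mainly in where this gate lives: inside the generator (Plan A) or inside the handler's ingestion queue (Plan B).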

Importance

would be great to have

0x7c13 • Mar 23 '24 12:03