
Azure OpenAI streaming is slow and chunky

Open christophmeise opened this issue 11 months ago • 17 comments

Description

I created a test environment with the latest Next.js and AI SDK (App Router) that makes the same streaming request to both OpenAI and Azure OpenAI.

The OpenAI endpoint streams very smoothly token by token.

The Azure OpenAI endpoint, with the same code, streams very slowly and in big chunks of multiple words or sentences.

This creates a bad UX and feels slow.

I also tried streaming with Langchain and Azure OpenAI but had the same results.

BUT: this is not an Azure failure, because I also tested with Nest.js and an SSE streaming endpoint. With Nest.js I also get very fast and fluent streaming from Azure. This only happens with the Vercel AI SDK right now.

With OpenAI (expected): https://github.com/vercel/ai/assets/12641968/8a969853-19d6-48ba-930f-74593170f1d1

With Azure OpenAI (slow): https://github.com/vercel/ai/assets/12641968/eef08aa2-dc7c-4e44-a52e-673f2ef9446e

Code example

import { AzureKeyCredential, OpenAIClient } from "@azure/openai";
import { OpenAIStream, StreamingTextResponse } from "ai";

import { env } from "~/env.mjs";

const client = new OpenAIClient(
  `https://${env.AZURE_OPENAI_API_INSTANCE_NAME}.openai.azure.com/`,
  new AzureKeyCredential(env.AZURE_OPENAI_API_KEY!),
);

export const runtime = "edge";

export async function POST(req: Request) {
  // const { messages } = await req.json();
  const response = await client.streamChatCompletions(
    env.AZURE_OPENAI_API_DEPLOYMENT_NAME!,
    [
      {
        content: "What is the meaning of life?",
        role: "user",
      },
    ],
    {
      temperature: 0.4,
      topP: 0.98,
      maxTokens: 4096,
    },
  );
  const stream = OpenAIStream(response);
  return new StreamingTextResponse(stream);
}

Additional context

"dependencies": { ... "@azure/openai": "1.0.0-beta.11" "ai": "^3.0.2", "next": "^14.1.1", ... }

christophmeise avatar Mar 03 '24 18:03 christophmeise

I believe it's known that Azure streams in larger chunks than OpenAI. Maybe we can provide a utility here, but the best solution is for your client not to rely on the server's chunking strategy if you want a consistent feel.

https://learn.microsoft.com/en-us/answers/questions/1359927/azure-openai-api-with-stream-true-does-not-give-ch
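
For anyone who wants to try that, a minimal client-side smoothing sketch could look like the following - the function name, the 15 ms interval, and the onUpdate callback are illustrative assumptions, not AI SDK APIs:

// Buffer whatever the server sends and reveal it at a steady character rate,
// so the UI does not depend on the server's chunk sizes.
export function createSmoother(
  onUpdate: (visibleText: string) => void,
  intervalMs = 15,
) {
  let buffer = "";   // full text received so far
  let shown = 0;     // number of characters currently revealed
  const timer = setInterval(() => {
    if (shown < buffer.length) {
      shown += 1;
      onUpdate(buffer.slice(0, shown));
    }
  }, intervalMs);
  return {
    push(delta: string) { buffer += delta; }, // call for every incoming chunk
    stop() { clearInterval(timer); },         // call once the stream is done
  };
}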

MaxLeiter avatar Mar 03 '24 19:03 MaxLeiter

Thank you! I know that it is possible with Azure because I have built a Nest.js endpoint that streams from Azure just fine with the exact same model and settings.

I used an endpoint with Server-Sent Events (SSE) and manually parsed the stream using microsoft/fetch-event-source in the client. The stream of tokens is super quick and steady, but the events arrive unsorted - that means I needed to fix the ordering client-side so that the response is correct in the UI.
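
For reference, the client side of that approach could look roughly like this - a sketch that assumes the Nest.js endpoint emits JSON events shaped like { index, token }; the field names, the /api/chat path, and the render helper are hypothetical:

import { fetchEventSource } from "@microsoft/fetch-event-source";

const received = new Map<number, string>();
const render = (text: string) => {
  // hypothetical UI update; replace with your own state handling
  document.querySelector("#answer")!.textContent = text;
};

await fetchEventSource("/api/chat", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ question: "What is the meaning of life?" }),
  onmessage(event) {
    const { index, token } = JSON.parse(event.data);
    received.set(index, token);
    // Events can arrive out of order, so reassemble by index before rendering.
    const text = [...received.keys()]
      .sort((a, b) => a - b)
      .map((key) => received.get(key))
      .join("");
    render(text);
  },
});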

I am currently trying to transition from my own Nest.js backend to Vercel AI, but the streaming is not working as before.

christophmeise avatar Mar 03 '24 19:03 christophmeise

Was this happening when streaming RSC? I noticed the same behaviour; here's a comparison video between RSC and text streaming + a client component: https://x.com/valtterikaresto/status/1764412056948576712

Streaming text and rendering client component (right side of the video) seems much smoother for some reason.

valstu avatar Mar 04 '24 08:03 valstu

This was happening on a client component. The component itself is always the same in my tests, and it worked fine with OpenAI and with Nest.js + Langchain + Azure. Only when I use Azure with the Vercel AI SDK do I get the slow behaviour.

I also tried logging the token stream in the "handleLLMNewToken" callback handler. Even there, the tokens come in slowly and in big chunks. So the problem is not in rendering; it is in the streaming implementation.
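
For anyone who wants to reproduce that check, a minimal version of the logging could look like this (assuming @langchain/openai; the deployment name is a placeholder). Timestamping each token makes it obvious whether the chunking happens before rendering:

import { ChatOpenAI } from "@langchain/openai";

const model = new ChatOpenAI({
  azureOpenAIApiDeploymentName: "xxx",
  streaming: true,
  callbacks: [
    {
      // Log the arrival time of every token so chunk boundaries and delays
      // are visible in the server logs, independent of the UI.
      handleLLMNewToken(token: string) {
        console.log(Date.now(), JSON.stringify(token));
      },
    },
  ],
});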

christophmeise avatar Mar 04 '24 08:03 christophmeise

This streaming behavior is due to Azure content filtering ([Azure content filter](https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/content-filter?tabs=warning%2Cpython)). It can add delay and is also a contributor to why the tokens are received in much larger chunks. We have reached out to Microsoft on the issue, but to no avail.

We came up with a temporary "fix". One approach was to read the received tokens into an array on the client and then push them out little by little to imitate a smoother streaming effect. However, this caused some weird UI bugs.

So now we instead read the received tokens into a Uint8Array[] on the server, use Transform and TransformCallback from node:stream, and calculate how fast the tokens should be emitted to the client, since the tokens from Azure OpenAI can vary in speed. It is quite hard to implement, but it is possible. If you need help with it, feel free to reach out to me on LinkedIn or here :)
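
To make that idea more concrete, here is a rough sketch of the general approach (not their exact implementation), assuming a Node.js runtime rather than edge; the piece size and delay are arbitrary illustrative values:

import { Transform, TransformCallback } from "node:stream";

// Split each large chunk into smaller pieces and emit them with a short
// delay, so the client sees a steadier token flow instead of big bursts.
export function smoothingTransform(pieceSize = 8, delayMs = 25): Transform {
  const decoder = new TextDecoder();
  const encoder = new TextEncoder();
  return new Transform({
    transform(chunk: Buffer, _encoding, callback: TransformCallback) {
      const text = decoder.decode(chunk, { stream: true });
      let delay = 0;
      for (let i = 0; i < text.length; i += pieceSize) {
        const piece = text.slice(i, i + pieceSize);
        setTimeout(() => this.push(encoder.encode(piece)), delay);
        delay += delayMs;
      }
      // Signal completion only after the last piece has been scheduled,
      // which also paces how quickly upstream chunks are pulled in.
      setTimeout(() => callback(), delay);
    },
  });
}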

ElectricCodeGuy avatar Mar 04 '24 09:03 ElectricCodeGuy

Sorry, but this seems wrong.

Again: Azure streams correctly when I stream via Langchain on my node backend - tokens come bit by bit, not chunky at all. I use the same content filter and deployment.

How can this be an Azure problem when it works for me on Node but not with this SDK?

christophmeise avatar Mar 04 '24 12:03 christophmeise

I can confirm this even happens in the Azure OpenAI Playground, so it is indeed an Azure issue. I have access to customize the content filters (had better luck than @ElectricCodeGuy), but no matter what their configuration is, the streaming is still slow and chunky. See video below.

Has anyone implemented a solution other than creating another stream that splits the chunks into letters or a few characters to emulate smoother streaming (at the expense of higher initial latency and more code)?

https://github.com/vercel/ai/assets/18277423/038bbdb6-008b-4810-9729-703fc9cc2517

afbarbaro avatar Mar 05 '24 13:03 afbarbaro

Yes, I see that it also happens in the Azure playground - which indicates, but does not prove, that it is an Azure problem.

Has anyone tried to use @langchain/openai and just call Azure directly? Here is a video from a part of our app that uses Langchain + Azure without any chunk splitting and with zero additional latency.

https://github.com/vercel/ai/assets/12641968/01f72745-5056-4754-b892-ad465ea097ed

Here is a screenshot from the handleLLMNewToken console logs:

Here is the code snippet showing how I stream from Azure with no problems, as you can see in the video & logs:

const model = new ChatOpenAI({
      temperature: 0.6,
      topP: 0.96,
      maxTokens: -1,
      modelName: "gpt-4",
      azureOpenAIApiDeploymentName: "xxx", // using the same azure model like with vercel sdk
    }).bind(modelParams);

 const chain = prompt
      .pipe(model as any)
      .pipe(new JsonOutputFunctionsParser()); // using json because I have an array but works with all parsers

const stream = await chain.stream({
 some_context: xxx
});

return new Observable((subscriber) => {
  (async () => {
    let hooks;
    for await (const chunk of stream) {
      console.log(chunk); // this is from the logs screenshot
      subscriber.next({ data: chunk }); // i just pass it to the endpoint and display in the UI
      hooks = (chunk as any).hooks;
    }

    // ...
  })();
});

Maybe I am missing something, but this is working, and it looks like a working solution for the exact problem of this thread.

If yes -> this is not an Azure problem and the Vercel SDK has an issue with the streaming.
If no -> I would appreciate an explanation of why I don't have this problem with my own endpoint and why the same can't be done in the SDK.

christophmeise avatar Mar 07 '24 14:03 christophmeise

It is an Azure problem; it's related to content filtering. It's possible to turn content filtering off, or use an async mode in the Azure settings, but that is limited access and only available to managed customers.

But I think the AI SDK also has a problem: I noticed that the new RSC demo shows chunking instead of a smooth flow of tokens.

Do you notice the same issue here: https://sdk.vercel.ai/demo

kyb3r avatar Mar 08 '24 06:03 kyb3r

This is a problem I've been experiencing for months; I've tried to talk to Microsoft with no luck as well.

It's been quite hard to debug, as I'm not sure where that chunky streaming originates, haha.

@ElectricCodeGuy, @christophmeise I am really curious about both your custom integrations.

We are similarly using langchain/openai, e.g.:

export const streamingModel = new ChatOpenAI({
  // modelName: "gpt-4",
  azureOpenAIApiDeploymentName: "gpt4",
  streaming: true,
  temperature: 0,
  tags: ["GPT-4 Streaming"]
});

but we're calling the model via LangChain's LLMChain (import { LLMChain } from "langchain/chains"):

answerWithContextChain.stream({
      chatHistory,
      context: vectorResults,
      question: sanitizedQuestion,
    }, {
      callbacks: [handlers, runCollector]
    });

and then returning to the FE with return new StreamingTextResponse(stream, {}, data);

JoshFriedmanO3 avatar Mar 13 '24 18:03 JoshFriedmanO3

I have the same problem. Any news?

JakobStadlhuber avatar Mar 13 '24 23:03 JakobStadlhuber

@JoshFriedmanO3 As I said in my comment - it works for us when we don't use Vercel and use our own Node endpoint with SSE. We have no workaround or custom buffering for the tokens - it works out of the box.

I am still wondering why everyone is convinced that it is an issue with Azure and not Vercel.

christophmeise avatar Mar 14 '24 10:03 christophmeise

We do not use Vercel and have the same problem using the SDK directly in Kotlin. We also see the same behaviour in the Playground.

JakobStadlhuber avatar Mar 14 '24 10:03 JakobStadlhuber

Yeah, I've basically narrowed it down (at least in my env) to the point at which the SDK is making requests to the API. The streaming from there to the FE actually works for each token / character, but there is around a 2-second delay for each chunk that is sent by Azure's API.

Basically it's coming through like this:

I
I am
I am working
I am working fine

---- 2 second delay

I am working fine. Are
I am working fine. Are you
I am working fine. Are you working
I am working fine. Are you working fine?

---- 2 second delay

I am working fine. Are you working fine? Finished
I am working fine. Are you working fine? Finished streaming.

Regardless of Vercel's SDK, I can recreate the streaming issues in Azure's playground.

@christophmeise So that's all I've been able to go off of; the issue may be in Vercel, but I have not been able to identify what would cause it.

Clearly the custom streaming integration you wrote is solving this issue, but then why would the problem also show up in the Azure playground? Quite a confusing issue, haha.

For some clarity, here's the video I sent to Microsoft when I was chatting with them: https://www.loom.com/share/31a36da7ded44881a2a720ebdae0346d?sid=957bead8-1831-49fc-892d-9ac1b7d1b1cb

It clearly shows what's happening.

JoshFriedmanO3 avatar Mar 14 '24 15:03 JoshFriedmanO3

@JoshFriedmanO3 we have the exact same experience.

JakobStadlhuber avatar Mar 15 '24 14:03 JakobStadlhuber

Same here 👍

emrahtoy avatar Mar 26 '24 09:03 emrahtoy

I am streaming to a Teams bot and I have the exact same problem.

red-hunter avatar Mar 28 '24 21:03 red-hunter

Following for updates

faisal-saddique avatar Apr 02 '24 21:04 faisal-saddique

From what I have been discussing with folks at Microsoft, their recommendation is to use provisioned throughput units (PTUs); this secures output quota and should fix the chunkiness of the output. I am not fully convinced. Has anyone tried this? Here's the documentation: https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/provisioned-throughput Might this be the root cause?

kemeny avatar Apr 10 '24 13:04 kemeny

So, that might not be the solution at all. I did go through another similar issue on the Microsoft forum, where they mentioned that content filtering limits streaming throughput.

In the OAI Azure portal, within Content Filtering (Preview), there's an option to set Streaming mode from Default to Asynchronous Modified Filter. However, this requires approval from Microsoft to activate.

kemeny avatar Apr 10 '24 14:04 kemeny

@kemeny This is further than I was able to get with them. I still do think it's related to the content filtering. The streamed token has a ton of JSON wrapped around it for each token, which makes me think the pre-processing is causing that chunkiness. I was also curious about the PTUs, but don't have the ability to get them. Same for the content filtering; I was not allowed access.

Really just an overall frustrating experience. Legit no other cloud platform has streaming issues like Azure.

JoshFriedmanO3 avatar Apr 10 '24 14:04 JoshFriedmanO3

What about https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/content-filter?tabs=warning%2Cpython#asynchronous-modified-filter?

EDIT: There are some drawbacks to be carefully reviewed: "Customers must be aware that while the feature improves latency, it's a trade-off against the safety and real-time vetting of smaller sections of model output."

vhiairrassary avatar Apr 11 '24 07:04 vhiairrassary

I heard that buying PTUs does actually improve the performance. In the same breath, I also heard that GPT-4 requires a lot of PTUs, and pricing suddenly climbs to five figures per month.

Also, it seems quite difficult to get access to the modified content filter program, so our hands are pretty much tied at this point.

valstu avatar Apr 11 '24 07:04 valstu

I have put together a small workaround for this behavior.

Basically, this function:

export function logStream(originalStream: ReadableStream) {
  // Tee the stream so the original is left intact, then re-emit each chunk
  // from one branch with a small delay to smooth out Azure's large chunks.
  const [loggedStream, loggingStream] = originalStream.tee();
  return new ReadableStream({
    async start(controller) {
      const reader = loggingStream.getReader();
      async function read() {
        const { done, value } = await reader.read();
        if (done) {
          controller.close();
          return;
        }
        controller.enqueue(value);
        // Wait 80 ms between chunks so the client sees a steadier flow.
        await new Promise(resolve => setTimeout(resolve, 80));
        read();
      }
      read();
    }
  });
}

After that, change

return new StreamingTextResponse(stream);

to

return new StreamingTextResponse(logStream(stream));

allancarvalho avatar Apr 16 '24 16:04 allancarvalho

@allancarvalho Yeah, this is a great solution, considering we can't get the performance we want unless we turn off content filtering and get the async streaming.

Thank you for this!

JoshFriedmanO3 avatar Apr 16 '24 16:04 JoshFriedmanO3

I am having the same problem with the Python SDK. If we managed to disable the content filter, would it be less safe than using OpenAI directly instead of Azure, or about the same?

vashat avatar Apr 24 '24 18:04 vashat

I'm closing this issue as it seems pretty obvious this isn't an AI SDK issue. Thank you all for the feedback and the investigation!

MaxLeiter avatar May 11 '24 21:05 MaxLeiter