cohere-typescript UTF-8 streams might break in current stream implementation

I'm using this package in a Node.js environment and calling cohere.chatStream to generate long Chinese texts. However, the replacement character (�) appears at random places in the 'stream-end' event. The following code appears to convert incomplete UTF-8 chunks, which the stream yields in parts, into strings:

https://github.com/cohere-ai/cohere-typescript/blob/40c146c396dd5f4e9c079a9f99f11b3b7c48208e/src/core/streaming-fetcher/Stream.ts#L29-L32

While Basic Latin characters require only 1 byte in UTF-8, other characters need 2 to 4 bytes (CJK characters typically need 3). This means a character can be split across chunks, and decoding each chunk on its own turns the partial bytes into replacement characters.
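To illustrate, here is a minimal sketch of the failure mode and the usual workaround, assuming the chunks arrive as Uint8Array. `decodeChunks` is a hypothetical helper written for this example and is not part of cohere-typescript; it only uses the standard TextEncoder/TextDecoder globals available in Node.js.

```typescript
// Hypothetical helper (not part of cohere-typescript): decodes a list of
// UTF-8 byte chunks either naively (each chunk on its own) or in streaming mode.
function decodeChunks(chunks: Uint8Array[], streaming: boolean): string {
    const decoder = new TextDecoder("utf-8");
    let text = "";
    for (const chunk of chunks) {
        if (streaming) {
            // { stream: true } makes the decoder hold back an incomplete
            // trailing byte sequence until the next chunk arrives.
            text += decoder.decode(chunk, { stream: true });
        } else {
            // A fresh decoder per chunk treats partial sequences as errors
            // and emits U+FFFD for them.
            text += new TextDecoder("utf-8").decode(chunk);
        }
    }
    if (streaming) {
        // Flush anything still buffered at the end of the stream.
        text += decoder.decode();
    }
    return text;
}

// "你好" is 6 bytes in UTF-8; splitting after byte 4 cuts "好" in the middle.
const bytes = new TextEncoder().encode("你好");
const chunks = [bytes.slice(0, 4), bytes.slice(4)];

const naive = decodeChunks(chunks, false);
const safe = decodeChunks(chunks, true);

console.log(JSON.stringify(naive));
console.log(JSON.stringify(safe));
```

The naive result contains replacement characters where the split character was, while the streaming decoder reassembles it across the chunk boundary.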

Apr 11 '24 17:04 lazydogP

Hey @lazydogP many thanks for this interesting find! We have repro'd it and will have a fix for you asap.

Apr 11 '24 20:04 billytrend-cohere

@lazydogP we're planning to move to SSE to fix this issue for you! thanks for your patience

Apr 17 '24 20:04 billytrend-cohere

Is this resolved yet?

May 30 '24 15:05 danny-avila

Still happening on the latest version. It seems we can't rely on the stream-end event to avoid the issue:

https://github.com/danny-avila/LibreChat/pull/2922/commits/fe93f3a9688e48536ffc7e319be3b0d9c31243ea

May 30 '24 16:05 danny-avila

Hey all, this issue is now resolved in our v2 chat because we have switched to SSE, which makes the streams easier to parse. You can use it as follows:

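    import { CohereClient } from "cohere-ai";

    // Assumes the API key is available in the CO_API_KEY environment variable.
    const cohere = new CohereClient({ token: process.env.CO_API_KEY });

    const stream = await cohere.v2.chatStream({
        model: "command-r",
        messages: [{ role: "user", content: "give me lots of emojis" }],
    });

    for await (const chat of stream) {
        if (chat.type === "content-delta") {
            // Fall back to an empty string instead of casting to `any`.
            process.stdout.write(chat.delta?.message?.content?.text ?? "");
        }
    }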

Oct 03 '24 11:10 billytrend-cohere