cohere-typescript
UTF-8 streams might break in current stream implementation
I'm using this package in a Node.js environment and calling cohere.chatStream to generate long Chinese texts. However, the replacement character (�) appears at random places in the 'stream-end' event. The following code may be converting incomplete UTF-8 chunks, which the stream yields in parts, into strings.
https://github.com/cohere-ai/cohere-typescript/blob/40c146c396dd5f4e9c079a9f99f11b3b7c48208e/src/core/streaming-fetcher/Stream.ts#L29-L32
While Basic Latin characters require only 1 byte in UTF-8, other characters, such as CJK characters, need more bytes to encode. This means a character can be split across chunks, and decoding each chunk on its own turns the incomplete byte sequence into the replacement character.
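A minimal sketch of the failure mode and one common way around it, assuming the chunks arrive as Uint8Array values. The helper names decodeNaively and decodeStreaming are hypothetical and are not part of cohere-typescript; the only API relied on is the standard TextDecoder, whose `stream: true` option buffers an incomplete trailing byte sequence until the next chunk arrives. This is not necessarily how the library fixed or will fix the problem.

```typescript
// Hypothetical helper: decodes every chunk independently, so a multi-byte
// character split across two chunks is decoded as invalid input.
function decodeNaively(chunks: Uint8Array[]): string {
    let out = "";
    for (const chunk of chunks) {
        // A fresh decoder per chunk keeps no state between chunks.
        out += new TextDecoder("utf-8").decode(chunk);
    }
    return out;
}

// Hypothetical helper: reuses one decoder and passes { stream: true } so
// incomplete trailing bytes are held back until the following chunk.
function decodeStreaming(chunks: Uint8Array[]): string {
    const decoder = new TextDecoder("utf-8");
    let out = "";
    for (const chunk of chunks) {
        out += decoder.decode(chunk, { stream: true });
    }
    // Final call without arguments flushes anything still buffered.
    out += decoder.decode();
    return out;
}

// "你好" is 6 bytes in UTF-8 (3 per character); split inside the first character.
const encoded = new TextEncoder().encode("你好");
const splitChunks = [encoded.slice(0, 2), encoded.slice(2)];

console.log(decodeNaively(splitChunks).includes("\uFFFD")); // replacement character present
console.log(decodeStreaming(splitChunks) === "你好"); // original text recovered
```

The streaming variant keeps a single decoder for the lifetime of the response, so the chunk boundaries chosen by the network layer no longer affect the decoded text.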
Hey @lazydogP many thanks for this interesting find! We have repro'd it and will have a fix for you asap.
@lazydogP we're planning to move to SSE to fix this issue for you! thanks for your patience
Is this resolved yet?
Still happening on the latest version. It seems we can't rely on the stream-end event to avoid the issue:
https://github.com/danny-avila/LibreChat/pull/2922/commits/fe93f3a9688e48536ffc7e319be3b0d9c31243ea
Hey all, this issue is now resolved in our v2 chat because we have switched to SSE, which makes the streams easier to parse. You can use it as follows:
const stream = await cohere.v2.chatStream({
    model: "command-r",
    messages: [{ role: "user", content: "give me lots of emojis" }],
});
for await (const chat of stream) {
    if (chat.type === "content-delta") {
        // Fall back to an empty string when a delta carries no text.
        process.stdout.write(chat.delta?.message?.content?.text ?? "");
    }
}