cohere-python

Unicode errors (e.g., f��r instead of für) when using "fast" citation quality in RAG flow

Open ulan-yisaev opened this issue 1 year ago • 2 comments

SDK Version (required)
5.9.1

Describe the bug
When using the "fast" option for citation_quality in the RAG flow with the Cohere API, Unicode errors occur, but only for characters that are split across multiple tokens in a streaming response. Multi-byte characters such as ü, ö, and ä are then displayed as the Unicode replacement character (�, U+FFFD), even though the same characters are displayed correctly elsewhere in the response. The issue disappears when citation_quality is set to "accurate".

Examples:

Incorrect encoding in citation_quality: "fast"

  1. Lösungen und beschäftigen:

    • Expected text: "Lösungen und beschäftigen"
    • Received stream:
      {"message": "L\ufffd"}
      {"message": "\ufffdsung"}
      {"message": "en u"}
      {"message": "nd besch\ufffd"}
      {"message": "\ufffdfti"}
      {"message": "gen m"}
      
  2. Körpertemperatur:

    • Expected text: "Körpertemperatur"
    • Received stream:
      {"message": "e K\ufffd"}
      {"message": "\ufffdrpe"}
      {"message": "rtemp"}
      {"message": "erat"}
      {"message": "ur"}
      
  3. Frühsommer:

    • Expected text: "Frühsommer"
    • Received stream:
      {"message": "Fr"}
      {"message": "\ufffd"}
      {"message": "\ufffdh"}
      {"message": "so"}
      {"message": "mme"}
      {"message": "r"}
      

As shown, characters like ö, ü, and ä are split between messages, and each half is replaced with \ufffd, the Unicode replacement character that marks an invalid or undecodable byte sequence.
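
The pattern in these streams (one \ufffd at the end of a message and another at the start of the next) is consistent with a two-byte UTF-8 sequence being cut at a chunk boundary and each chunk being decoded on its own. The following is a minimal Python sketch of that effect, assuming this is the cause; the report does not confirm where the decoding happens, and the chunk boundary below is made up for illustration.

```python
import codecs

text = "Frühsommer"
data = text.encode("utf-8")  # "ü" becomes the two bytes 0xC3 0xBC

# Hypothetical chunk boundary that falls between the two bytes of "ü".
chunks = [data[:3], data[3:]]

# Decoding each chunk independently turns each orphaned byte into U+FFFD.
naive = [c.decode("utf-8", errors="replace") for c in chunks]

# An incremental decoder keeps the incomplete sequence buffered
# until the next chunk supplies the missing byte.
decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
buffered = [decoder.decode(c) for c in chunks]
buffered.append(decoder.decode(b"", final=True))  # flush any remainder

print(naive)
print("".join(buffered))
```

The naive path reproduces the shape of the corrupted messages above, while the buffered path reassembles the original word. This only illustrates the mechanism; it cannot repair text that has already arrived with \ufffd in it, because the original bytes are gone by then.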

Expected Behavior
Special characters should be handled and displayed correctly, even when split across tokens in a streaming response, regardless of the citation_quality setting.

Actual Behavior
When citation_quality is set to "fast", characters that are split across tokens (especially multi-byte characters like ü, ö, ä) are incorrectly displayed as the Unicode replacement character (� or \ufffd).
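
To collect affected chunks for a bug report, a small check along these lines could be run over the streamed message texts. This is only a sketch: `find_corrupted_chunks` is a hypothetical helper, not part of the Cohere SDK, and it assumes the chunk texts have already been gathered into a list.

```python
REPLACEMENT = "\ufffd"  # Unicode replacement character

def find_corrupted_chunks(chunks):
    """Return (index, chunk) pairs for chunks that contain U+FFFD."""
    return [(i, c) for i, c in enumerate(chunks) if REPLACEMENT in c]

# Chunks copied from the "Körpertemperatur" example in this report.
received = ["e K\ufffd", "\ufffdrpe", "rtemp", "erat", "ur"]

for index, chunk in find_corrupted_chunks(received):
    print(index, repr(chunk))
```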

Screenshots
N/A

Workaround
Setting citation_quality to "accurate" resolves the issue, but at the cost of performance.

ulan-yisaev avatar Sep 11 '24 12:09 ulan-yisaev

Hi - I've failed to reproduce this issue - could you share a sample request that consistently reproduces the issue and I can investigate more? Thanks!

daniel-cohere avatar Sep 25 '24 19:09 daniel-cohere

We're also experiencing this issue, but only after using the Cohere model in the Azure marketplace.

Previously we used the model directly from Cohere. I remember we did have the Unicode errors described above at first, but the latest version was fine (in fast mode).

It's also worth noting that the citation start and end ranges are offset once the above bug is encountered.
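
One possible explanation for the offset, offered as an assumption rather than a confirmed cause: each corrupted character arrives as two \ufffd characters instead of one, so the received text grows by one character per corrupted umlaut. A short Python sketch using the "Lösungen und beschäftigen" example from the original report, with a hypothetical citation span computed against the intended text:

```python
expected = "Lösungen und beschäftigen"
# Text as reassembled from the corrupted stream in the original report.
received = "L\ufffd\ufffdsungen und besch\ufffd\ufffdftigen"

# Hypothetical citation span, computed against the intended text.
word = "beschäftigen"
start = expected.index(word)
end = start + len(word)

print(repr(expected[start:end]))       # span applied to the intended text
print(repr(received[start:end]))       # same span applied to the corrupted text
print(len(received) - len(expected))   # extra characters introduced
```

Applied to the corrupted text, the same span starts one character early and ends two characters short, which matches the kind of drift described here.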

frankpepermans avatar Sep 28 '24 12:09 frankpepermans

@frankpepermans @ulan-yisaev we are still unable to reproduce the issue. Do you have a request we could try to reproduce it with?

mkozakov avatar Nov 22 '24 02:11 mkozakov

Closing as stale

mkozakov avatar May 06 '25 15:05 mkozakov