Non ASCII characters are stripped from prompt

Open AlexandreDey opened this issue 1 year ago • 2 comments

Expected Behavior

When a prompt contains non-ASCII characters, all of them should be passed to the model.

Current Behavior

Non-ASCII characters are stripped from the prompt. For languages with accents (e.g. French), this encourages the model to imitate text in which the non-ASCII characters are missing.

Steps to Reproduce


  1. Provide a prompt or a personality instruction that contains accented characters
  2. Look in the logs for "Received message :" printed before generation starts

Possible Solution

Unknown. Does something happen before start_message_generation is called in api/__init__.py?

Context

Tried to generate text with instructions in French.

The database contains the accents and the personality loads them correctly, but the content passed to start_message_generation has had its accents stripped.

NOTE: installed in Docker

Screenshots

N/A

AlexandreDey avatar Sep 28 '23 08:09 AlexandreDey

The problem is in the regex of the clean_string function. With the line changed to:

pattern = f'[^a-zA-Z0-9\u00C0-\u017F\s{re.escape(punctuation_chars)}]'

accents are no longer removed and the model behaves correctly.
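To illustrate the effect of that change, here is a minimal sketch. It is not the project's actual clean_string: the ASCII-only pattern is a hypothetical stand-in for the original line, punctuation_chars is assumed to be string.punctuation, and the helper name strip_disallowed is made up for this example. The patterns are written as raw f-strings so that `\s` and `\u00C0` reach the regex engine unchanged.

```python
import re
import string

# Assumption: punctuation_chars is ASCII punctuation; the thread does not show its value.
punctuation_chars = string.punctuation

# Hypothetical ASCII-only pattern, standing in for the line before the change.
ascii_only = re.compile(rf'[^a-zA-Z0-9\s{re.escape(punctuation_chars)}]')

# Pattern from the comment above: additionally keeps U+00C0 to U+017F
# (Latin-1 Supplement letters and Latin Extended-A).
with_accents = re.compile(rf'[^a-zA-Z0-9\u00C0-\u017F\s{re.escape(punctuation_chars)}]')


def strip_disallowed(text, pattern):
    """Delete every character matched by the negated character class."""
    return pattern.sub('', text)


prompt = "Écris un résumé, s'il te plaît."
before = strip_disallowed(prompt, ascii_only)    # accented letters are dropped
after = strip_disallowed(prompt, with_accents)   # prompt is left unchanged

print(before)
print(after)
```

Note that the added range only covers accented Latin letters. Characters from other scripts (for example Cyrillic or CJK) would still be removed by this pattern.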

I can create a pull request soon if you'd like.

AlexandreDey avatar Sep 28 '23 10:09 AlexandreDey

Thanks a lot. This is now fixed in V6.7alpha1.

ParisNeo avatar Oct 05 '23 18:10 ParisNeo