
utf-8 codec error: surrogates not allowed

Open tboby opened this issue 1 year ago • 3 comments

Sending certain UTF-8 characters to LLM causes a vague fatal error.

Minimal repro (note: copy the text exactly; the quotes aren't ASCII):

echo “for display” | llm -m flash2
Error: 'utf-8' codec can't encode character '\udc9d' in position 56: surrogates not allowed
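
One possible explanation for the '\udc9d' in this message, offered as an assumption rather than something confirmed in this thread: the closing curly quote is the UTF-8 byte sequence E2 80 9D, and byte 0x9D has no mapping in cp1252, a common Windows code page. If the piped bytes were decoded as cp1252 with Python's surrogateescape error handler, that byte would become the lone surrogate '\udc9d', which cannot be encoded back to UTF-8. A minimal sketch reproducing the same message outside LLM:

```python
# Sketch under a stated assumption: the input bytes are UTF-8 but get
# decoded as cp1252 with errors="surrogateescape". Not confirmed to be
# what LLM or its dependencies actually do.
raw = "\u201d".encode("utf-8")  # right double quotation mark: b'\xe2\x80\x9d'

# 0xE2 and 0x80 map to real cp1252 characters; 0x9D does not, so
# surrogateescape smuggles it through as the lone surrogate U+DC9D.
decoded = raw.decode("cp1252", errors="surrogateescape")

try:
    decoded.encode("utf-8")
    error_message = None
except UnicodeEncodeError as exc:
    error_message = str(exc)

print(ascii(decoded))
print(error_message)
```

If this is the mechanism, the input itself is valid UTF-8 and the damage happens when it is decoded with the wrong code page.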

Pretty much every single one of my software projects and/or sets of documents fails to work with LLM due to this issue. I use repomix or files-to-prompt, then cat output.txt | llm, and Python falls over on what appears to be valid UTF-8. This is difficult to troubleshoot because the position reported seems to be unrelated to the character position reported by head or text editors.
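
For tracking down which bytes in a file might trigger this, here is a small diagnostic sketch. It assumes the problem comes from bytes that a legacy code page (cp1252 here, as a guess) cannot decode; the function names are hypothetical and not part of LLM. It reports byte offsets in the file itself, which should be easier to match against an editor than the position in the error message.

```python
def undecodable_bytes(encoding="cp1252"):
    """Return the set of single byte values the given codec cannot decode."""
    bad = set()
    for value in range(256):
        try:
            bytes([value]).decode(encoding)
        except UnicodeDecodeError:
            bad.add(value)
    return bad


def find_suspect_bytes(data, encoding="cp1252"):
    """Return (byte_offset, byte_value) pairs for bytes the codec rejects."""
    bad = undecodable_bytes(encoding)
    return [(offset, value) for offset, value in enumerate(data) if value in bad]


def first_invalid_utf8_offset(data):
    """Return None if data is valid UTF-8, else the offset of the first bad byte."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc.start
    return None


# The repro text from this issue, as the UTF-8 bytes echo would emit.
sample = "\u201cfor display\u201d\n".encode("utf-8")
print(first_invalid_utf8_offset(sample))
print([(offset, hex(value)) for offset, value in find_suspect_bytes(sample)])
```

To check a real file, pass the result of pathlib.Path("output.txt").read_bytes() to the same functions.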

I'm on Windows, using nushell or git bash. I can reproduce this when using latest on main, and also with the PR that adds the --file arg to avoid piping.

Does LLM have a very strict character encoding requirement? Are repomix/files-to-prompt expected to sanitise in some way?
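
On Windows, Python versions without UTF-8 mode enabled decode piped standard input using the system code page rather than UTF-8, so setting the environment variable PYTHONUTF8=1 (or PYTHONIOENCODING=utf-8) before running the command may be worth trying. Whether that resolves this particular error is not confirmed in this thread. The sketch below shows the same idea in a hypothetical wrapper script: read raw bytes and decode them explicitly as UTF-8 instead of relying on the platform default. The read_utf8 helper is illustrative and not part of LLM.

```python
import io
import sys


def read_utf8(stream=None):
    """Read all bytes from a binary stream and decode them strictly as UTF-8.

    Defaults to the raw bytes behind standard input, bypassing whatever
    text encoding Python picked for sys.stdin.
    """
    if stream is None:
        stream = sys.stdin.buffer
    return stream.read().decode("utf-8")


# Demonstration with an in-memory stream standing in for piped input.
piped = io.BytesIO(b"\xe2\x80\x9cfor display\xe2\x80\x9d\n")
text = read_utf8(piped)
print(ascii(text))
```

Strict decoding raises UnicodeDecodeError on input that is not UTF-8, which at least fails at the point where the bad bytes are read.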

I get that you probably can't pipe binary data into LLM, but I thought I bisected down to an emoji character causing the issue yesterday, and LLMs love outputting emoji with some prompts :)

tboby avatar Feb 23 '25 16:02 tboby

I have the same problem when piping in some code, e.g. cat js/scripts.js | llm. Instead, iconv -f ISO-8859-1 -t UTF-8 js/scripts.js | llm solved my problem.
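
For anyone without iconv, the same conversion can be sketched in Python. This assumes the file really is ISO-8859-1; the function name is hypothetical. Every byte is valid in ISO-8859-1, so the conversion never fails, but that also means a file that is already UTF-8 gets silently double-encoded, as the second example shows.

```python
def latin1_to_utf8(data):
    """Re-encode ISO-8859-1 bytes as UTF-8, like iconv -f ISO-8859-1 -t UTF-8."""
    return data.decode("iso-8859-1").encode("utf-8")


# A genuine ISO-8859-1 input: "café" with é stored as the single byte 0xE9.
converted = latin1_to_utf8(b"caf\xe9")
print(converted)

# An input that was already UTF-8: é (C3 A9) is misread as two characters
# and comes out as mojibake instead of raising an error.
double_encoded = latin1_to_utf8("\u00e9".encode("utf-8"))
print(double_encoded)
```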

laiconsulting avatar Mar 20 '25 12:03 laiconsulting

Facing the same thing, also on Windows. I ran the following PowerShell command to strip all non-ASCII characters from my input text, but you should confirm this doesn't remove too much:

(Get-Content input.txt -Raw) -replace '[^\u0000-\u007F]', '' | Set-Content output.txt
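
A rough Python counterpart, for comparison. The first function mirrors the command above by dropping everything outside ASCII. The second is a narrower alternative that drops only lone surrogates, the characters named in the error message, and leaves curly quotes and emoji intact. Both function names are hypothetical, and the second only helps if the text has already been decoded into a Python string containing surrogates.

```python
def strip_non_ascii(text):
    """Drop every character outside the ASCII range (U+0000 to U+007F)."""
    return text.encode("ascii", errors="ignore").decode("ascii")


def drop_lone_surrogates(text):
    """Drop only characters that UTF-8 cannot encode, i.e. lone surrogates."""
    return text.encode("utf-8", errors="ignore").decode("utf-8")


# Curly quote, a lone surrogate as seen in the error, and an emoji.
sample_text = "\u201cfor display\udc9d \U0001F600"

print(ascii(strip_non_ascii(sample_text)))
print(ascii(drop_lone_surrogates(sample_text)))
```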

cotsuka avatar Mar 20 '25 17:03 cotsuka

I also ran into this on Windows, likely due to LLM-inserted emojis. @cotsuka's PowerShell command resolved it for me.

chriscarrollsmith avatar May 30 '25 16:05 chriscarrollsmith