utf-8 codec error: surrogates not allowed
Sending certain UTF-8 characters to LLM causes a vague fatal error.
Minimal repro (note: copy the text exactly; the quotes aren't ASCII):
echo “for display” | llm -m flash2
Error: 'utf-8' codec can't encode character '\udc9d' in position 56: surrogates not allowed
Pretty much every one of my software projects and document sets fails to work with LLM because of this issue. I use repomix or files-to-prompt, then cat output.txt | llm, then Python falls over on what appears to be valid UTF-8. This is difficult to troubleshoot because the position in the error message doesn't seem to match the character position reported by head or by text editors.
I'm on Windows, using nushell or git bash. I can reproduce this when using latest on main, and also with the PR that adds the --file arg to avoid piping.
Does LLM have a very strict character encoding requirement? Are repomix/files-to-prompt expected to sanitise in some way?
I get that you probably can't pipe binary data into LLM, but I thought I bisected down to an emoji character causing the issue yesterday, and LLMs love outputting emoji with some prompts :)
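For anyone trying to make sense of the error message, here is a minimal Python sketch of one way a '\udc9d' surrogate can arise from the closing curly quote in the repro. It rests on an assumption that is not confirmed anywhere in this thread: that the piped input is decoded with the Windows ANSI code page (cp1252) using the surrogateescape error handler. The helper name try_encode is made up for illustration and is not part of LLM.

```python
# Hypothetical reconstruction of the failure; not taken from LLM's source code.

# U+201D RIGHT DOUBLE QUOTATION MARK is three bytes in UTF-8: e2 80 9d.
raw = "\u201d".encode("utf-8")

# ASSUMPTION: stdin is decoded as cp1252 with errors="surrogateescape".
# Byte 0x9d has no mapping in cp1252, so it is smuggled through as the
# lone surrogate U+DC9D instead of raising a decode error.
text = raw.decode("cp1252", errors="surrogateescape")


def try_encode(s: str) -> str:
    """Return 'ok', or the UnicodeEncodeError message from encoding s as UTF-8."""
    try:
        s.encode("utf-8")
        return "ok"
    except UnicodeEncodeError as exc:
        return str(exc)


message = try_encode(text)
print(ascii(text))  # show the decoded string using escapes only
print(message)
```

If this is what is happening, it may also explain the confusing position: the number is an index into the mis-decoded string, where each multi-byte UTF-8 character has become several characters, not a byte or character offset in the original file. Python's UTF-8 mode (environment variable PYTHONUTF8=1) makes stdin default to UTF-8, so it may be worth trying, though that is untested here.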
I have the same problem when piping in code, e.g. cat js/scripts.js | llm
Running iconv -f ISO-8859-1 -t UTF-8 js/scripts.js | llm instead solved my problem
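A rough Python sketch of what that iconv conversion does, to show why it never fails and what it costs. The function name latin1_to_utf8 is hypothetical. ISO-8859-1 assigns a character to every byte value, so any input decodes cleanly, but a file that was already UTF-8 comes out as valid UTF-8 with its non-ASCII characters garbled.

```python
def latin1_to_utf8(data: bytes) -> bytes:
    """Mimic `iconv -f ISO-8859-1 -t UTF-8`: reinterpret each byte as one character."""
    # Every byte 0x00-0xFF maps to the code point of the same value,
    # so this decode can never raise.
    return data.decode("iso-8859-1").encode("utf-8")


# A genuine Latin-1 file converts correctly.
latin1_input = b"caf\xe9"
print(ascii(latin1_to_utf8(latin1_input).decode("utf-8")))

# Input that was already UTF-8 is double-encoded: the result is valid UTF-8,
# but the curly quotes no longer read back as curly quotes.
utf8_input = "\u201cfor display\u201d".encode("utf-8")
converted = latin1_to_utf8(utf8_input)
print(ascii(converted.decode("utf-8")))
```

So the workaround is safe in the sense that LLM receives encodable text, but the model may see mojibake in place of quotes, accents, or emoji if the source file was UTF-8 to begin with.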
Facing the same thing, also on Windows. I ran the following PowerShell command to strip all non-ASCII characters from my input text, but you should confirm this doesn't remove too much:
(Get-Content input.txt -Raw) -replace '[^\u0000-\u007F]', '' | Set-Content output.txt
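A narrower alternative, sketched in Python under the assumption that the text is already in a Python string: drop only lone surrogate code points (U+D800 to U+DFFF), which are the only characters the UTF-8 encoder rejects, and keep accents and emoji. The function name drop_lone_surrogates is hypothetical and not part of LLM or any tool mentioned here.

```python
def drop_lone_surrogates(text: str) -> str:
    """Remove only code points in U+D800-U+DFFF, which UTF-8 cannot encode."""
    return "".join(ch for ch in text if not 0xD800 <= ord(ch) <= 0xDFFF)


# Accented letter, emoji, and one lone surrogate like the one in the error.
sample = "caf\u00e9 \U0001F600 bad\udc9d"
cleaned = drop_lone_surrogates(sample)

# Encoding now succeeds; the accent and the emoji survive.
encoded = cleaned.encode("utf-8")
print(ascii(cleaned))
```

This removes less than the ASCII-only filter, but it only hides the symptom: if the surrogate came from a mis-decoded byte, the characters around it may still be garbled.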
I also ran into this on Windows, likely due to LLM-inserted emojis. @cotsuka's PowerShell command resolved it for me.