[BUG] UTF-8 Corruption Bug
Environment
- Platform (select one):
- [ ] Anthropic API
- [ ] AWS Bedrock
- [ ] Google Vertex AI
- [X] Other: Anthropic Max subscription
- Claude CLI version: 1.0.16 (Claude Code)
- Operating System: Arch Linux 6.14.7-arch2-1 x86_64
- Terminal: Konsole
- Python: 3.13.3
- Locale: en_US.UTF-8 (properly configured)
- Tool: Claude Code Write tool
Bug Description
The Claude Code Write tool intermittently corrupts Unicode characters, resulting in data loss for non-English content. The issue affects file write operations and occasionally disrupts user workflows.
Problem Summary
The Write tool occasionally corrupts Unicode characters when creating files, specifically:
- Russian Cyrillic text becomes unreadable gibberish (Dialog-First ?>4E>4** - 15+ ?0BB5@=>2 B@8335@>2 4;O 02B><0B8G5A:>3> >1=0@C65=8O 206=>9 8=D>@<0F88)
- Emoji are replaced with control characters
- English text remains unaffected
- Issue occurs from time to time (intermittent but recurring)
Impact
This bug causes actual data corruption requiring manual file recovery. For users working with non-English content, this represents a significant reliability issue that affects productivity and trust in the tool.
Technical Details
I have prepared a comprehensive bug report with:
- Complete system environment details
- Technical analysis of the corruption pattern
- File examples (corrupted vs corrected)
- Hex dumps and encoding analysis
- Reproduction information
- Root cause hypothesis
The evidence suggests UTF-8 bytes are being incorrectly interpreted as CP1251/Windows-1251 encoding during file write operations, likely due to a race condition or intermittent encoding handling bug.
Steps to Reproduce
- Use the Write tool to create a file containing non-English characters, such as Cyrillic text or emoji (the corruption occurs very rarely)
Expected Behavior
- 🔍 Dialog-First подход - 15+ паттернов триггеров для автоматического обнаружения важной информации
- 🔄 Smart Update логика - интеллектуальная обработка изменений (ADD, UPDATE, DELETE, MERGE)
- 🗣️ Natural Language навигация - естественное взаимодействие ("покажи вчерашнее обсуждение")
Actual Behavior
- = Dialog-First ?>4E>4 - 15+ ?0BB5@=>2 B@8335@>2 4;O 02B><0B8G5A:>3> >1=0@C65=8O 206=>9 8=D>@<0F88
- = Smart Update ;>38:0 - 8=B5;;5:BC0;L=0O >1@01>B:0 87<5=5=89 (ADD, UPDATE, DELETE, MERGE)
- =ã Natural Language =02830F8O - 5AB5AB25==>5 2708<>459AB285 ("?>:068 2G5@0H=55 >1AC645=85")
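The Expected and Actual samples above can be checked against the CP1251 hypothesis. The following is a minimal sketch, not a confirmed diagnosis: it compares two hypothetical corruption models against one word from the sample. The function names are illustrative and are not part of Claude Code. For this sample, keeping only the low byte of each UTF-16 code unit reproduces the reported gibberish, while reading UTF-8 bytes as CP1251 produces different text.

```python
def low_byte_of_utf16_units(text: str) -> str:
    """Hypothetical model: each UTF-16 code unit is truncated to its low byte."""
    utf16 = text.encode("utf-16-le")
    # Little-endian layout puts the low byte of every code unit at even offsets.
    return utf16[0::2].decode("latin-1")


def utf8_read_as_cp1251(text: str) -> str:
    """Hypothetical model: UTF-8 bytes are decoded as Windows-1251."""
    return text.encode("utf-8").decode("cp1251", errors="replace")


expected = "подход"
reported = "?>4E>4"  # copied from the Actual Behavior sample

print(low_byte_of_utf16_units(expected) == reported)  # True
print(utf8_read_as_cp1251(expected) == reported)      # False
print(repr(low_byte_of_utf16_units("🔍")))             # '=\r'
```

The emoji case yields "=" followed by a carriage return, which fits the observation that emoji are replaced with control characters. This only shows that the samples are consistent with low-byte truncation; where such a truncation would happen in the tool is not established here.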
Additional Context
Important! Claude Code can fix this error on its own if it notices that something in the file is written incorrectly. However, it once suspended the work, saying that I was violating the usage rules, on the grounds that I was allegedly trying to exploit vulnerabilities (lol)
Attachments: report_unicode.md (detailed description), README.md, README_corrupted.md
Yeah. This also happens to me when generating .md files
## "5E=>;>38G5A:89 AB5:
### Core Technologies
- **Blade Templates** - Laravel templating engine 4;O server-side rendering
- **LESS** - CSS ?@5?@>F5AA>@ 4;O AB@C:BC@8@>20==>3> =0?8A0=8O AB8;59
- **Vite** - !>2@5<5==K9 A1>@I8: 0AA5B>2 8 dev-A5@25@
- **PurgeCSS** - #40;5=85 =58A?>;L7C5<KE AB8;59 2 ?@>40:H5=5
- **Inter Font** - A=>2=>9 A8AB5<=K9 H@8DB (Regular, Medium, SemiBold, Bold)
### Asset Pipeline
- **Development**: Live reload G5@57 Vite dev server
- **Production**: 8=8D8:0F8O 8 >?B8<870F8O 0AA5B>2
- **CSS Inlining**: @8B8G5A:85 AB8;8 8=;09=OBAO 2 `<head>` 4;O ?@>872>48B5;L=>AB8
- **Image Optimization**: SVG 8:>=:8 8 >?B8<878@>20==K5 87>1@065=8O
## @E8B5:BC@0 AB8;59 (LESS)
### !B@C:BC@0 D09;>2
ping
🗿
Same problem:
What's more, a single file can be half OK, half corrupted.
The only solution is to ask CC to fix it, and it does see the wrong symbols. This happens mostly in the VS Code extension rather than in the CLI.
I've noticed this in just normal claude.ai web chat with Sonnet 4.5 - It misinterprets an é for é for example.
I uploaded a valid UTF-8 file to claude.ai (Sonnet 4.5) and asked it to check grammar. Instead Claude complained about gibberish due to it bungling the encoding. E.g. it reported '"lot’s" - Should be "lot's" (incorrect apostrophe encoding)'.
Claude webapp wrote a bunch of emojis in a file and then can't even read them back in. Is Claude unable to consume Claude's content? Is UTF-8 not the default everywhere?
The Claude assistant integrated into Cursor is corrupting files with ANSI encoding (Windows-1252 or similar) during edits. Even when the original file is in ANSI, Claude forces saving in UTF-8, resulting in:
- Replacement of accented characters (e.g., ç, ã, õ, á, é) with mojibake or black boxes (�).
- Irreversible data loss in legacy files (e.g., old systems, .bat, .vbs, .ini scripts).
- Inconsistent and non-configurable behavior: it does not respect the original encoding.
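One way an editing tool could respect the original encoding is to work on raw bytes, guess the encoding once, and write back with the same one. The sketch below is an assumption about how this might be done, not how Cursor or Claude behaves. `detect_encoding` and `edit_preserving_encoding` are hypothetical helpers, and the "valid UTF-8, else Windows-1252" rule is a simple heuristic that can misjudge other legacy code pages.

```python
from pathlib import Path
from typing import Callable


def detect_encoding(raw: bytes, fallback: str = "cp1252") -> str:
    """Heuristic: treat the file as UTF-8 if it decodes cleanly, else use the fallback."""
    try:
        raw.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return fallback


def edit_preserving_encoding(
    path: str, transform: Callable[[str], str], fallback: str = "cp1252"
) -> str:
    """Apply a text transform and save the file in the encoding it was read with."""
    file_path = Path(path)
    raw = file_path.read_bytes()  # bytes, so line endings are left untouched
    encoding = detect_encoding(raw, fallback)
    new_text = transform(raw.decode(encoding))
    file_path.write_bytes(new_text.encode(encoding))
    return encoding
```

For example, `edit_preserving_encoding("legacy.ini", lambda t: t.replace("old", "new"))` would keep a Windows-1252 file in Windows-1252 (the file name here is made up). Pure ASCII files are reported as UTF-8, which is harmless because the bytes are identical in both encodings.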
Claude Code 2.0.36: it corrupts the data and then treats the result as something that "appears to violate our Usage Policy". LOL
∴ Thinking…
Great! I see the problem. After line 148 there should be }, (with a comma), but instead there is } (without a comma). This means that the separating comma between the 7th and 8th objects
is missing.
Line 148: } - this is the end of the 7th object, but it should be },
Fixing the file.
● Update(/home/kladmv/ClaudePrompts/yadro/tatlin-object-docs-training/results/generation/day_3_setup_ops_codex_087.json)
⎿ Error: No changes to make: old_string and new_string are exactly the same.
∴ Thinking…
Oops, I copied identical text. I need to look at line 148 properly. Let's read this area once more.
● Bash(sed -n '146,151p' results/generation/day_3_setup_ops_codex_087.json | cat -A)
⎿ "domains": ["security", "user-management", "UX"]$
}$
},$
{$
"question": "M-PM-^ZM-PM-0M-PM-:M-PM-8M-PM-5 endpoints M-QM-^AM-PM-2M-QM-^OM-PM-7M-PM-0M-PM-=M-QM-^K M-QM-^A auto-lock
M-PM-:M-PM->M-PM-=M-QM-^DM-PM-8M-PM-3M-QM-^CM-QM-^@M-PM-0M-QM-^FM-PM-8M-PM-5M-PM-9 M-PM-4M-PM-;M-QM-^O users, M-PM-8 M-PM-:M-PM-0M-PM-: M-PM->M-PM-=M-PM-8
M-QM-^@M-PM-0M-PM-7M-PM-;M-PM-8M-QM-^GM-PM-0M-QM-^NM-QM-^BM-QM-^AM-QM-^O M-PM->M-QM-^B manual lock endpoints M-PM-8M-PM-7
M-PM-?M-QM-^@M-PM-5M-PM-4M-QM-^KM-PM-4M-QM-^CM-QM-^IM-PM-8M-QM-^E M-QM-^DM-QM-^@M-PM-0M-PM-3M-PM-<M-PM-5M-PM-=M-QM-^BM-PM->M-PM-2
M-PM-<M-PM-0M-QM-^BM-QM-^@M-PM-8M-QM-^FM-QM-^K?",$
"question_type": "factual",$
⎿ API Error: Claude Code is unable to respond to this request, which appears to violate our Usage Policy (https://www.anthropic.com/legal/aup). Please double press esc to edit
your last message or start a new session for Claude Code to assist with a different task. If you are seeing this refusal repeatedly, try running /model
claude-sonnet-4-20250514 to switch models.
∴ Thinking…
Great! Now I see it. Line 148 does have }, (with a comma!). But the problem is in the encoding - cat
> What about the agents?
⎿ API Error: Claude Code is unable to respond to this request, which appears to violate our Usage Policy (https://www.anthropic.com/legal/aup). Please double press esc to edit
your last message or start a new session for Claude Code to assist with a different task. If you are seeing this refusal repeatedly, try running /model
claude-sonnet-4-20250514 to switch models.
∴ Thinking…
The user is asking about the status of the agents. Let's look at what happened:
1. I launched 3 parallel agents to process documents 85
> Huh?
⎿ API Error: Claude Code is unable to respond to this request, which appears to violate our Usage Policy (https://www.anthropic.com/legal/aup). Please double press esc to edit
your last message or start a new session for Claude Code to assist with a different task. If you are seeing this refusal repeatedly, try running /model
claude-sonnet-4-20250514 to switch models.
∴ Thinking…
The user is asking "Huh?" after I received the Usage Policy violation error. Let's figure out what happened:
1. I launched 3 parallel agents to process documents 85-100
2. All 3 agents completed successfully and returned reports
3. I sta
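The `M-P`, `M-^Z` sequences in the `cat -A` output above are GNU cat's escaped display of bytes with the high bit set, not necessarily corruption in the file. A minimal sketch of a decoder for that notation follows. `decode_cat_v` is a hypothetical helper that assumes ASCII input copied from `cat -A` or `cat -v` and cannot tell an escaped byte from a literal `M-` or `^` that happened to be in the file.

```python
def decode_cat_v(escaped: str) -> bytes:
    """Turn cat -v style escapes (M-x, ^X, M-^X) back into raw bytes."""
    out = bytearray()
    i = 0
    while i < len(escaped):
        high_bit = 0
        if escaped.startswith("M-", i) and i + 2 < len(escaped):
            high_bit = 0x80  # "M-" marks a byte with the high bit set
            i += 2
        if escaped[i] == "^" and i + 1 < len(escaped):
            ch = escaped[i + 1]
            # "^?" is DEL (0x7F); "^X" is the control character ord(X) XOR 0x40.
            out.append(high_bit | (0x7F if ch == "?" else ord(ch) ^ 0x40))
            i += 2
        else:
            out.append(high_bit | ord(escaped[i]))
            i += 1
    return bytes(out)


fragment = "M-PM-^ZM-PM-0M-PM-:M-PM-8M-PM-5 endpoints"  # start of the "question" value
print(decode_cat_v(fragment).decode("utf-8"))  # Какие endpoints
```

The fragment decodes to readable Cyrillic, which suggests that this part of the JSON file held valid UTF-8 and that the unreadable text in the transcript was the escaped display itself.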
Encountered a variant of this bug where the Write tool injected 26 null bytes when writing Unicode arrow characters (→, U+2192) to a new README.md file. The null bytes were scattered throughout the file, not just near the corrupted characters.
Result:
- Arrow chars corrupted to � (U+FFFD)
- File classified as binary data instead of UTF-8 text
- Git treated it as binary: Binary files differ
Verification:
file README.md # data
python3 -c "print(open('README.md','rb').read().count(b'\x00'))" # 26
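The same checks can be bundled into one helper and run right after a write. This is a sketch under the assumption that the file is meant to be UTF-8 text; `inspect_file_bytes` is an illustrative name, not an existing tool.

```python
def inspect_file_bytes(raw: bytes) -> dict:
    """Report signs of the corruption described above in a file's raw bytes."""
    try:
        raw.decode("utf-8")
        valid_utf8 = True
    except UnicodeDecodeError:
        valid_utf8 = False
    return {
        "null_bytes": raw.count(b"\x00"),
        # U+FFFD is stored as the three bytes EF BF BD in UTF-8.
        "replacement_chars": raw.count("\ufffd".encode("utf-8")),
        "valid_utf8": valid_utf8,
    }


with open(__file__, "rb") as fh:  # swap in the path of the file to check
    print(inspect_file_bytes(fh.read()))
```

A file can be valid UTF-8 and still contain null bytes or U+FFFD, so all three fields are reported separately.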