[BUG] UTF-8 Corruption Bug

Open sumlin opened this issue 6 months ago • 12 comments

Environment

  • Platform (select one):
    • [ ] Anthropic API
    • [ ] AWS Bedrock
    • [ ] Google Vertex AI
    • [X] Other: Anthropic Max subscription
  • Claude CLI version: 1.0.16 (Claude Code)
  • Operating System: Arch Linux (Linux 6.14.7-arch2-1 x86_64)
  • Terminal: Konsole
  • Python: 3.13.3
  • Locale: en_US.UTF-8 (properly configured)
  • Tool: Claude Code Write tool

Bug Description

The Claude Code Write tool intermittently corrupts Unicode characters, resulting in data loss for non-English content. The issue affects file operations and disrupts user workflows.

Problem Summary

The Write tool occasionally corrupts Unicode characters when creating files, specifically:

  • Russian Cyrillic text becomes unreadable gibberish (Dialog-First ?>4E>4 - 15+ ?0BB5@=>2 B@8335@>2 4;O 02B><0B8G5A:>3> >1=0@C65=8O 206=>9 8=D>@<0F88)
  • Emoji are replaced with control characters
  • English text remains unaffected
  • The issue occurs from time to time (intermittent but recurring)

Impact

This bug causes actual data corruption requiring manual file recovery. For users working with non-English content, this represents a significant reliability issue that affects productivity and trust in the tool.

Technical Details

I have prepared a comprehensive bug report with:

  • Complete system environment details
  • Technical analysis of the corruption pattern
  • File examples (corrupted vs corrected)
  • Hex dumps and encoding analysis
  • Reproduction information
  • Root cause hypothesis

The evidence suggests UTF-8 bytes are being incorrectly interpreted as CP1251/Windows-1251 encoding during file write operations, likely due to a race condition or intermittent encoding handling bug.

Steps to Reproduce

  1. Use the Write tool to write a file containing non-ASCII characters such as Cyrillic text or emoji (the corruption occurs only rarely)

Expected Behavior

  • 🔍 Dialog-First подход - 15+ паттернов триггеров для автоматического обнаружения важной информации
  • 🔄 Smart Update логика - интеллектуальная обработка изменений (ADD, UPDATE, DELETE, MERGE)
  • 🗣️ Natural Language навигация - естественное взаимодействие ("покажи вчерашнее обсуждение")

Actual Behavior

  • = Dialog-First ?>4E>4 - 15+ ?0BB5@=>2 B@8335@>2 4;O 02B><0B8G5A:>3> >1=0@C65=8O 206=>9 8=D>@<0F88
  • = Smart Update ;>38:0 - 8=B5;;5:BC0;L=0O >1@01>B:0 87<5=5=89 (ADD, UPDATE, DELETE, MERGE)
  • =ã Natural Language =02830F8O - 5AB5AB25==>5 2708<>459AB285 ("?>:068 2G5@0H=55 >1AC645=85")
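Comparing the Expected and Actual lines above character by character, the corrupted output matches what you get when each UTF-16 code unit is cut down to its low byte (for example п, U+043F, becomes ?, 0x3F). The sketch below reproduces that pattern. It is an observation about these samples only, not a confirmed root cause, and the function name `keep_low_bytes` is made up for illustration.

```python
def keep_low_bytes(text: str) -> str:
    """Simulate the observed corruption: keep only the low byte of each UTF-16 code unit."""
    utf16 = text.encode("utf-16-le")   # 2 bytes per code unit, low byte first
    low = utf16[0::2]                  # drop every high byte
    return low.decode("latin-1")       # map each remaining byte 1:1 to a character


print(keep_low_bytes("Dialog-First подход - 15+ паттернов"))
# Dialog-First ?>4E>4 - 15+ ?0BB5@=>2
print(repr(keep_low_bytes("🔍")))
# '=\r'  (an equals sign followed by a control character)
```

ASCII passes through unchanged because its high byte is already zero, which fits the report that English text is unaffected. Because the high byte is discarded, the original text cannot be recovered reliably from the corrupted file alone.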

Additional Context

Important! Claude Code is able to fix this error on its own if it notices that something in the file is written incorrectly. But on one occasion it suspended work, claiming that I was violating the application's usage rules because I was allegedly trying to exploit vulnerabilities (lol).

  • report_unicode.md - detailed description
  • README.md
  • README_corrupted.md

Jun 06 '25 16:06 sumlin

Yeah. This also happens to me when generating .md files

## "5E=>;>38G5A:89 AB5:

### Core Technologies
- **Blade Templates** - Laravel templating engine 4;O server-side rendering
- **LESS** - CSS ?@5?@>F5AA>@ 4;O AB@C:BC@8@>20==>3> =0?8A0=8O AB8;59
- **Vite** - !>2@5<5==K9 A1>@I8: 0AA5B>2 8 dev-A5@25@
- **PurgeCSS** - #40;5=85 =58A?>;L7C5<KE AB8;59 2 ?@>40:H5=5
- **Inter Font** - A=>2=>9 A8AB5<=K9 H@8DB (Regular, Medium, SemiBold, Bold)

### Asset Pipeline
- **Development**: Live reload G5@57 Vite dev server
- **Production**: 8=8D8:0F8O 8 >?B8<870F8O 0AA5B>2
- **CSS Inlining**: @8B8G5A:85 AB8;8 8=;09=OBAO 2 `<head>` 4;O ?@>872>48B5;L=>AB8
- **Image Optimization**: SVG 8:>=:8 8 >?B8<878@>20==K5 87>1@065=8O

## @E8B5:BC@0 AB8;59 (LESS)

### !B@C:BC@0 D09;>2

Jun 16 '25 04:06 volkv

ping

Image

Jul 01 '25 16:07 cq-esports

Image

🗿

Jul 06 '25 08:07 volkv

Image

Jul 15 '25 17:07 volkv

Same problem:

Image

What's more, a single file can be half fine and half corrupted.

The only solution is to ask CC to fix it, and it does see the wrong symbols. This happens mostly in the VS Code extension rather than in the CLI.

Oct 08 '25 09:10 IAkumaI

I've noticed this in just normal claude.ai web chat with Sonnet 4.5 - It misinterprets an é for é for example.

Oct 10 '25 11:10 mikehdt

> I've noticed this in just normal claude.ai web chat with Sonnet 4.5 - It misinterprets an é for é for example.

I uploaded a valid UTF-8 file to claude.ai (Sonnet 4.5) and asked it to check grammar. Instead Claude complained about gibberish due to it bungling the encoding. E.g. it reported '"lot’s" - Should be "lot's" (incorrect apostrophe encoding)'.
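A generic way for é or ’ to turn into gibberish is for UTF-8 bytes to be decoded as Windows-1252 somewhere along the way. Whether that is what happens inside claude.ai is an assumption. The sketch only shows what this classic mojibake looks like and how it can be reversed when no bytes were lost. The names `garble` and `ungarble` are illustrative.

```python
def garble(text: str) -> str:
    """Encode as UTF-8, then wrongly decode the bytes as Windows-1252."""
    return text.encode("utf-8").decode("cp1252")


def ungarble(text: str) -> str:
    """Reverse garble(); raises if the text was not produced that way."""
    return text.encode("cp1252").decode("utf-8")


print(garble("é"))            # Ã©
print(garble("lot’s"))        # lotâ€™s
print(ungarble("lotâ€™s"))    # lot’s
```

Python's `cp1252` codec leaves a few byte values undefined, so `garble` can raise `UnicodeDecodeError` for some characters. The round trip is only a diagnostic aid, not a general repair tool.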

Oct 25 '25 20:10 skierpage

Claude webapp wrote a bunch of emojis in a file and then can't even read them back in. Is Claude unable to consume Claude's content? Is UTF-8 not the default everywhere?

Oct 28 '25 17:10 ckrailo

The Claude assistant integrated into Cursor is corrupting files with ANSI encoding (Windows-1252 or similar) during edits. Even when the original file is in ANSI, Claude forces saving in UTF-8, resulting in:

  • Replacement of accented characters (e.g., ç, ã, õ, á, é) with mojibake or black boxes (�).
  • Irreversible data loss in legacy files (e.g., old systems, .bat, .vbs, .ini scripts).
  • Inconsistent and non-configurable behavior: it does not respect the original encoding.
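For comparison, an edit that respects the original encoding could look roughly like the sketch below. It is a minimal illustration, not how Cursor or Claude actually edits files: it assumes a file that fails to decode as UTF-8 is in a single legacy code page (Windows-1252 by default here), and the names `read_with_encoding` and `edit_in_place` are hypothetical.

```python
from pathlib import Path


def read_with_encoding(path, fallback="cp1252"):
    """Return (text, encoding); try UTF-8 first, then the assumed legacy code page."""
    raw = Path(path).read_bytes()
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        return raw.decode(fallback), fallback


def edit_in_place(path, transform, fallback="cp1252"):
    """Apply transform(text) and write the result back in the encoding it was read in."""
    text, encoding = read_with_encoding(path, fallback)
    # Bytes are read and written directly, so line endings such as CRLF are kept.
    Path(path).write_bytes(transform(text).encode(encoding))
    return encoding
```

Guessing the legacy code page is the weak point of this approach. A real tool would more plausibly let the user configure the encoding per file or per project.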

Oct 30 '25 00:10 leandrodsilvamg-code

Claude Code 2.0.36: it corrupts the data and then treats the result as something that "appears to violate our Usage Policy". LOL


∴ Thinking…

  Great! I see the problem. After line 148 there should be }, (with a comma), but instead there is } (without a comma). This means that the separator
  comma is missing between the 7th and 8th objects.

  Line 148:   } - this is the end of the 7th object, but it should be   },

  Fixing the file.

● Update(/home/kladmv/ClaudePrompts/yadro/tatlin-object-docs-training/results/generation/day_3_setup_ops_codex_087.json)
  ⎿  Error: No changes to make: old_string and new_string are exactly the same.

∴ Thinking…

  Oops, I copied identical text. I need to look at line 148 properly. Let's read this region once more.

● Bash(sed -n '146,151p' results/generation/day_3_setup_ops_codex_087.json | cat -A)
  ⎿        "domains": ["security", "user-management", "UX"]$                                                                                                    
         }$
       },$
       {$
         "question": "M-PM-^ZM-PM-0M-PM-:M-PM-8M-PM-5 endpoints M-QM-^AM-PM-2M-QM-^OM-PM-7M-PM-0M-PM-=M-QM-^K M-QM-^A auto-lock
     M-PM-:M-PM->M-PM-=M-QM-^DM-PM-8M-PM-3M-QM-^CM-QM-^@M-PM-0M-QM-^FM-PM-8M-PM-5M-PM-9 M-PM-4M-PM-;M-QM-^O users, M-PM-8 M-PM-:M-PM-0M-PM-: M-PM->M-PM-=M-PM-8
     M-QM-^@M-PM-0M-PM-7M-PM-;M-PM-8M-QM-^GM-PM-0M-QM-^NM-QM-^BM-QM-^AM-QM-^O M-PM->M-QM-^B manual lock endpoints M-PM-8M-PM-7
     M-PM-?M-QM-^@M-PM-5M-PM-4M-QM-^KM-PM-4M-QM-^CM-QM-^IM-PM-8M-QM-^E M-QM-^DM-QM-^@M-PM-0M-PM-3M-PM-<M-PM-5M-PM-=M-QM-^BM-PM->M-PM-2
     M-PM-<M-PM-0M-QM-^BM-QM-^@M-PM-8M-QM-^FM-QM-^K?",$
         "question_type": "factual",$
  ⎿ API Error: Claude Code is unable to respond to this request, which appears to violate our Usage Policy (https://www.anthropic.com/legal/aup). Please double press esc to edit 
    your last message or start a new session for Claude Code to assist with a different task. If you are seeing this refusal repeatedly, try running /model 
    claude-sonnet-4-20250514 to switch models.

∴ Thinking…

  Great! Now I see. Line 148 does have   }, (with a comma!). But the problem is the encoding - cat

> What about the agents? 
  ⎿ API Error: Claude Code is unable to respond to this request, which appears to violate our Usage Policy (https://www.anthropic.com/legal/aup). Please double press esc to edit 
    your last message or start a new session for Claude Code to assist with a different task. If you are seeing this refusal repeatedly, try running /model 
    claude-sonnet-4-20250514 to switch models.

∴ Thinking…

  The user is asking about the status of the agents. Let's see what happened:

  1. I launched 3 parallel agents to process documents 85

> Huh? 
  ⎿ API Error: Claude Code is unable to respond to this request, which appears to violate our Usage Policy (https://www.anthropic.com/legal/aup). Please double press esc to edit 
    your last message or start a new session for Claude Code to assist with a different task. If you are seeing this refusal repeatedly, try running /model 
    claude-sonnet-4-20250514 to switch models.

∴ Thinking…

  The user is asking "Huh?" after I received the Usage Policy violation error. Let's work out what happened:

  1. I launched 3 parallel agents to process documents 85-100
  2. All 3 agents completed successfully and returned reports
  3. I sta
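The `M-PM-^Z...` runs in the `cat -A` output of the transcript above are GNU cat's visible notation for bytes at or above 0x80 (`M-x` is 0x80 plus the character, `^X` is a control character). Decoding that notation back into bytes is a quick way to check what is really on disk. The sketch below is a heuristic decoder with made-up names (`decode_cat_v`, `_is_ctrl_symbol`); the notation is ambiguous for text that literally contains `M-` or `^`, so treat the result as a hint only.

```python
def _is_ctrl_symbol(c: str) -> bool:
    """Characters that can follow '^' in cat -v notation ('@'..'_' and '?')."""
    return "@" <= c <= "_" or c == "?"


def decode_cat_v(text: str) -> str:
    """Best-effort reversal of `cat -v` notation, decoded as UTF-8."""
    out = bytearray()
    i, n = 0, len(text)
    while i < n:
        if text.startswith("M-^", i) and i + 3 < n and _is_ctrl_symbol(text[i + 3]):
            out.append(0x80 | (ord(text[i + 3]) ^ 0x40))   # bytes 0x80-0x9F and 0xFF
            i += 4
        elif text.startswith("M-", i) and i + 2 < n and " " <= text[i + 2] <= "~":
            out.append(0x80 | ord(text[i + 2]))            # bytes 0xA0-0xFE
            i += 3
        elif text[i] == "^" and i + 1 < n and _is_ctrl_symbol(text[i + 1]):
            out.append(ord(text[i + 1]) ^ 0x40)            # control bytes and 0x7F
            i += 2
        else:
            out.extend(text[i].encode("utf-8"))            # ordinary character
            i += 1
    return out.decode("utf-8", errors="replace")


print(decode_cat_v("M-PM-^ZM-PM-0M-PM-:M-PM-8M-PM-5 endpoints"))
# Какие endpoints
```

The start of the `"question"` value decodes to ordinary Cyrillic, which suggests that these particular bytes were valid UTF-8 and that the unusual-looking part was only the `cat -A` rendering.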

Nov 09 '25 15:11 sumlin

Encountered a variant of this bug where the Write tool injected 26 null bytes when writing Unicode arrow characters (→, U+2192) to a new README.md file. The null bytes were scattered throughout the file, not just near the corrupted characters.

Result:

  • Arrow chars corrupted to � (U+FFFD)
  • File classified as binary data instead of UTF-8 text
  • Git treated it as binary: Binary files differ

Verification:

file README.md # data
python3 -c "print(open('README.md','rb').read().count(b'\x00'))"  # 26
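The two checks above can be combined into one small helper that also counts U+FFFD replacement characters. This is a sketch only, and `scan_file` is a hypothetical name rather than part of any existing tool.

```python
from pathlib import Path


def scan_file(path: str) -> dict:
    """Count NUL bytes and U+FFFD characters, and check UTF-8 validity."""
    raw = Path(path).read_bytes()
    try:
        text = raw.decode("utf-8")
        valid = True
    except UnicodeDecodeError:
        text = raw.decode("utf-8", errors="replace")
        valid = False
    return {
        "null_bytes": raw.count(b"\x00"),
        # Counts U+FFFD stored in the file plus any substituted for invalid bytes.
        "replacement_chars": text.count("\ufffd"),
        "valid_utf8": valid,
    }
```

NUL bytes are valid UTF-8, so a file like the one described here can pass the UTF-8 check and still be treated as binary by `file` and Git. That is why the NUL count is reported separately.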

Nov 11 '25 19:11 reidab