
Repomix silently skips files containing undecodable characters (�) or malformed UTF-8 input

Open m0j0mada opened this issue 5 months ago • 3 comments

Description

Repomix skips input files that contain undecodable or malformed characters, such as the Unicode replacement character (�, U+FFFD), without logging an error or warning. This creates a false impression that all files were successfully processed when in fact some were silently excluded.

# Create a clean UTF-8 file
echo 'test' > file.txt
file -I file.txt
# => text/plain; charset=us-ascii

# Run Repomix (file is processed)
repomix
# => Total Files: 1 files

# Append a problematic character
echo '�' >> file.txt
file -I file.txt
# => text/plain; charset=utf-8

# Run Repomix again (file silently skipped)
repomix
📦 Repomix v1.2.1

✔ Packing completed successfully!

📈 Top 5 Files by Token Count:
──────────────────────────────────────────────────

🔎 Security Check:
──────────────────
✔ No suspicious files detected.

📊 Pack Summary:
────────────────
  Total Files: 0 files
 Total Tokens: 323 tokens
  Total Chars: 1,540 chars
       Output: repomix-output.md
     Security: ✔ No suspicious files detected

🎉 All Done!
Your repository has been successfully packed.

Expected Behavior: Repomix should emit a clear error or warning when a file is skipped due to encoding or character issues, ideally with the filename and line number.

Actual Behavior: The file is excluded without notice, and the file count drops unexpectedly. No error, warning, or debug message is shown.

Usage Context

Repomix CLI

Repomix Version

v1.2.1

Node.js Version

No response

m0j0mada, Jul 29 '25 22:07

Hi, @m0j0mada! Thank you for reporting this issue!

I suspect there might be a false positive in our binary file detection. Let me investigate this as a potential bug.

I'm also considering whether we should explicitly display files that are excluded due to binary detection. Currently, we perform binary detection in two stages: first by file extension, then by content analysis. If we were to add logging, it would likely be for the content-based detection since that's where the unexpected behavior occurs.

For example, we could show a warning like "File with .txt extension was detected as binary and skipped" to make this behavior more transparent to users.
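To make the two-stage idea concrete, here is a minimal sketch of extension-based detection followed by content-based detection, with a warning emitted only in the second stage. This is not Repomix's actual code: `classify`, `hasBinaryExtension`, `looksBinary`, the extension list, and the warning wording are all hypothetical, and the content check (a NUL byte or invalid UTF-8) is just one plausible heuristic.

```typescript
// Hypothetical sketch, NOT Repomix's implementation. All names are assumptions.
type SkipReason = "binary-extension" | "binary-content";

interface Verdict {
  path: string;
  skipped: boolean;
  reason?: SkipReason;
}

// Stage 1: a tiny illustrative list of extensions treated as binary.
const BINARY_EXTENSIONS = [".png", ".jpg", ".zip", ".exe"];

function hasBinaryExtension(path: string): boolean {
  const dot = path.lastIndexOf(".");
  if (dot < 0) return false;
  return BINARY_EXTENSIONS.indexOf(path.slice(dot).toLowerCase()) >= 0;
}

// Stage 2: treat content as binary if it contains a NUL byte
// or cannot be decoded as strict UTF-8.
function looksBinary(bytes: Uint8Array): boolean {
  for (let i = 0; i < bytes.length; i++) {
    if (bytes[i] === 0) return true;
  }
  try {
    // With fatal: true, decode() throws on malformed input
    // instead of substituting U+FFFD.
    new TextDecoder("utf-8", { fatal: true }).decode(bytes);
    return false;
  } catch {
    return true;
  }
}

function classify(
  path: string,
  bytes: Uint8Array,
  warn: (message: string) => void,
): Verdict {
  if (hasBinaryExtension(path)) {
    // Expected exclusion: no warning needed.
    return { path, skipped: true, reason: "binary-extension" };
  }
  if (looksBinary(bytes)) {
    // Unexpected exclusion: tell the user which file was dropped.
    warn(`File ${path} was detected as binary by content inspection and skipped`);
    return { path, skipped: true, reason: "binary-content" };
  }
  return { path, skipped: false };
}
```

Under this particular heuristic, a file containing a correctly encoded U+FFFD is still valid UTF-8 and would be kept, while a file with genuinely malformed bytes would be skipped with a warning naming it.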

yamadashy, Aug 11 '25 15:08

@m0j0mada

> Repomix should emit a clear error or warning when a file is skipped due to encoding or character issues, ideally with the filename and line number.

This feature has been implemented and released in v1.4.0!

https://github.com/yamadashy/repomix/releases/tag/v1.4.0

Example:

📄 Binary Files Detected:
─────────────────────────
3 files detected as binary by content inspection:
1. config/corrupted.txt
2. data/malformed.json  
3. logs/output.log

These files have been excluded from the output.
Please review these files if you expected them to contain text content.

Identifying the exact line number is a bit difficult, so I'd like to address that in a future update.
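As a rough illustration of how line numbers might be recovered, the sketch below decodes the bytes leniently (so malformed sequences become U+FFFD) and reports every 1-based line that contains U+FFFD. It is an assumption-laden sketch, not how Repomix works: `findSuspectLines` is a hypothetical helper, and it cannot tell a literal U+FFFD already present in the file apart from one produced by a decoding error.

```typescript
// Hypothetical helper, NOT part of Repomix.
// Returns the 1-based numbers of lines that contain U+FFFD after lenient decoding.
function findSuspectLines(bytes: Uint8Array): number[] {
  // The default (non-fatal) decoder replaces malformed sequences with U+FFFD.
  const text = new TextDecoder("utf-8").decode(bytes);
  const lines = text.split("\n");
  const hits: number[] = [];
  for (let i = 0; i < lines.length; i++) {
    if (lines[i].indexOf("\uFFFD") >= 0) {
      hits.push(i + 1);
    }
  }
  return hits;
}
```

For the reproduction in this issue (a file with `test` on line 1 and `�` on line 2), this helper would return `[2]`.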

Please give it a try!

yamadashy, Aug 23 '25 07:08

@m0j0mada Did you have time to check the latest release 1.5.0?

reneleonhardt, Sep 18 '25 08:09