Add Mixed Encoding Detection and Conversion Support - Resolves #25
Pull Request Description
Overview
This PR adds mixed encoding support to enca, resolving issue #25, where files containing multiple encodings (e.g., GB2312 + UTF-8) could not be processed.
Features
- Mixed Encoding Detection (-M / --mixed-encodings)
Detects multiple encodings within one file and reports segments with offsets and lengths.
- Configurable Buffer Size (-B / --mixed-buffer-size)
Default 1024 bytes, range 1–1048576. Smaller = finer detection, larger = faster.
- Error Handling (-I / --mixed-ignore-errors)
Skips corrupted/unknown segments, falls back to predominant encoding.
- Mixed Encoding Conversion (-x with -M)
Converts each segment individually while preserving file integrity.
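To illustrate what converting each segment individually means, here is a minimal Python sketch. It is not enca's C implementation: the `convert_segments` function, the `(offset, length, encoding)` tuple layout, and the use of Python codec names are all assumptions made for illustration, and it presumes the segments were already detected and do not split a multibyte character.

```python
from typing import List, Tuple

# Hypothetical segment layout: (byte offset, byte length, encoding name).
Segment = Tuple[int, int, str]


def convert_segments(data: bytes, segments: List[Segment], target: str = "utf-8") -> bytes:
    """Decode each segment with its own encoding and re-encode it to `target`."""
    out = bytearray()
    for offset, length, encoding in segments:
        raw = data[offset:offset + length]
        # Each segment is converted independently of its neighbours.
        out += raw.decode(encoding).encode(target)
    return bytes(out)


# A buffer that mixes GB2312 and UTF-8 text.
mixed = "你好".encode("gb2312") + "héllo".encode("utf-8")
segments = [(0, 4, "gb2312"), (4, 6, "utf-8")]
converted = convert_segments(mixed, segments)
```

Decoding the whole buffer with a single encoding would either fail or produce mojibake for one of the two parts, which is why the conversion is done per segment.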
Usage
# Detect mixed encodings
enca -L pl -M mixed_file.txt
# Convert to UTF-8
enca -L pl -M -x utf8 mixed_file.txt
# Fine-tuned with buffer and error handling
enca -L pl -M -B 256 -I -x utf8 mixed_file.txt
Implementation
- Chunk-based analysis with segment merging
- Predominant encoding fallback
- Integrated with existing conversion system (iconv/recode/internal)
- Verbose logging for detailed progress
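The segment merging and predominant encoding fallback listed above can be sketched roughly as follows. This is a hedged Python illustration, not the code in this PR: the names `merge_segments` and `predominant_encoding`, the tuple layout, and the choice to weight the predominant encoding by byte count are assumptions. Each input tuple stands for one analysed chunk, with `None` marking a chunk whose encoding could not be determined.

```python
from collections import Counter
from typing import List, Optional, Tuple

# Hypothetical per-chunk result: (offset, length, encoding or None if unknown).
Segment = Tuple[int, int, Optional[str]]


def predominant_encoding(chunks: List[Segment]) -> Optional[str]:
    """Return the encoding covering the most bytes, or None if none was detected."""
    totals: Counter = Counter()
    for _, length, enc in chunks:
        if enc is not None:
            totals[enc] += length
    return totals.most_common(1)[0][0] if totals else None


def merge_segments(chunks: List[Segment], ignore_errors: bool = False) -> List[Segment]:
    """Merge adjacent chunks that share an encoding into larger segments.

    With ignore_errors=True, unknown chunks take the predominant encoding
    (loosely mirroring the behaviour described for -I).
    """
    fallback = predominant_encoding(chunks) if ignore_errors else None
    merged: List[Segment] = []
    for offset, length, enc in chunks:
        if enc is None:
            enc = fallback
        if merged:
            prev_offset, prev_len, prev_enc = merged[-1]
            # Extend the previous segment only if it is contiguous and identical in encoding.
            if prev_enc == enc and prev_offset + prev_len == offset:
                merged[-1] = (prev_offset, prev_len + length, enc)
                continue
        merged.append((offset, length, enc))
    return merged


chunks = [(0, 4, "UTF-8"), (4, 4, "UTF-8"), (8, 4, None), (12, 4, "GB2312")]
strict = merge_segments(chunks)
lenient = merge_segments(chunks, ignore_errors=True)
```

In this sketch a smaller chunk size yields more, finer-grained input tuples at the cost of more classification calls, which matches the buffer-size trade-off described under Features.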
Documentation
- Updated man page and CLI help with examples
- Resolves issue #25: mixed encoding files could not be converted