Add Mixed Encoding Detection and Conversion Support - Resolves #25

Open Egor-OSSRevival opened this issue 3 months ago • 0 comments

Pull Request Description

Overview

This PR adds mixed-encoding support to enca, resolving issue #25, where files containing multiple encodings (e.g., GB2312 + UTF-8) could not be processed.

Features

Mixed Encoding Detection (-M / --mixed-encodings)
Detects multiple encodings within one file and reports each segment with its offset and length.
Configurable Buffer Size (-B / --mixed-buffer-size)
Default is 1024 bytes; valid range is 1–1048576. Smaller buffers give finer-grained detection; larger buffers run faster.
Error Handling (-I / --mixed-ignore-errors)
Skips corrupted or unrecognized segments and falls back to the predominant encoding.
Mixed Encoding Conversion (-x with -M)
Converts each segment individually while preserving file integrity.
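
To illustrate what per-segment conversion could look like, here is a minimal Python sketch. It is not the PR's C implementation. The `(offset, length, encoding)` segment shape, the `convert_segments` name, and the `ignore_errors` / `fallback` parameters are assumptions made for illustration; they only loosely mirror the -I behavior described above.

```python
from typing import List, Tuple

# Hypothetical segment shape: (offset, length, encoding name).
Segment = Tuple[int, int, str]


def convert_segments(
    data: bytes,
    segments: List[Segment],
    target: str = "utf-8",
    ignore_errors: bool = False,
    fallback: str = "utf-8",
) -> bytes:
    """Decode each segment with its own encoding and re-encode to `target`."""
    out = bytearray()
    for offset, length, encoding in segments:
        raw = data[offset:offset + length]
        try:
            text = raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            if not ignore_errors:
                raise
            # Assumed reading of -I: retry with the predominant (fallback)
            # encoding and replace any bytes that still cannot be decoded.
            text = raw.decode(fallback, errors="replace")
        out += text.encode(target)
    return bytes(out)
```

Because every segment is decoded separately, a bad segment only affects its own byte range; the rest of the file is converted as usual.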

Usage

# Detect mixed encodings
enca -L pl -M mixed_file.txt

# Convert to UTF-8
enca -L pl -M -x utf8 mixed_file.txt

# Fine-tuned with buffer and error handling
enca -L pl -M -B 256 -I -x utf8 mixed_file.txt
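
To try the commands above, a small mixed-encoding test file is needed. The following Python sketch builds one from two Polish sentences, the first encoded as ISO-8859-2 and the second as UTF-8. The function names and the sample text are illustrative assumptions and are not part of the PR.

```python
def build_mixed_sample() -> bytes:
    """Return bytes whose first line is ISO-8859-2 and second line is UTF-8."""
    first = "Zażółć gęślą jaźń.\n".encode("iso8859_2")
    second = "Pchnąć w tę łódź jeża.\n".encode("utf-8")
    return first + second


def write_sample(path: str = "mixed_file.txt") -> None:
    """Write the sample in binary mode so no re-encoding takes place."""
    with open(path, "wb") as fh:
        fh.write(build_mixed_sample())
```

Calling `write_sample()` creates `mixed_file.txt` in the current directory. The result is not valid UTF-8 as a whole, which is the situation the -M option is meant to handle.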

Implementation

Chunk-based analysis with segment merging
Predominant encoding fallback
Integrated with existing conversion system (iconv/recode/internal)
Verbose logging for detailed progress
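
The first two points can be sketched roughly as follows. This is a simplified Python illustration under stated assumptions, not the code in this PR: the per-chunk detector is passed in as a hypothetical `detect` callable that returns an encoding name or `None`, and "predominant" is taken to mean the encoding covering the most bytes among recognized chunks.

```python
from collections import Counter
from typing import Callable, List, Optional, Tuple

Segment = Tuple[int, int, str]              # (offset, length, encoding)
Detector = Callable[[bytes], Optional[str]]  # hypothetical per-chunk detector


def detect_segments(data: bytes, detect: Detector,
                    buffer_size: int = 1024) -> List[Segment]:
    if not 1 <= buffer_size <= 1048576:
        raise ValueError("buffer_size must be in 1..1048576")

    # 1. Classify each fixed-size chunk independently.
    chunks = []
    for offset in range(0, len(data), buffer_size):
        chunk = data[offset:offset + buffer_size]
        chunks.append((offset, len(chunk), detect(chunk)))

    # 2. Predominant encoding: the one covering the most recognized bytes.
    weights: Counter = Counter()
    for _, length, enc in chunks:
        if enc is not None:
            weights[enc] += length
    if not weights:
        return []  # empty input, or nothing was recognized at all
    predominant = weights.most_common(1)[0][0]

    # 3. Fill unrecognized chunks with the predominant encoding and merge
    #    adjacent chunks that share an encoding into one segment.
    merged: List[Segment] = []
    for offset, length, enc in chunks:
        enc = enc or predominant
        if merged and merged[-1][2] == enc:
            prev_offset, prev_length, _ = merged[-1]
            merged[-1] = (prev_offset, prev_length + length, enc)
        else:
            merged.append((offset, length, enc))
    return merged
```

Note that fixed-size chunks can split a multi-byte character at a chunk boundary. A real implementation has to account for that; this sketch does not.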

Documentation

Updated man page and CLI help with examples
Resolves issue #25: mixed encoding files could not be converted

Sep 01 '25 20:09 Egor-OSSRevival