Add Mixed Encoding Detection and Conversion Support - Resolves #25

Status: Open. Egor-OSSRevival opened this issue 3 months ago • 0 comments

Pull Request Description

Overview

This PR adds mixed encoding support to enca, resolving issue #25 where files with multiple encodings (e.g., GB2312 + UTF-8) could not be processed.

Features

  • Mixed Encoding Detection (-M / --mixed-encodings)
    Detects multiple encodings within one file and reports each segment with its offset and length.
  • Configurable Buffer Size (-B / --mixed-buffer-size)
    Default 1024 bytes, range 1–1048576. Smaller values give finer-grained detection; larger values run faster.
  • Error Handling (-I / --mixed-ignore-errors)
    Skips corrupted or unknown segments and falls back to the predominant encoding.
  • Mixed Encoding Conversion (-x with -M)
    Converts each segment individually while preserving file integrity.
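
To make the per-segment conversion idea concrete, here is a minimal Python sketch. It is not the PR's C code: the function name `convert_segments`, the `(offset, length, encoding)` tuple layout, and the use of Python's built-in codecs in place of iconv/recode are all assumptions made for illustration. With `ignore_errors=True`, segments that fail to decode are dropped, which is one possible reading of the -I behaviour.

```python
from typing import List, Optional, Tuple

# Hypothetical segment layout: (offset, length, encoding name or None if unknown)
Segment = Tuple[int, int, Optional[str]]


def convert_segments(data: bytes, segments: List[Segment],
                     target: str = "utf-8",
                     ignore_errors: bool = False) -> bytes:
    """Decode each segment with its own encoding and re-encode it to `target`."""
    out = bytearray()
    for offset, length, encoding in segments:
        raw = data[offset:offset + length]
        try:
            if encoding is None:
                raise LookupError("unknown encoding")
            out += raw.decode(encoding).encode(target)
        except (LookupError, UnicodeError):
            # Assumed behaviour: skip the bad segment when errors are ignored,
            # otherwise propagate the failure to the caller.
            if not ignore_errors:
                raise
    return bytes(out)


if __name__ == "__main__":
    mixed = "你好".encode("gb2312") + "héllo".encode("utf-8")
    print(convert_segments(mixed, [(0, 4, "gb2312"), (4, 6, "utf-8")]).decode("utf-8"))
```

Converting segment by segment keeps a decoding failure local to one span, so the rest of the file can still be converted.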

Usage

# Detect mixed encodings
enca -L pl -M mixed_file.txt

# Convert to UTF-8
enca -L pl -M -x utf8 mixed_file.txt

# Fine-tuned with buffer and error handling
enca -L pl -M -B 256 -I -x utf8 mixed_file.txt

Implementation

  • Chunk-based analysis with segment merging
  • Predominant encoding fallback
  • Integrated with existing conversion system (iconv/recode/internal)
  • Verbose logging for detailed progress
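
The first two points can be sketched as follows. This is a rough Python illustration under stated assumptions, not the PR's implementation: `find_segments`, `apply_fallback`, and the pluggable `detect` callable are hypothetical names, and the real detector would be enca's own analyser. "Predominant" is taken here to mean the encoding covering the most bytes, which is also an assumption.

```python
from collections import Counter
from typing import Callable, List, Optional, Tuple

# Hypothetical segment layout: (offset, length, encoding name or None if unknown)
Segment = Tuple[int, int, Optional[str]]


def find_segments(data: bytes,
                  detect: Callable[[bytes], Optional[str]],
                  buffer_size: int = 1024) -> List[Segment]:
    """Classify fixed-size chunks and merge adjacent chunks with the same result."""
    if not 1 <= buffer_size <= 1048576:
        raise ValueError("buffer_size must be in the range 1..1048576")
    segments: List[Segment] = []
    for offset in range(0, len(data), buffer_size):
        chunk = data[offset:offset + buffer_size]
        enc = detect(chunk)
        if segments and segments[-1][2] == enc:
            start, length, _ = segments[-1]
            segments[-1] = (start, length + len(chunk), enc)  # extend previous segment
        else:
            segments.append((offset, len(chunk), enc))
    return segments


def apply_fallback(segments: List[Segment]) -> List[Segment]:
    """Relabel unknown segments with the encoding that covers the most bytes."""
    weights: Counter = Counter()
    for _, length, enc in segments:
        if enc is not None:
            weights[enc] += length
    if not weights:
        return segments  # nothing was detected, so there is nothing to fall back to
    predominant = weights.most_common(1)[0][0]
    merged: List[Segment] = []
    for offset, length, enc in segments:
        enc = enc or predominant
        if merged and merged[-1][2] == enc:
            start, prev_len, _ = merged[-1]
            merged[-1] = (start, prev_len + length, enc)  # re-merge after relabelling
        else:
            merged.append((offset, length, enc))
    return merged
```

A fixed chunk size can split a multibyte character at a chunk boundary, so a real detector has to tolerate truncated sequences at the edges; the sketch leaves that to `detect`.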

Documentation

  • Updated man page and CLI help with examples

Resolves issue #25: files with mixed encodings could not be converted.

Egor-OSSRevival, Sep 01 '25 20:09