feat: Add smart whitespace stripping for CJK text

Open shukebeta opened this issue 2 months ago • 0 comments

Summary

This PR adds an intelligent whitespace stripping feature specifically designed for CJK (Chinese, Japanese, Korean) text, which often contains unwanted spaces and line breaks when extracted from PDFs and scanned documents.

Problem Statement

When performing OCR on CJK documents (especially PDFs), users often encounter:

Unwanted spaces between CJK characters - OCR engines sometimes insert spaces between characters
Soft line breaks - Documents have forced line breaks due to page width constraints, but these are not semantic breaks
Hard line breaks - Legitimate sentence boundaries that should be preserved
Mixed language text - Documents containing both CJK and Latin text need intelligent handling

Solution

This PR implements a smart whitespace stripping algorithm that:

✅ Removes spaces between CJK characters only (preserves English word spacing)
✅ Removes soft line breaks (after non-punctuation characters)
✅ Preserves hard line breaks (after sentence-ending punctuation: 。！？；：)
✅ Handles paragraph breaks (double newlines → single newline)
✅ Works seamlessly with mixed CJK-Latin text

Implementation Details

User Interface

Added a new setting, strip-whitespaces, to user preferences
Added "Strip whitespaces (Smart)" option to Post-processing menu
Option available to all users (not language-restricted)
Detailed tooltip explains the smart behavior

Technical Changes

normcap/gui/settings.py: Add strip-whitespaces setting
normcap/gui/menu_button.py: Add menu option with descriptive tooltip
normcap/detection/ocr/transformer.py: Implement smart stripping algorithm using regex-based CJK detection
Pipeline integration: Pass setting through tray → detector → recognize → transformer
Unit tests: Cover language detection, soft and hard line breaks, CJK and English spacing, paragraph breaks, and mixed text

Algorithm

The algorithm uses Unicode ranges to detect CJK characters and applies different rules:

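Spaces: Remove only between CJK characters, keep around Latin words
Line breaks: Keep the break if the previous character is sentence-ending punctuation (。！？；：), otherwise treat it as a soft break and remove it
Paragraph breaks: Collapse double newlines to a single newline so the paragraph structure is preserved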
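
The three rules above can be sketched roughly as follows. This is an illustration only, not the code in normcap/detection/ocr/transformer.py: the function name `strip_whitespace_smart`, the Unicode ranges, and the choice to turn a soft break next to Latin text into a space (rather than delete it) are all assumptions made for this sketch.

```python
import re

# Assumed ranges: the PR does not list which Unicode blocks it checks.
# CJK punctuation is deliberately left out of this sketch.
CJK = (
    "\u3040-\u30ff"  # Hiragana, Katakana
    "\u3400-\u4dbf"  # CJK Unified Ideographs Extension A
    "\u4e00-\u9fff"  # CJK Unified Ideographs
    "\uac00-\ud7af"  # Hangul Syllables
)
HARD_BREAK = "。！？；："  # sentence-ending punctuation named in the PR

_PARAGRAPH_BREAK = re.compile(r"\n\s*\n")
_PADDED_NEWLINE = re.compile(r"[ \t]*\n[ \t]*")
_BREAK_BETWEEN_CJK = re.compile(rf"(?<=[{CJK}])\n(?=[{CJK}])")
_OTHER_SOFT_BREAK = re.compile(rf"(?<![{HARD_BREAK}])\n")
_SPACE_BETWEEN_CJK = re.compile(rf"(?<=[{CJK}])[ \t]+(?=[{CJK}])")


def strip_whitespace_smart(text: str) -> str:
    """Hypothetical helper illustrating the rules described above."""
    paragraphs = []
    for paragraph in _PARAGRAPH_BREAK.split(text.strip()):
        # Drop spaces and tabs that hug a line break.
        paragraph = _PADDED_NEWLINE.sub("\n", paragraph.strip())
        # Soft break between two CJK characters: join the lines directly.
        paragraph = _BREAK_BETWEEN_CJK.sub("", paragraph)
        # Any other break not preceded by sentence-ending punctuation
        # becomes a space (assumption: keeps Latin words apart).
        paragraph = _OTHER_SOFT_BREAK.sub(" ", paragraph)
        # Spaces are removed only when both neighbours are CJK.
        paragraph = _SPACE_BETWEEN_CJK.sub("", paragraph)
        paragraphs.append(paragraph)
    # Paragraph breaks collapse to a single newline.
    return "\n".join(paragraphs)
```

Splitting on paragraph breaks first keeps a double newline from being mistaken for a soft break when the paragraph does not end in punctuation.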

Test Coverage

Added comprehensive unit tests in test_transformer.py:

Language detection (Chinese, Japanese, Korean, mixed)
Soft/hard line break handling
Space removal between CJK characters
Space preservation around English words
Paragraph break handling
Mixed text scenarios

All tests pass ✅
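
For readers who want a feel for the language-detection cases listed above, a test might look roughly like this. The helper `contains_cjk`, its Unicode ranges, and the sample strings are assumptions for illustration; they are not taken from test_transformer.py.

```python
import re

# Assumed Unicode ranges (Hiragana/Katakana, CJK ideographs, Hangul);
# the actual ranges used by the PR are not stated in this description.
_CJK_CHAR = re.compile("[\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff\uac00-\ud7af]")


def contains_cjk(text: str) -> bool:
    """Hypothetical helper: True if any character falls in the assumed ranges."""
    return _CJK_CHAR.search(text) is not None


# (input, expected) pairs mirroring the detection cases named above.
CASES = [
    ("这是中文", True),            # Chinese
    ("これは日本語です", True),     # Japanese
    ("한국어입니다", True),         # Korean
    ("这是 English 混排", True),   # mixed CJK and Latin
    ("plain English text", False),  # Latin only
]


def test_contains_cjk() -> None:
    for text, expected in CASES:
        assert contains_cjk(text) is expected, text
```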

Example Usage

Before (OCR output with artifacts):

这是 第一 段 文字
继续 在 这里
这是第二段。

这是 English word 混排 的 情况

After (with smart stripping enabled):

这是第一段文字继续在这里这是第二段。
这是 English word 混排的情况

Benefits

🎯 Removes spurious spaces and soft line breaks from OCR results for CJK documents
🌏 Helps Chinese, Japanese, and Korean users
🔤 Works correctly with mixed-language documents
📄 Especially useful for PDF document OCR
⚡ Zero impact when disabled (default off)

Compatibility

No breaking changes
Feature is opt-in (disabled by default)
Works with existing language detection and OCR pipeline
Compatible with all supported platforms

Related Issues

This addresses common complaints about CJK OCR quality and unwanted whitespace in extracted text.

Testing: Ubuntu 24.04 with Python 3.10, Tesseract 5.3.4, and various Chinese documents.

Oct 13 '25 07:10 shukebeta