feat: Add smart whitespace stripping for CJK text
Summary
This PR adds an intelligent whitespace stripping feature specifically designed for CJK (Chinese, Japanese, Korean) text, which often contains unwanted spaces and line breaks when extracted from PDFs and scanned documents.
Problem Statement
When performing OCR on CJK documents (especially PDFs), the extracted text typically contains a mix of:
- Unwanted spaces between CJK characters - OCR engines sometimes insert spaces between characters
- Soft line breaks - Documents have forced line breaks due to page width constraints, but these are not semantic breaks
- Hard line breaks - Legitimate sentence boundaries that should be preserved
- Mixed language text - Documents containing both CJK and Latin text need intelligent handling
Solution
This PR implements a smart whitespace stripping algorithm that:
✅ Removes spaces between CJK characters only (preserves English word spacing)
✅ Removes soft line breaks (line breaks not preceded by sentence-ending punctuation)
✅ Preserves hard line breaks (after sentence-ending punctuation: 。!?;:)
✅ Handles paragraph breaks (double newlines → single newline)
✅ Works seamlessly with mixed CJK-Latin text
Implementation Details
User Interface
- Added new setting strip-whitespaces to user preferences
- Added "Strip whitespaces (Smart)" option to Post-processing menu
- Option available to all users (not language-restricted)
- Detailed tooltip explains the smart behavior
Technical Changes
- normcap/gui/settings.py: Add strip-whitespaces setting
- normcap/gui/menu_button.py: Add menu option with descriptive tooltip
- normcap/detection/ocr/transformer.py: Implement smart stripping algorithm using regex-based CJK detection
- Pipeline integration: Pass setting through tray → detector → recognize → transformer
- Unit tests: Cover the soft/hard line break, space removal, paragraph break, and mixed-text cases listed under Test Coverage
Algorithm
The algorithm uses Unicode ranges to detect CJK characters and applies different rules:
- Spaces: Remove only between CJK chars, keep around Latin words
- Line breaks: Check previous character for sentence-ending punctuation
- Paragraph breaks: Collapse double newlines to a single newline so paragraph boundaries are preserved
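The three rules above can be sketched roughly as follows. This is a minimal illustration, not the code in transformer.py: the function name strip_cjk_whitespace, the helper _join_lines, the exact Unicode ranges, and the choice to join soft-wrapped Latin lines with a space are all assumptions made for the example.

```python
import re

# Assumed character ranges; the PR may use a different set.
CJK = (
    "\u3001-\u303f"  # CJK symbols and punctuation (U+3000 space excluded)
    "\u3040-\u30ff"  # Hiragana and Katakana
    "\u3400-\u4dbf"  # CJK Unified Ideographs Extension A
    "\u4e00-\u9fff"  # CJK Unified Ideographs
    "\uac00-\ud7af"  # Hangul syllables
    "\uf900-\ufaff"  # CJK compatibility ideographs
    "\uff01-\uff60"  # Fullwidth forms
)
# Sentence-ending punctuation, fullwidth and ASCII variants (assumed set).
SENTENCE_END = "。！？；：!?;:"

CJK_CHAR = re.compile(f"[{CJK}]")
# Spaces, tabs, or ideographic spaces with a CJK character on both sides.
SPACE_BETWEEN_CJK = re.compile(f"(?<=[{CJK}])[ \t\u3000]+(?=[{CJK}])")


def _join_lines(lines: list[str]) -> str:
    """Join the non-empty lines of one paragraph."""
    out = lines[0]
    for line in lines[1:]:
        if out[-1] in SENTENCE_END:
            out += "\n" + line  # hard break: keep it
        elif CJK_CHAR.match(out[-1]) or CJK_CHAR.match(line[0]):
            out += line  # soft break next to CJK text: drop it
        else:
            out += " " + line  # soft break between Latin words (assumption)
    return out


def strip_cjk_whitespace(text: str) -> str:
    """Hypothetical entry point applying the three rules in order."""
    paragraphs = []
    # Blank lines separate paragraphs; they collapse to a single newline.
    for block in re.split(r"\n\s*\n", text):
        lines = [ln.strip() for ln in block.split("\n") if ln.strip()]
        if lines:
            paragraphs.append(_join_lines(lines))
    # Finally remove spaces that sit between two CJK characters.
    return SPACE_BETWEEN_CJK.sub("", "\n".join(paragraphs))
```

Line breaks are resolved before spaces are removed, so a space left at a former line boundary between CJK characters is also cleaned up. Spaces next to Latin letters are never matched because both lookarounds require a CJK character.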
Test Coverage
Added comprehensive unit tests in test_transformer.py:
- Language detection (Chinese, Japanese, Korean, mixed)
- Soft/hard line break handling
- Space removal between CJK characters
- Space preservation around English words
- Paragraph break handling
- Mixed text scenarios
All tests pass ✅
Example Usage
Before (OCR output with artifacts):
这是 第一 段 文字
继续 在 这里
这是第二段。
这是 English word 混排 的 情况
After (with smart stripping enabled):
这是第一段文字继续在这里这是第二段。
这是 English word 混排的情况
Benefits
- 🎯 Produces cleaner OCR output for CJK documents by removing stray spaces and soft line breaks
- 🌏 Helps Chinese, Japanese, and Korean users
- 🔤 Works correctly with mixed-language documents
- 📄 Especially useful for PDF document OCR
- ⚡ Zero impact when disabled (default off)
Compatibility
- No breaking changes
- Feature is opt-in (disabled by default)
- Works with existing language detection and OCR pipeline
- Compatible with all supported platforms
Related Issues
This addresses common complaints about CJK OCR quality and unwanted whitespace in extracted text.
Testing: Tested on Ubuntu 24.04 with Python 3.10, Tesseract 5.3.4, and various Chinese documents.