
feat: Add smart whitespace stripping for CJK text

Open shukebeta opened this issue 2 months ago • 0 comments

Summary

This PR adds an intelligent whitespace stripping feature specifically designed for CJK (Chinese, Japanese, Korean) text, which often contains unwanted spaces and line breaks when extracted from PDFs and scanned documents.

Problem Statement

When performing OCR on CJK documents (especially PDFs), users often encounter:

  1. Unwanted spaces between CJK characters - OCR engines sometimes insert spaces between characters
  2. Soft line breaks - Documents have forced line breaks due to page width constraints, but these are not semantic breaks
  3. Hard line breaks - Legitimate sentence boundaries that should be preserved
  4. Mixed language text - Documents containing both CJK and Latin text need intelligent handling

Solution

This PR implements a smart whitespace stripping algorithm that:

  • ✅ Removes spaces between CJK characters only (preserves English word spacing)
  • ✅ Removes soft line breaks (after non-punctuation characters)
  • ✅ Preserves hard line breaks (after sentence-ending punctuation: 。!?;:)
  • ✅ Handles paragraph breaks (double newlines → single newline)
  • ✅ Works seamlessly with mixed CJK-Latin text

Implementation Details

User Interface

  • Added new setting strip-whitespaces to user preferences
  • Added "Strip whitespaces (Smart)" option to Post-processing menu
  • Option available to all users (not language-restricted)
  • Detailed tooltip explains the smart behavior

Technical Changes

  • normcap/gui/settings.py: Add strip-whitespaces setting
  • normcap/gui/menu_button.py: Add menu option with descriptive tooltip
  • normcap/detection/ocr/transformer.py: Implement smart stripping algorithm using regex-based CJK detection
  • Pipeline integration: Pass setting through tray → detector → recognize → transformer
  • Unit tests: Cover the cases listed under Test Coverage

Algorithm

The algorithm uses Unicode ranges to detect CJK characters and applies different rules:

  • Spaces: Remove only between CJK chars, keep around Latin words
  • Line breaks: Check previous character for sentence-ending punctuation
  • Paragraph breaks: Collapse double newlines into a single newline, so paragraph boundaries survive as line breaks
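
The three rules above can be sketched roughly as follows. This is a minimal illustration, not the code from normcap/detection/ocr/transformer.py: the function name strip_whitespaces, the Unicode ranges, the inclusion of full-width punctuation variants, and the choice to join non-CJK lines with a space are all assumptions made for the example.

```python
import re

# Assumed Unicode ranges; the PR description does not list the ones it uses.
CJK = (
    "\u3001-\u303f"  # CJK punctuation (ideographic space U+3000 excluded)
    "\u3040-\u30ff"  # Hiragana and Katakana
    "\u3400-\u4dbf"  # CJK Unified Ideographs Extension A
    "\u4e00-\u9fff"  # CJK Unified Ideographs
    "\uac00-\ud7af"  # Hangul syllables
    "\uff00-\uffef"  # half-width and full-width forms
)
CJK_CHAR = re.compile(f"[{CJK}]")
# Spaces or tabs with a CJK character directly on both sides.
SPACE_BETWEEN_CJK = re.compile(rf"(?<=[{CJK}])[ \t]+(?=[{CJK}])")
# One or more newlines together with any spaces or tabs around them.
LINE_BREAK = re.compile(r"[ \t]*\n[ \t\n]*")
# Punctuation named in the PR description plus full-width variants (assumed).
SENTENCE_END = "。！？；：!?;:"


def _replace_break(match: re.Match) -> str:
    """Decide what a run of line breaks should become."""
    if match.group(0).count("\n") > 1:
        return "\n"  # paragraph break: collapse to a single newline
    # The text is stripped beforehand, so both neighbours always exist.
    prev_char = match.string[match.start() - 1]
    next_char = match.string[match.end()]
    if prev_char in SENTENCE_END:
        return "\n"  # hard break after sentence-ending punctuation
    if CJK_CHAR.match(prev_char) and CJK_CHAR.match(next_char):
        return ""  # soft break between CJK characters: join the lines
    return " "  # assumption: keep Latin words apart when joining lines


def strip_whitespaces(text: str) -> str:
    """Hypothetical smart stripping for mixed CJK/Latin OCR output."""
    text = text.replace("\r\n", "\n").strip()
    text = SPACE_BETWEEN_CJK.sub("", text)
    return LINE_BREAK.sub(_replace_break, text)
```

Because the space rule uses lookarounds, a space is dropped only when both neighbours are CJK characters, which is what leaves "这是 English word" intact. Joining non-CJK lines with a space rather than nothing is a design choice of this sketch; the PR description does not say how soft breaks in Latin text are handled.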

Test Coverage

Added comprehensive unit tests in test_transformer.py:

  • Language detection (Chinese, Japanese, Korean, mixed)
  • Soft/hard line break handling
  • Space removal between CJK characters
  • Space preservation around English words
  • Paragraph break handling
  • Mixed text scenarios
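
As an illustration of what the language-detection cases might check, here is a small self-contained sketch. The helper detect_scripts and its ranges are hypothetical and do not come from test_transformer.py. Han characters alone cannot distinguish Chinese from Japanese, so the sketch reports scripts rather than languages.

```python
import re

# Hypothetical per-script patterns; the PR's actual ranges are not shown.
HAN = re.compile(r"[\u3400-\u4dbf\u4e00-\u9fff]")
KANA = re.compile(r"[\u3040-\u30ff]")
HANGUL = re.compile(r"[\uac00-\ud7af]")


def detect_scripts(text: str) -> set[str]:
    """Return the CJK scripts present in text (empty set if none)."""
    found = set()
    if HAN.search(text):
        found.add("han")
    if KANA.search(text):
        found.add("kana")
    if HANGUL.search(text):
        found.add("hangul")
    return found


# Cases mirroring the list above: Chinese, Japanese, Korean, mixed, Latin only.
assert detect_scripts("这是中文") == {"han"}
assert detect_scripts("これは日本語") == {"han", "kana"}
assert detect_scripts("한국어") == {"hangul"}
assert detect_scripts("这是 English") == {"han"}
assert detect_scripts("plain English") == set()
```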

All tests pass ✅

Example Usage

Before (OCR output with artifacts):

这是 第一 段 文字
继续 在 这里
这是第二段。

这是 English word 混排 的 情况

After (with smart stripping enabled):

这是第一段文字继续在这里这是第二段。
这是 English word 混排的情况

Benefits

  • 🎯 Removes spurious spaces and soft line breaks from OCR output for CJK documents
  • 🌏 Helps Chinese, Japanese, and Korean users
  • 🔤 Works correctly with mixed-language documents
  • 📄 Especially useful for PDF document OCR
  • ⚡ Zero impact when disabled (default off)

Compatibility

  • No breaking changes
  • Feature is opt-in (disabled by default)
  • Works with existing language detection and OCR pipeline
  • Compatible with all supported platforms

Related Issues

This addresses common complaints about CJK OCR quality and unwanted whitespace in extracted text.


Testing: Tested on Ubuntu 24.04 with Python 3.10, Tesseract 5.3.4, and various Chinese documents.

shukebeta, Oct 13 '25 07:10