Pull Request: Fix Auto-Splitting of French Accented Words in Text Recognition

📋 Summary

This PR fixes a bug where French words containing accented characters (é, è, à, ç, etc.) and contractions (n'êtes, l'été) were incorrectly split at each accented character during the word-grouping step of OCR text recognition.

🐛 Problem Description

Issue

The BaseRecLabelDecode.get_word_info() method in ppocr/postprocess/rec_postprocess.py recognized only basic ASCII letters (a-z, A-Z) as word characters. Accented characters used in French and other Latin-script languages were classified as "splitters", so words containing them were broken apart.
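To make the failure mode concrete, the sketch below applies an ASCII-only letter check to two of the affected words. The helper names and the exact pattern are assumptions chosen to mirror the description above; they are not copied from rec_postprocess.py.

```python
import re

# Assumed stand-in for the old check: only ASCII letters count as word
# characters. The real pattern inside get_word_info() may differ.
ASCII_LETTER = re.compile(r"[a-zA-Z]")


def is_ascii_word_char(char: str) -> bool:
    """Return True only for a single ASCII letter."""
    return ASCII_LETTER.fullmatch(char) is not None


def rejected_chars(text: str) -> list[str]:
    """List the characters an ASCII-only check would treat as splitters."""
    return [c for c in text if not is_ascii_word_char(c)]


print(rejected_chars("français"))  # ['ç']
print(rejected_chars("été"))       # ['é', 'é']
```

Every character in the returned list is a point where an ASCII-only grouper would end the current word.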

Example of the Bug

Before the fix:

Input: "été" (summer)
Output: 3 separate words: ["é", "t", "é"] ❌
Input: "français" (French)
Output: 3 separate words: ["fran", "ç", "ais"] ❌
Input: "n'êtes" (you are)
Output: 3 separate words: ["n", "'", "êtes"] ❌

After the fix:

Input: "été" → Output: 1 word: ["été"] ✅
Input: "français" → Output: 1 word: ["français"] ✅
Input: "n'êtes" → Output: 1 word: ["n'êtes"] ✅

✨ Solution

Changes Made

- Added the unicodedata import for Unicode character category detection
- Implemented an is_latin_char() helper function that identifies Latin letters with diacritics
- Modified the get_word_info() method to include accented characters in the word-grouping logic
- Added apostrophe handling for French contractions
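The grouping and apostrophe changes can be sketched as follows. This is a simplified illustration, not the PR's get_word_info(): the function name group_words, the APOSTROPHES set, and the default str.isalnum predicate are all assumptions, and the real method also tracks column positions and word states that are omitted here.

```python
from typing import Callable, List

# Straight and typographic apostrophes (assumed set, for illustration).
APOSTROPHES = {"'", "\u2019"}


def group_words(
    text: str, is_word_char: Callable[[str], bool] = str.isalnum
) -> List[str]:
    """Group characters into words, keeping apostrophes that sit inside a word."""
    words: List[str] = []
    current: List[str] = []
    for i, char in enumerate(text):
        keep = is_word_char(char)
        # An apostrophe joins two word parts only when a word is already
        # open and the next character is itself a word character.
        if (
            char in APOSTROPHES
            and current
            and i + 1 < len(text)
            and is_word_char(text[i + 1])
        ):
            keep = True
        if keep:
            current.append(char)
        elif current:
            # Any other character closes the word in progress.
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words


print(group_words("n'êtes pas"))     # ["n'êtes", 'pas']
print(group_words("C'était l'été"))  # ["C'était", "l'été"]
```

Requiring a word character on both sides keeps quotation marks at the edges of a word from being absorbed into it.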

Technical Details

The fix uses Python's unicodedata module to check whether a character belongs to a Letter category (L*) and has a Latin or French-based Unicode name. This ensures that characters such as:

- é (LATIN SMALL LETTER E WITH ACUTE)
- è (LATIN SMALL LETTER E WITH GRAVE)
- à (LATIN SMALL LETTER A WITH GRAVE)
- ç (LATIN SMALL LETTER C WITH CEDILLA)

...are correctly recognized as word characters.
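A minimal sketch of such a check is shown below. It reuses the is_latin_char() name from this PR, but the body is an assumption based on the description above; the merged implementation may test the name or category differently.

```python
import unicodedata


def is_latin_char(char: str) -> bool:
    """Return True for a single letter whose Unicode name contains 'LATIN'."""
    if len(char) != 1:
        return False
    # General categories starting with "L" are letters (Lu, Ll, Lt, Lm, Lo).
    if not unicodedata.category(char).startswith("L"):
        return False
    # name() accepts a default, so unnamed code points do not raise ValueError.
    return "LATIN" in unicodedata.name(char, "")


for sample in ["é", "ç", "ß", "a", "'", "7", "中"]:
    print(repr(sample), is_latin_char(sample))
```

The first four samples are Latin letters; the apostrophe, the digit, and the CJK ideograph are not. A check written this way would also reject a standalone combining accent (category Mn), so text in decomposed form may need unicodedata.normalize("NFC", text) first; the PR description does not say whether normalization is applied.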

📁 Files Modified

Core Changes

ppocr/postprocess/rec_postprocess.py
- Added unicodedata import
- Added is_latin_char() function
- Modified BaseRecLabelDecode.get_word_info() method

Test Files

test_french_accents.py (new)
- Comprehensive test suite for French accented character handling
- Tests various scenarios: simple accents, contractions, mixed text

🧪 Testing

Test Coverage

The included test script validates:

- Simple accented words: été, élève
- Words with ç: français
- Contractions with apostrophes: n'êtes, C'était
- Words with à: à demain
- Complex sentences with multiple accents

Running Tests

python test_french_accents.py
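The contents of test_french_accents.py are not reproduced here. As a rough idea of how the listed cases could be checked, the sketch below pairs each input with an expected grouping and compares a grouper against that table. The CASES table, the regex-based reference_group(), and failing_cases() are all hypothetical; a real test would call BaseRecLabelDecode.get_word_info() instead of the regex stand-in.

```python
import re
from typing import Callable, Dict, List, Tuple

# Expected groupings for the inputs named in the coverage list (assumed).
CASES: Dict[str, List[str]] = {
    "été": ["été"],
    "élève": ["élève"],
    "français": ["français"],
    "n'êtes": ["n'êtes"],
    "C'était": ["C'était"],
    "à demain": ["à", "demain"],
}

# Stand-in grouper: runs of Unicode letters, optionally joined by apostrophes.
_WORD = re.compile(r"[^\W\d_]+(?:['\u2019][^\W\d_]+)*")


def reference_group(text: str) -> List[str]:
    """Split text into words with a Unicode-aware regular expression."""
    return _WORD.findall(text)


def failing_cases(
    group: Callable[[str], List[str]]
) -> List[Tuple[str, List[str]]]:
    """Return (input, actual) pairs where the grouper disagrees with CASES."""
    return [
        (text, group(text)) for text, want in CASES.items() if group(text) != want
    ]


print(failing_cases(reference_group))                            # []
print(len(failing_cases(lambda t: re.findall("[a-zA-Z]+", t))))  # 6
```

The second call runs an ASCII-only grouper through the same table, and it fails every case, which is the behavior this PR sets out to fix.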

🔄 Backward Compatibility

✅ Fully backward compatible

This fix:

- Only adds new functionality (recognition of accented characters)
- Does not change behavior for existing ASCII text
- Does not modify the API or function signatures
- Uses only the standard library (unicodedata), so there are no new dependencies

All existing functionality remains unchanged: code that worked before continues to work as before, with the added benefit of proper support for French and other Latin-script languages.

🌍 Impact

Languages Benefited

This fix improves OCR text recognition for Latin-script languages that use diacritics, including:

- French: é, è, ê, à, â, ù, û, ç, ï, etc.
- Spanish: á, é, í, ó, ú, ñ, ü
- Portuguese: ã, õ, á, é, í, ó, ú, â, ê, ô, ç
- German: ä, ö, ü, ß
- Italian: à, è, é, ì, ò, ù
- And many more...

Use Cases

- Document digitization in French-speaking regions
- Multilingual OCR applications
- Legal and administrative document processing
- Educational material processing
- International business document handling

📊 Performance Impact

Negligible performance impact:

- The is_latin_char() function is only called for non-ASCII characters
- Uses efficient unicodedata standard library functions
- No additional loops or complex operations
- Same time complexity as the original implementation

🔍 Code Quality

✅ Passes all pre-commit hooks:

- black (code formatting)
- flake8 (linting)
- trailing whitespace check
- line ending normalization

📝 Related Issues

This fix addresses the issue where text in French and other Latin-script languages is incorrectly segmented during OCR post-processing, improving the accuracy and usability of PaddleOCR for international users.

✅ Checklist

[x] Code follows project style guidelines
[x] Self-review completed
[x] Comments added for complex logic
[x] No breaking changes
[x] Test script included
[x] Documentation updated (this PR doc)
[x] All pre-commit hooks pass

🙏 Acknowledgments

This fix was developed and tested on real-world French OCR scenarios, ensuring practical applicability and effectiveness.

Ready for review and merge! 🚀

Nov 06 '25 10:11 Ihebdhouibi

All committers have signed the CLA.

Nov 06 '25 10:11 CLAassistant

Thanks for your contribution!

Nov 06 '25 10:11 paddle-bot[bot]

The test failure appears to be unrelated to this PR. The error is: ModuleNotFoundError: No module named 'langchain.docstore'

This is occurring in PaddleX's retriever module (paddlex/inference/pipelines/components/retriever/base.py:25), which is trying to import:

from langchain.docstore.document import Document

LangChain deprecated this import path and moved the class to:

from langchain_core.documents import Document
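One way a caller could tolerate both layouts is to try the newer path first and fall back to the older one. The helper below is a hypothetical sketch of that idea using only the standard library; import_first() is not part of PaddleX or LangChain, and whether PaddleX adopts a fix like this is not stated in this thread.

```python
import importlib
from typing import Any, Sequence


def import_first(module_paths: Sequence[str], attribute: str) -> Any:
    """Return `attribute` from the first importable module that provides it."""
    for path in module_paths:
        try:
            module = importlib.import_module(path)
        except ImportError:
            # Also covers ModuleNotFoundError, which subclasses ImportError.
            continue
        if hasattr(module, attribute):
            return getattr(module, attribute)
    raise ImportError(
        f"{attribute!r} not found in any of: {', '.join(module_paths)}"
    )


# Intended use (requires LangChain to be installed, so left commented out):
# Document = import_first(
#     ["langchain_core.documents", "langchain.docstore.document"], "Document"
# )
```

Listing the newer module first means the deprecated path is only touched on installations that still ship it.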

See the API reference: https://reference.langchain.com/python/integrations/langchain_google_community/?h=document#langchain_google_community.DocumentAIWarehouseRetriever

Nov 25 '25 13:11 Ihebdhouibi
