Fix i18n fallback mechanism for empty locale files
Problem
When running gutenberg2zim --lang ko with an empty ko.json file ({}), the scraper was returning the raw translation key "metadata_defaults.title" instead of falling back to the English translation from en.json.
Investigation
After deep investigation of the python-i18n library source code:
- The library's fallback mechanism is correctly implemented
- Fallback works correctly in isolated tests
- Fallback doesn't work reliably in our scraper's execution context
- The fallback locale was None instead of "en" when accessed in the scraper
Solution
Implemented a manual fallback mechanism in our t() wrapper function that:
- Defaults fallback locale to "en" if it's None
- Explicitly switches to the fallback locale when translation is missing
- Forces a search for the translation in the fallback locale
- Restores the original locale after lookup
Changes
- Modified scraper/src/gutenberg2zim/i18n.py:
  - Added manual fallback logic in t() function
  - Defaults fallback locale to "en" if None
  - Explicitly handles locale switching for fallback lookups
  - Added thread lock to prevent race conditions when t() is called from multiple threads (scraper uses multiprocessing.dummy.Pool for concurrent processing)
Testing
Tested with:
- Empty ko.json file ({})
- Missing ko.json file
- Both scenarios correctly fall back to English translations
Result
Empty or missing locale files now correctly fall back to English translations:
INFO:Manual fallback: 'metadata_defaults.title' not found in 'ko', using 'Project Gutenberg Library' from 'en'
INFO: Writing kor ZIM for Project Gutenberg Library
Fixes #360
I'm sorry, but I still don't get how you can say that python-i18n works fine but python-i18n's fallback doesn't work in our context. This seems quite contradictory. To me, there is a bug somewhere, either in our usage or in python-i18n. I'm not in favor at all of merging this complex code which looks way more like a hack than the real solution.
Yeah, you're right, this implementation is indeed a hack. I wasn't in favor of merging it directly into main either, which is why I opened the PR so we could figure this out together.
A simpler approach I was thinking of is to just ensure the fallback is always set: check if i18n.get("fallback") is None and set it to "en" in both change_locale() and t(), then let python-i18n's built-in fallback mechanism do its job.
This removes all the complex stuff which I implemented:
- Thread locks
- Manual locale switching
- Complex try/except/finally blocks
- resource_loader import
Let me try this approach and push it, and then you can tell me which one looks better to you. Is this okay?
I've pushed the simplified version. Could you take a look and see if it looks good enough to merge? It does seem to solve the issue we were having.
Regarding your concern about a bug in python-i18n itself: when I did a thorough investigation, I saw that the library might have a design flaw where, if the fallback configuration becomes None, the fallback mechanism breaks. The library seems to assume fallback will always be a valid locale string, and it doesn't appear to handle None gracefully. When fallback is None, the library would try to call the translation function with locale=None, which could break the lookup.
What I did is that I created a simple helper function that just checks if fallback is None and sets it back to "en" if needed. I call this check in both change_locale() and t() functions, so we're being defensive about it. This way we ensure the library's assumption is always met, and then we let python-i18n's built-in fallback mechanism do its job naturally.
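As a rough sketch of that guard (not the exact code pushed to the PR): ensure_fallback and FakeConfig are hypothetical names, and FakeConfig only stands in for the get/set accessors that the real check would call on the i18n module.

```python
DEFAULT_FALLBACK = "en"


class FakeConfig:
    """Hypothetical stand-in exposing get/set like the i18n config accessors."""

    def __init__(self):
        self._settings = {"fallback": None}  # the state observed in the scraper

    def get(self, key):
        return self._settings[key]

    def set(self, key, value):
        self._settings[key] = value


def ensure_fallback(config, default=DEFAULT_FALLBACK):
    """Reset the fallback locale to the default if it has become None."""
    if config.get("fallback") is None:
        config.set("fallback", default)
    return config.get("fallback")


config = FakeConfig()
ensure_fallback(config)  # fallback is now "en"
```

Calling it at the top of both change_locale() and t() keeps the check idempotent: if the fallback is already a locale string, nothing is changed.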
The fix is pretty minimal, just a few lines, and it should address the root cause rather than working around it. It might also fix the main branch issue where has_strict_translation() was preventing python-i18n's built-in fallback from working.
Let me know what you think. Does this approach look good to you?
Superseded by https://github.com/openzim/gutenberg/pull/383