gutenberg Wrong ZIM title

Command used:

gutenberg2zim --lang ko

Nov 28 '25 07:11 benoit74

@benoit74 I think I’ve tracked down the “metadata_defaults.title” issue for the --lang ko build.

The title seems to come from build_zimfile() in scraper/src/gutenberg2zim/zim.py, which does i18n.change_locale(metadata_locales_lang) and then title = title or i18n.t("metadata_defaults.title").

For --lang ko, we resolve ko → kor and then back to ko for the locale, but there doesn’t appear to be a locales/ko.json. So i18n.t might just be returning the key string (metadata_defaults.title) instead of a real title, while the description looks OK because we pass an explicit fallback string.

Maybe we could add a locales/ko.json file with at least a metadata_defaults block (e.g. “Project Gutenberg Library (Korean)” / “All books in Korean from the first producer of free Ebooks”), similar to locales/en.json and locales/mul.json.

As an extra hardening step, we might also add a generic fallback in code like title = title or i18n.t("metadata_defaults.title", "Project Gutenberg Library") so we never surface the raw key even if a locale file is missing, but the main correctness fix would be to provide a proper ko locale entry.
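A minimal sketch of that hardening idea, written so it does not depend on the i18n library's actual signature. The helper name translate_or_default is hypothetical, and fake_t is only a stand-in for i18n.t that returns the key when nothing is found:

```python
from typing import Callable


def translate_or_default(translate: Callable[[str], str], key: str, default: str) -> str:
    """Return the translation for `key`, or `default` when the lookup
    hands back the raw key (or an empty string)."""
    value = translate(key)
    if not value or value == key:
        return default
    return value


# Stand-in translator (assumption, for illustration only): it returns the key
# itself when it has no translation, like the behavior seen in the logs.
def fake_t(key: str) -> str:
    translations = {"metadata_defaults.description": "All books in Korean"}
    return translations.get(key, key)


title = translate_or_default(fake_t, "metadata_defaults.title", "Project Gutenberg Library")
description = translate_or_default(fake_t, "metadata_defaults.description", "unused default")
```

With this, a missing title key yields the default string, while a key that does resolve is returned unchanged.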

Dec 15 '25 11:12 VikramAditya33

i18n.t is supposed to always fall back to the en locale, where the key is present:

https://github.com/openzim/gutenberg/blob/809c51467af74e9c482ce5f475298e52115f25e3/scraper/src/gutenberg2zim/i18n.py#L12

There is something missing in your analysis. Is the problem that we do not have any ko.json file at all, while an empty one would work?

Dec 15 '25 14:12 benoit74

Ah yeah, you're right about the fallback; I missed that. Without the file, it seems to just return the key string instead of falling back to "en", so creating an empty ko.json should fix it.

Dec 16 '25 01:12 VikramAditya33

OK, can you please confirm it does fix it?

If it does, can you please investigate whether this is a known issue or expected behavior of the i18n library? You might find an issue about it upstream.

If not, we should probably report the bug upstream (because to me it is a bug).

If they say it is not a bug but expected behavior, then we will need to do two things:

create empty files for all supported languages (languages currently present in the Gutenberg dataset; you can check this easily with the CSV the scraper retrieves at startup)
add a failsafe check somewhere that fails the scraper if the lang passed to i18n init does not have its JSON file
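The two steps could look roughly like the sketch below. All function names are hypothetical, and the locales directory is assumed to hold one `<lang>.json` file per language; where the language list comes from (the CSV) is left out:

```python
import json
from pathlib import Path


def missing_locale_files(locales_dir: Path, langs: list[str]) -> list[str]:
    """List the languages that have no <lang>.json in locales_dir."""
    return sorted(
        lang for lang in set(langs) if not (locales_dir / f"{lang}.json").is_file()
    )


def create_empty_locale_files(locales_dir: Path, langs: list[str]) -> None:
    """Write an empty JSON object for every language lacking a file."""
    for lang in missing_locale_files(locales_dir, langs):
        (locales_dir / f"{lang}.json").write_text(json.dumps({}) + "\n", encoding="utf-8")


def ensure_locale_file(locales_dir: Path, lang: str) -> None:
    """Failsafe: stop early if the requested language has no locale file."""
    path = locales_dir / f"{lang}.json"
    if not path.is_file():
        raise FileNotFoundError(f"No locale file for '{lang}': expected {path}")
```

Calling ensure_locale_file() right before the i18n setup would turn a silent wrong title into an immediate, explicit failure.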

Dec 16 '25 08:12 benoit74

@benoit74 I tested the fix, but unfortunately the empty ko.json file approach isn't working. I created the empty file (tried both {} and {"metadata_defaults": {}}), but when I run the scraper with --lang ko, the log still shows Writing kor ZIM for metadata_defaults.title instead of falling back to "en".

I tried implementing the failsafe check and the script to generate missing locale files, but the fallback mechanism doesn't seem to be triggering even with an empty file present.

Dec 17 '25 04:12 VikramAditya33

P.S. The fix does work, hehe, when we add the metadata_defaults structure with actual values to ko.json (copied from en.json). I tested it and the log now shows Writing kor ZIM for Project Gutenberg Library instead of metadata_defaults.title.

Dec 17 '25 04:12 VikramAditya33

P.S. For other languages, we'll need to ensure their locale files also have metadata_defaults keys. Most already do, but any missing ones should be updated similarly.

Dec 17 '25 04:12 VikramAditya33

I don't get this. The whole purpose of the fallback language is to not have to populate all keys in all languages. Either there is a bug in the i18n dependency that we should report, or there is something we are not doing the right way.

Dec 18 '25 08:12 benoit74

Hi @benoit74, after investigating the python-i18n library source code and running extensive tests, I found that while the fallback mechanism works correctly in isolation, it might be failing in our scraper's execution context. I've implemented a manual fallback workaround that successfully resolves the issue.

The problem was that when running gutenberg2zim --lang ko with an empty ko.json file, the scraper was returning the raw translation key "metadata_defaults.title" instead of falling back to the English translation from en.json.

I checked the python-i18n repository and examined the core fallback implementation in i18n/translator.py. The fallback mechanism appears to be correctly implemented: when a key is not found in the current locale and the current locale is not the fallback locale, it recursively calls t() with the fallback locale. I created multiple test scenarios to verify the fallback behavior: a basic fallback test with an empty ko.json, a simulation of our actual flow, pre-loading the empty file, and using the actual locales directory. All tests confirmed that the fallback works correctly in isolation, so the library's fallback mechanism seems to be functioning as designed.
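To make the described behavior concrete, here is a toy sketch of a recursive fallback lookup. It is not the library's code: TRANSLATIONS, FALLBACK and this t() are stand-ins written only from the description above.

```python
from typing import Optional

# Toy data (assumption): "ko" exists but is empty, "en" holds the key.
TRANSLATIONS = {
    "en": {"metadata_defaults.title": "Project Gutenberg Library"},
    "ko": {},
}
FALLBACK: Optional[str] = "en"


def t(key: str, locale: str, fallback: Optional[str] = FALLBACK) -> str:
    """Look up `key` in `locale`; retry once in `fallback`; else return the key."""
    value = TRANSLATIONS.get(locale, {}).get(key)
    if value is not None:
        return value
    if fallback is not None and locale != fallback:
        # Key missing in the current locale: recurse with the fallback locale.
        return t(key, fallback, fallback)
    # No fallback available (or already in it): the raw key is returned.
    return key
```

In this model, a fallback of None skips the retry entirely, so the raw key comes back even though en has a translation.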

However, when testing with the actual scraper execution, the fallback was not working. The logs revealed that the fallback locale was None instead of "en" when accessed in the scraper's execution context, even though we set it to "en" in setup_i18n(). This could be because the fallback configuration wasn't being properly read in the scraper's context, or there might be some state issue with how python-i18n's configuration was being accessed, or the fallback mechanism might not have been triggered due to the None value.

I implemented a manual fallback mechanism in our t() wrapper function that explicitly switches to the fallback locale and looks up the translation directly when the current locale doesn't have it, then restores the original locale.

After implementing the manual fallback, the scraper now works correctly. The logs show a successful fallback: 'metadata_defaults.title' not found in 'ko', using 'Project Gutenberg Library' from 'en' and 'metadata_defaults.description' not found in 'ko', using 'All books in English from the first producer of free Ebooks' from 'en', and the final output shows Writing kor ZIM for Project Gutenberg Library. Empty locale files now correctly fall back to English translations.

While python-i18n's fallback mechanism is correctly implemented and works in isolation, there appears to be an issue with how the fallback configuration is accessed or maintained in our scraper's execution context. The manual fallback does not rely on python-i18n's internal fallback mechanism: it explicitly switches to the fallback locale and looks up the translation directly. This might make it a robust workaround that ensures translations always fall back correctly, regardless of python-i18n's internal state or configuration issues.

Dec 21 '25 07:12 VikramAditya33

P.S. After code review, I added a thread lock (threading.Lock()) around the locale switching logic in the manual fallback. The scraper uses multiprocessing.dummy.Pool (which is actually a thread pool) for concurrent book processing, and python-i18n uses global state for the locale. Without the lock, when one thread temporarily switches locales for fallback lookup, other threads could see incorrect translations. The lock ensures thread-safe locale switching during fallback lookups.

Dec 21 '25 07:12 VikramAditya33

I was still not convinced, so I had a look at the i18nice source code, and the explanation looks way simpler.

Have a look at this: https://github.com/solaluset/i18nice/blob/42e7782b4ea922afe27142f56879c80e0824cf94/i18n/config.py#L79-L80

In the scraper, we call i18n.set("locale", "en") and then immediately i18n.set("fallback", "en"), which in fact sets the "fallback" to None.

The fix seems pretty obvious to me: in change_locale, we should call i18n.set("locale", lang) and then i18n.set("fallback", "en").
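To illustrate the ordering problem without touching the real library, here is a toy model. ToyConfig is hypothetical and encodes only the behavior described above, under the assumption that setting "fallback" to the value of the current "locale" stores None:

```python
from typing import Dict, Optional


class ToyConfig:
    """Toy model, NOT the real library. Assumption: setting "fallback" to the
    same value as the current "locale" stores None instead."""

    def __init__(self) -> None:
        self.settings: Dict[str, Optional[str]] = {"locale": "en", "fallback": None}

    def set(self, key: str, value: Optional[str]) -> None:
        if key == "fallback" and value == self.settings["locale"]:
            value = None  # a fallback identical to the locale is dropped
        self.settings[key] = value


def setup_then_switch(lang: str) -> Dict[str, Optional[str]]:
    """Current flow: fallback is set while locale is still "en"."""
    cfg = ToyConfig()
    cfg.set("locale", "en")
    cfg.set("fallback", "en")  # stored as None in this model
    cfg.set("locale", lang)    # fallback is never set again
    return cfg.settings


def switch_with_fix(lang: str) -> Dict[str, Optional[str]]:
    """Proposed flow: set the locale first, then the fallback."""
    cfg = ToyConfig()
    cfg.set("locale", lang)
    cfg.set("fallback", "en")
    return cfg.settings


broken = setup_then_switch("ko")
fixed = switch_with_fix("ko")
```

In this model, broken ends up with no fallback while fixed keeps "en", which matches the None fallback seen in the scraper logs.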

I've just pushed a PR.

Dec 22 '25 08:12 benoit74

@benoit74 I deeply apologize for the approach I was using; I was looking at the wrong source code the whole time 😭 and couldn’t think of a simpler solution. I’m really sorry about that. Thanks a lot for taking the time to look into it yourself and for pushing the PR. I really appreciate it, and I’ll be more careful next time to double‑check the library source and avoid overcomplicating things.

Dec 22 '25 08:12 VikramAditya33