languages.json does not contain the same data as mw.language.fetchLanguageName?
Looking at these Lua errors, I dug down, and it's not user error: though the language code `"eml"` doesn't seem to be an ISO code(?), it should still work, because it can be found inside Module:wikimedia languages's data, which converts `"eml"` to `"egl"`. On the way there, though, the module checks that the code is a known language by calling `mw.language.isKnownLanguageTag(lang_code)`, which itself calls `mw.language.fetchLanguageName()` to check whether the code is in some kind of list of known language codes, but that's where my road ends and I can't figure out where the data comes from.
In any case, `wiktextract/wiktextract/data/en/languages.json` (our language code → language name(s) data file) doesn't have an entry for `"eml"`, which is causing our version of `mw.language.fetchLanguageName` to return `nil`.
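For concreteness, here is a minimal Python sketch of the lookup involved. The file path is from above, but the JSON shape (`{"code": ["Name", ...]}`) and the function name `fetch_language_name` are assumptions for illustration, not the actual wiktextract code:

```python
import json

# Hedged sketch of what our fetchLanguageName stand-in effectively does.
# The exact JSON layout is an assumption.
with open("wiktextract/wiktextract/data/en/languages.json") as f:
    LANGUAGES = json.load(f)  # assumed shape: {"en": ["English"], ...}

def fetch_language_name(code: str):
    names = LANGUAGES.get(code)
    # A missing entry yields None here, which surfaces as nil on the Lua side
    return names[0] if names else None

print(fetch_language_name("eml"))  # -> None: "eml" has no entry in the file
```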
This is a todo for myself in the future, but if anyone knows about this...
As the Wiktionary:Wikimedia language codes page explains, they use some nonexistent or retired language codes. An easy fix would be adding those Wikimedia codes to the `languages.json` file.
That commit looks good, so I've merged it. I'll leave this issue open, however, in case someone can figure out how to automatically pull the weird Wikimedia codes from somewhere without the need to compile that (admittedly pretty small) list by hand.
In the resulting JSON data, words in the Serbo-Croatian language are now listed under the `bs` code (for Bosnian) when they should be listed under `sh` (or maybe `hbs` for ISO compatibility). The Serbian (`sr`), Croatian (`hr`), and Bosnian languages are also completely missing.
The problem of Serbo-Croatian words being listed under `bs` is occurring because the patch to include Wikimedia codes adds four codes that all map to Serbo-Croatian in `LANGUAGES_BY_CODE`: `bs`, `hr`, `sh`, `sr`. When `LANGUAGES_BY_NAME` is created, the codes are gone through in alphabetical order and only the first is added, so we get Serbo-Croatian → `bs`, and the rest are skipped since Serbo-Croatian is already in the dict. `LANGUAGES_BY_NAME` is then used when processing page headings to arrive at the code.
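A minimal sketch of that first-wins inversion (the dict names follow the discussion above; the construction loop itself is an assumed illustration, not the actual wiktextract code):

```python
# Four Wikimedia codes all mapping to the same canonical name:
LANGUAGES_BY_CODE = {
    "bs": ["Serbo-Croatian"],
    "hr": ["Serbo-Croatian"],
    "sh": ["Serbo-Croatian"],
    "sr": ["Serbo-Croatian"],
}

LANGUAGES_BY_NAME = {}
for code in sorted(LANGUAGES_BY_CODE):      # alphabetical: bs, hr, sh, sr
    for name in LANGUAGES_BY_CODE[code]:
        if name not in LANGUAGES_BY_NAME:   # first code wins; the rest are skipped
            LANGUAGES_BY_NAME[name] = code

print(LANGUAGES_BY_NAME)  # {'Serbo-Croatian': 'bs'}
```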
I think that instead of adding the Wikimedia codes to wiktextract's language data, a different hack to address the original error messages that started this issue might be in order: simply change `isKnownLanguageTag` in wikitextprocessor to always return `true`.
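On the wikitextprocessor side, the change would be roughly this (a sketch only; the real function name and signature in wikitextprocessor may differ):

```python
def is_known_language_tag(code: str) -> bool:
    # Previous behavior, per this issue: a membership test against our data,
    #     return code in LANGUAGES_BY_CODE
    # Proposed behavior: accept everything, and let Module:wikimedia_languages
    # and Module:languages do the actual validation.
    return True
```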
Here is my reasoning. The kaikki URL for the original error messages is broken now, but they contain the string "The Wikimedia language code", and a search shows that these errors are coming from Module:interproject, which calls Module:wikimedia_languages's `getByCodeWithFallback`, which calls `getByCode`, which starts:
```lua
function export.getByCode(code)
	-- Only accept codes the software recognises
	if not mw.language.isKnownLanguageTag(code) then
		return nil
	end
```
In wikitextprocessor/wiktextract, `isKnownLanguageTag` gets turned into a Python function that checks whether a code is in `LANGUAGES_BY_CODE`, and since the Wikimedia codes were not in this dict at the time of this issue, `getByCode` returns `nil` from this `if` block, in turn causing Module:interproject to throw a Lua error. But if this `if` block were simply not present, then `getByCode` would have successfully looked up `eml` in Module:wikimedia_languages/data, and the Lua error wouldn't have been thrown by Module:interproject. It's only because we don't have access to the real MediaWiki `isKnownLanguageTag` that the problem arises; all the language data we need in this case is actually contained in the Wiktionary modules, which we do have access to.
In fact, that `if` block is pretty much extraneous, since `getByCode` will return `nil` anyway if a code is not found in either Module:wikimedia_languages or Module:languages. It also turns out that this is the only place `isKnownLanguageTag` ever gets called directly in all the modules of en wiktionary, at least according to this search.
I also found that the real `mw.language.isKnownLanguageTag()` consults this list of languages, so another approach would be to supply that list separately. But I wasn't sure whether that is the entirety of the data it uses. The relevant logic is here.
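If that list does turn out to be sufficient, the stub could stay an honest membership test instead of a constant `true`. A sketch under that assumption; `mediawiki_language_names.json` is a hypothetical export of that list, not an existing file:

```python
import json

with open("mediawiki_language_names.json") as f:
    # Assumed shape: a code -> name mapping, e.g. {"en": "English", ...}
    KNOWN_TAGS = set(json.load(f))

def is_known_language_tag(code: str) -> bool:
    return code in KNOWN_TAGS
```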
If we can get away with disabling `isKnownLanguageTag`, that would be great. So I've gone ahead and done it, and also disabled the adding of Wikimedia codes into `LANGUAGES_BY_CODE` to prevent them from polluting `LANGUAGES_BY_NAME`. Hopefully `isKnownLanguageTag` will fail gracefully, but there's also another reason this should be fine: we can kind of assume that this kind of error would already be found on the Wiktionary side of things, unless `isKnownLanguageTag` is used as a condition (which it shouldn't be, right? most of the time...), so hopefully all issues are fixed there before they become an issue here. hahaha.
> we can kind of assume that this kind of error would already be found on the Wiktionary side of things, unless `isKnownLanguageTag` is used as a condition (which it shouldn't be, right? most of the time...), so hopefully all issues are fixed there before they become an issue here. hahaha.
Yes, this is my thinking as well.