languages.json does not contain the same data as mw.language.fetchLanguageName?

Open kristian-clausal opened this issue 1 year ago • 7 comments

Looking at these Lua errors, I tried to dig down, and it's not user error: although the language code "eml" does not seem to be an ISO code(?), it should still work, because it can be found in the data of Module:wikimedia languages, which converts "eml" to "egl".

On the way there, though, it checks that it is a known language by calling mw.language.isKnownLanguageTag(lang_code), which itself calls mw.language.fetchLanguageName() to check if it's in some kind of list of known language codes, but that's where my road ends and I can't figure out where the data comes from.

In any case, wiktextract/wiktextract/data/en/languages.json (which is our language code -> language name(s) data file) doesn't have an entry for "eml", which is causing our version of mw.language.fetchLanguageName to return nil.
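A minimal sketch of the failure described above, assuming a code -> name(s) mapping shaped like languages.json. The function names `fetch_language_name` and `is_known_language_tag` and the two sample entries are illustrative stand-ins, not the actual wikitextprocessor code:

```python
# Hypothetical excerpt of the code -> name(s) data; note there is no "eml" key.
LANGUAGES_BY_CODE = {
    "egl": ["Emilian"],
    "en": ["English"],
}

def fetch_language_name(code):
    """Stand-in for our mw.language.fetchLanguageName: None plays the role of Lua nil."""
    names = LANGUAGES_BY_CODE.get(code)
    return names[0] if names else None

def is_known_language_tag(code):
    """Stand-in for mw.language.isKnownLanguageTag: known only if a name is found."""
    return fetch_language_name(code) is not None

print(fetch_language_name("eml"))    # no entry, so the lookup yields None
print(is_known_language_tag("eml"))  # and the code is treated as unknown
print(is_known_language_tag("egl"))
```

Under these assumptions, any Wikimedia-only code that is missing from the data file is rejected before the Wiktionary modules get a chance to map it.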

This is a todo for myself in the future, but if anyone knows about this...

kristian-clausal avatar Mar 14 '23 13:03 kristian-clausal

As the Wiktionary:Wikimedia language codes page explains, they use some nonexistent or retired language codes. An easy fix would be adding those Wikimedia codes to the languages.json file.

xxyzz avatar Mar 14 '23 14:03 xxyzz

That commit looks good, so I've merged it. I'll leave this issue open, however, in case someone can figure out how to automatically pull the weird Wikimedia codes from somewhere without the need to compile that (admittedly pretty small) list by hand.

kristian-clausal avatar Mar 15 '23 06:03 kristian-clausal

In the resulting JSON data, words in the Serbo-Croatian language are now listed under the bs code (for Bosnian) when they should be listed under sh (or maybe hbs for ISO compatibility). The Serbian (sr), Croatian (hr) and Bosnian languages are also completely missing.

tdulcet avatar Apr 30 '23 12:04 tdulcet

Serbo-Croatian words are being listed under bs because the patch that includes the Wikimedia codes adds four codes that all map to Serbo-Croatian in LANGUAGES_BY_CODE: bs, hr, sh, sr. When LANGUAGES_BY_NAME is created, the codes are iterated in alphabetical order and only the first one is added, so we get Serbo-Croatian → bs; the rest are skipped since Serbo-Croatian is already in the dict. LANGUAGES_BY_NAME is then used when processing page headings to arrive at the code.
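The collision can be reproduced with a small sketch. The dict contents and the helper `build_languages_by_name` are hypothetical; this only mirrors the "first code alphabetically wins" behaviour described above, not the actual wiktextract code:

```python
# Hypothetical excerpt of the data after the Wikimedia codes were added.
LANGUAGES_BY_CODE = {
    "sh": ["Serbo-Croatian"],
    "bs": ["Serbo-Croatian"],
    "hr": ["Serbo-Croatian"],
    "sr": ["Serbo-Croatian"],
    "en": ["English"],
}

def build_languages_by_name(by_code):
    """Invert code -> names into name -> code, keeping the first code seen."""
    by_name = {}
    for code in sorted(by_code):          # alphabetical: bs, en, hr, sh, sr
        for name in by_code[code]:
            if name not in by_name:       # later codes for the same name are skipped
                by_name[name] = code
    return by_name

LANGUAGES_BY_NAME = build_languages_by_name(LANGUAGES_BY_CODE)
print(LANGUAGES_BY_NAME["Serbo-Croatian"])
```

In this sketch the name resolves to bs rather than sh, simply because bs sorts first.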

I think that, instead of adding the Wikimedia codes to wiktextract's language data, a different hack might be in order to address the error messages that started this issue: simply change isKnownLanguageTag in wikitextprocessor to always return true.

Here is my reasoning. The kaikki URL for the original error messages is broken now, but they contain the string "The Wikimedia language code", and a search shows that these errors are coming from Module:interproject, which is calling Module:wikimedia_languages's getByCodeWithFallback which calls getByCode, which starts:

function export.getByCode(code)
	-- Only accept codes the software recognises
	if not mw.language.isKnownLanguageTag(code) then
		return nil
	end

In wikitextprocessor/wiktextract, isKnownLanguageTag gets turned into a Python function that checks whether a code is in LANGUAGES_BY_CODE. Since the Wikimedia codes were not in this dict at the time of this issue, getByCode returns nil from this if block, which in turn causes Module:interproject to throw a Lua error. If this if block were simply not present, getByCode would have successfully looked up eml in Module:wikimedia_languages/data, and Module:interproject would not have thrown the Lua error. The problem arises only because we don't have access to the real MediaWiki isKnownLanguageTag; all the language data we need in this case is contained in the Wiktionary modules, which we do have access to.

In fact, that if block is pretty much extraneous, since getByCode will return nil anyway if a code is not found in either Module:wikimedia_languages or Module:languages. Also, it turns out this is the only place where isKnownLanguageTag is called directly in any module of the English Wiktionary, at least according to this search.
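To illustrate why the guard is redundant, here is a simplified Python model of the lookup. The two dicts and the function names are hypothetical stand-ins for Module:wikimedia_languages/data and Module:languages, whose real structure is richer than this:

```python
# Hypothetical stand-ins for the data held in the Wiktionary modules.
WIKIMEDIA_DATA = {"eml": "egl"}                   # Wikimedia code -> Wiktionary code
LANGUAGES = {"egl": "Emilian", "en": "English"}   # Wiktionary code -> name

def get_by_code(code, is_known_language_tag):
    """Simplified model of getByCode with a pluggable isKnownLanguageTag."""
    if not is_known_language_tag(code):   # the guard at the top of the function
        return None
    if code in WIKIMEDIA_DATA:            # Wikimedia-specific mapping
        return WIKIMEDIA_DATA[code]
    if code in LANGUAGES:                 # ordinary Wiktionary code
        return code
    return None                           # unknown codes fall through to None anyway

def strict_check(code):
    """Models the old behaviour: only codes in our own data are 'known'."""
    return code in LANGUAGES

def always_true(code):
    """Models the proposed hack: accept every code."""
    return True

print(get_by_code("eml", strict_check))   # blocked by the guard
print(get_by_code("eml", always_true))    # resolved via the Wikimedia data
print(get_by_code("zzz", always_true))    # still rejected, by the lookups themselves
```

With the guard always passing, a genuinely unknown code is still rejected by the two lookups, which is the sense in which the check adds nothing in this model.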

jmviz avatar May 01 '23 02:05 jmviz

I also found that the real mw.language.isKnownLanguageTag() consults this list of languages, so another approach would be to supply that list separately. But I wasn't sure whether it is the entirety of the data it uses. The relevant logic is here.

jmviz avatar May 01 '23 03:05 jmviz

If we can get away with disabling isKnownLanguageTag, that would be great. So I've gone ahead and done it, and also disabled the adding of Wikimedia codes into LANGUAGES_BY_CODE to prevent it from polluting LANGUAGES_BY_NAME. Hopefully isKnownLanguageTag will fail gracefully, but there's also another reason this should be fine: we can kind of assume that this kind of error would already be found on the Wiktionary side of things, unless isKnownLanguageTag is used as a condition (which it shouldn't be, right? most of the time...), so hopefully all issues are fixed there before they become an issue here. hahaha.

kristian-clausal avatar May 02 '23 07:05 kristian-clausal

we can kind of assume that this kind of error would already be found on the Wiktionary side of things, unless isKnownLanguageTag is used as a condition (which it shouldn't be, right? most of the time...), so hopefully all issues are fixed there before they become an issue here. hahaha.

Yes, this is my thinking as well.

jmviz avatar May 02 '23 11:05 jmviz