Future TODO: Replace references to Wiktionary/Wikipedia "language" with "edition" where appropriate
I just came across a post-it note I wrote for myself to remind me that this should be done at some point, but it's small enough not to really matter and big enough to be a pain, so I'm putting it here so that it doesn't get completely forgotten.
Using "language" and "lang" in Wiktextract and Wikitextprocessor is a bit confusing, because it can mean two things: the language of the entry we are looking at in the Wiktextract data (an entry under ===Franco-Provençal===), or the specific Wiktionary or Wikipedia project that the data comes from (en.wiktionary.org, zh.wiktionary.org).
We also have things like simple.wikipedia.org, which shows that there can be other prefixes that are less "lang"-ish (not that Simple English isn't "lang"-ish...) and more specialized, or that have nothing specific to do with the language the project is written in.
Tatu said he'd want these references changed to "edition" at some point to minimize confusion, and "edition" is a perfect name for this.
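To make the distinction concrete, here is a rough sketch of how the two meanings could be kept apart in code. The names `EntryRef` and `edition_from_host` are made up for illustration and are not part of Wiktextract or Wikitextprocessor; the sketch also assumes that the edition is simply the first label of the host name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EntryRef:
    """Hypothetical record that keeps the two meanings of "lang" apart."""
    edition: str    # project prefix, e.g. "en" for en.wiktionary.org
    lang_code: str  # language of the entry itself, e.g. "frp"


def edition_from_host(host: str) -> str:
    """Return the first label of a host name such as "zh.wiktionary.org"."""
    return host.split(".", 1)[0]


# A Franco-Provençal entry found in the English edition of Wiktionary.
ref = EntryRef(edition=edition_from_host("en.wiktionary.org"), lang_code="frp")
```

With this naming, "simple" is just another edition value and nothing implies that it has to be a language code.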
I had a question and I hope this is the appropriate place to ask it. What's the difference between the keys "lang" and "lang_code"? Is "lang_code" just an encoding of the "lang" value, so that there is a one-to-one correspondence between the two? Or are they being used for different purposes?
Also, maybe related: on the Wiktionary main page, the languages are listed at the top, and "English" shows 8.373 million+ entries. What "entries" is it referring to specifically?
The raw data extract file from this kaikki.org page (20GB file) currently has 10,026,780 JSON objects (one per line). If I take the subset of this file with lang_code "en" or "mul", it contains 1,447,090 JSON objects. So what does the "8 million+ entries" on the Wiktionary main page refer to?
Thanks!
"lang" is the language name, and "lang_code" is the code for that language, usually the ISO 639-1 code where one exists.
The 8M number is probably the count of all word pages, in all languages, on the English Wiktionary.
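If you want to check the counts and the "lang" / "lang_code" correspondence yourself, something like the following sketch should work. It assumes the extract is a JSON Lines file (one JSON object per line) where each object has "lang" and "lang_code" keys; the function name `summarize` and the sample rows are made up for illustration.

```python
import json
from collections import Counter, defaultdict


def summarize(lines):
    """Count objects per lang_code and collect the lang names seen for each code."""
    counts = Counter()
    names = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        obj = json.loads(line)
        code = obj.get("lang_code")
        counts[code] += 1
        names[code].add(obj.get("lang"))
    return counts, names


# Tiny in-memory sample standing in for the real file (illustrative rows only).
sample = [
    '{"word": "cat", "lang": "English", "lang_code": "en"}',
    '{"word": "dog", "lang": "English", "lang_code": "en"}',
    '{"word": "chat", "lang": "French", "lang_code": "fr"}',
]
counts, names = summarize(sample)

# Codes that map to more than one name would show the mapping is not one-to-one.
ambiguous = {code: langs for code, langs in names.items() if len(langs) > 1}
```

For the real data you would pass an open file instead of the sample list, for example `summarize(open(path, encoding="utf-8"))`, where `path` points at the downloaded file. Reading line by line keeps memory use low even for a 20GB file.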