Use "terminology" ISO-639 codes
Currently we are using the ISO 639-2 "bibliographic" codes ("ger" for German). It seems to me that these are not terribly widely used and that they make compatibility with other multilingual resources poorer than it would otherwise be; i.e., one has to convert, as I did for #393. I would suggest instead that we switch to the more widely used ISO 639-3 codes, which are "terminological" ("deu" for German). This should be relatively straightforward.
Any thoughts @lfashby @ajmalanoski @agutkin @jacksonllee?
The bibliographic codes would be ISO-639-2/B, right?
I think migrating to ISO-639-3 makes sense.
Yeah @agutkin. For whatever reason that's what our ISO code library spits out. Haven't looked into it further.
Ha, interesting:
[…] the German language (Part 1: de) has two codes in Part 2: ger (T code) and deu (B code), whereas there is only one code in Part 2, eng, for the English language.
So I guess the ISO library does the right thing returning the terminologic Part 2 code for German, rather than bibliographic.
+1 for switching to / using ISO 639-3 codes consistently. I'd imagine German speakers would prefer seeing deu rather than ger, fra for French rather than fre, etc.
Makes sense. The reason we are using the ISO 639-2/B codes is that a naive research assistant (me) decided a year and a half ago that we should. This should be an easy fix: we just need to modify our ‘iso’ JSONs in scrape/lib to always point to the ISO 639-3 code. I originally built those JSONs off of this table.
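For reference, the conversion this fix implies can be sketched in a few lines. This is only an illustration, not the code in scrape/lib: the names `B_TO_T` and `to_iso639_3` are hypothetical, and the sketch assumes that any three-letter code not in the table is already identical to its ISO 639-3 code (true for individual languages, but not for the collective codes in ISO 639-2). The table lists the 20 languages whose bibliographic and terminological codes differ.

```python
# ISO 639-2/B codes that differ from their ISO 639-2/T (= ISO 639-3) codes.
# Hypothetical helper table; not taken from the WikiPron repo.
B_TO_T = {
    "alb": "sqi", "arm": "hye", "baq": "eus", "bur": "mya", "chi": "zho",
    "cze": "ces", "dut": "nld", "fre": "fra", "geo": "kat", "ger": "deu",
    "gre": "ell", "ice": "isl", "mac": "mkd", "mao": "mri", "may": "msa",
    "per": "fas", "rum": "ron", "slo": "slk", "tib": "bod", "wel": "cym",
}


def to_iso639_3(code: str) -> str:
    """Return the terminological code for a (possibly bibliographic) code.

    Assumption: codes missing from B_TO_T are passed through unchanged.
    """
    code = code.lower()
    return B_TO_T.get(code, code)


print(to_iso639_3("ger"))
print(to_iso639_3("eng"))
```

Applied to the values in the ‘iso’ JSONs, a pass like this would only touch the entries for those 20 languages.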
+1 for switching to the more widely used format
For other projects I also need ISO 639 language code look-up in Python, and so I've just released my own package for this purpose and planned on keeping the language codes in there up-to-date moving forward: https://github.com/jacksonllee/iso639
(Other similar Python packages, like the one we've been using in WikiPron, have been unmaintained for 6+ years. There's also pycountry, but it's not as lightweight as we'd want since it has all the other stuff apart from ISO 639 language codes.)
Happy to review a PR that resolves this ticket, however it's gonna be done. It looks like we should be able to simplify things in the repo quite a bit:
- Remove iso639_1-to-iso639_2.json, iso69_2.json, and unimorph_languages.json.
- Simplify languagecodes.py, e.g., by removing the entries with the comment "ISO 639-3 only".
That’s great Jackson, we’ll migrate to this. Is it on PyPI yet?
Is it on PyPI yet?
Yes -- https://pypi.org/project/python-iso639/
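A rough sketch of what the lookup might look like with that package, for anyone picking up the PR. It assumes the package is imported as `iso639` and exposes `Language.from_part2b` and a `part3` attribute, as its README describes; the wrapper name `part2b_to_part3` is hypothetical, and behavior for unknown codes is not handled here.

```python
try:
    # Installed with `pip install python-iso639`; imported as `iso639`.
    from iso639 import Language
except ImportError:  # package missing, or a different `iso639` is installed
    Language = None


def part2b_to_part3(code: str) -> str:
    """Map an ISO 639-2/B code such as "ger" to its ISO 639-3 code."""
    if Language is None:
        raise RuntimeError("python-iso639 is not installed")
    # Assumed API: Language.from_part2b(...) and the .part3 attribute.
    return Language.from_part2b(code).part3
```

With this in place, the hand-maintained JSON look-up tables mentioned above would no longer be the source of truth for code conversion.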