Use "terminology" ISO-639 codes

Open kylebgorman opened this issue 4 years ago • 9 comments

Currently we are using the ISO-639-2 "bibliographic" codes ("ger" for German). It seems to me that these are not terribly widely used and make compatibility with other multilingual resources poorer than it'd be otherwise; i.e., one has to convert, as I did for #393. I would suggest instead that we switch to the more widely used ISO-639-3 codes, which are "terminological" ("deu" for German). This should be relatively straightforward.

Any thoughts @lfashby @ajmalanoski @agutkin @jacksonllee?

kylebgorman avatar Apr 22 '21 18:04 kylebgorman

The bibliographic codes would be ISO-639-2/B, right?

I think migrating to ISO-639-3 makes sense.

agutkin avatar Apr 22 '21 18:04 agutkin

Yeah @agutkin. For whatever reason that's what our ISO code library spits out. Haven't looked into it further.

kylebgorman avatar Apr 22 '21 18:04 kylebgorman

Ha, interesting:

[…] the German language (Part 1: de) has two codes in Part 2: ger (B code) and deu (T code), whereas there is only one code in Part 2, eng, for the English language.

So I guess the ISO library does the right thing returning the bibliographic Part 2 code for German, rather than the terminologic one.
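
For reference, here is a minimal sketch of the conversion being discussed. ISO 639-2 has 20 languages whose bibliographic code differs from the terminological one, and the terminological code is the one ISO 639-3 uses. The names `B_TO_T` and `to_terminological` are made up for illustration and are not part of WikiPron or of any ISO code library.

```python
# The 20 ISO 639-2 languages whose bibliographic (B) code differs from the
# terminological (T) code. For every other language the two are identical.
B_TO_T = {
    "alb": "sqi", "arm": "hye", "baq": "eus", "bur": "mya", "chi": "zho",
    "cze": "ces", "dut": "nld", "fre": "fra", "geo": "kat", "ger": "deu",
    "gre": "ell", "ice": "isl", "mac": "mkd", "mao": "mri", "may": "msa",
    "per": "fas", "rum": "ron", "slo": "slk", "tib": "bod", "wel": "cym",
}


def to_terminological(code: str) -> str:
    """Returns the T code for a three-letter ISO 639-2 code.

    Codes without a separate B form are returned unchanged. No check is made
    that the input is a valid ISO 639 code (hypothetical helper).
    """
    code = code.lower()
    return B_TO_T.get(code, code)
```

With this, `to_terminological("ger")` gives `"deu"`, while a code such as `"eng"` passes through unchanged.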

agutkin avatar Apr 22 '21 18:04 agutkin

+1 for switching to / using ISO 639-3 codes consistently. I'd imagine German speakers would prefer seeing deu rather than ger, fra for French rather than fre, etc.

jacksonllee avatar Apr 22 '21 18:04 jacksonllee

Makes sense. The reason we are using the ISO 639-2/B codes is that a naive research assistant (me) decided we should a year and a half ago. This should be an easy fix: we just need to modify our ‘iso’ JSONs in scrape/lib to always point to the ISO 639-3 code. I originally built those JSONs off of this table.
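
A rough sketch of what such a one-off rewrite could look like, assuming (hypothetically) that the JSON file is a single object keyed by language code. The real layout of the files in scrape/lib may differ, and `rekey_to_iso639_3` is an invented name. The bibliographic-to-terminological mapping is passed in rather than hard-coded.

```python
import json
from typing import Mapping


def rekey_to_iso639_3(path: str, b_to_t: Mapping[str, str]) -> None:
    """Rewrites a JSON object in place so its top-level keys use ISO 639-3.

    Assumes (hypothetically) a file shaped like {"ger": {...}, "eng": {...}}.
    Keys that are absent from `b_to_t` are kept as they are.
    """
    with open(path, encoding="utf-8") as source:
        entries = json.load(source)
    rekeyed = {}
    for code, entry in entries.items():
        new_code = b_to_t.get(code, code)
        # Guards against a file that already contains both "ger" and "deu".
        if new_code in rekeyed:
            raise ValueError(f"duplicate entry for {new_code!r}")
        rekeyed[new_code] = entry
    with open(path, "w", encoding="utf-8") as sink:
        json.dump(rekeyed, sink, ensure_ascii=False, indent=4, sort_keys=True)
        sink.write("\n")
```

Raising on a duplicate key is a deliberate choice here: silently merging two entries for the same language would hide a data problem that is better fixed by hand.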

lfashby avatar Apr 22 '21 18:04 lfashby

+1 for switching to the more widely used format

ajmalanoski avatar Apr 22 '21 19:04 ajmalanoski

For other projects I also need ISO 639 language code look-up in Python, so I've just released my own package for this purpose and plan on keeping the language codes in there up-to-date moving forward: https://github.com/jacksonllee/iso639

(Other similar Python packages, like the one we've been using in WikiPron, have been unmaintained for 6+ years. There's also pycountry, but it's not as lightweight as we'd want since it has all the other stuff apart from ISO 639 language codes.)

Happy to review a PR that resolves this ticket, however it's gonna be done. It looks like we should be able to simplify things in the repo quite a bit:

jacksonllee avatar May 16 '22 14:05 jacksonllee

That’s great Jackson, we’ll migrate to this. Is it on PyPI yet?

kylebgorman avatar May 16 '22 16:05 kylebgorman

Is it on PyPI yet?

Yes -- https://pypi.org/project/python-iso639/

jacksonllee avatar May 16 '22 17:05 jacksonllee