Language names and language codes: connecting to a big database (rather than slow enrichment of custom list)
The problem: Language diversity is an important dimension of the diversity of datasets. To find one's way around datasets, being able to search by language name and by standardized codes appears crucial.
Currently the list of language codes is here, right? At about 1,500 entries, it covers roughly a quarter of the world's extant languages. (Probably less, as the list of 1,418 entries contains variants that are linguistically very close: 108 varieties of English, for instance.)
Looking forward to ever increasing coverage, how will the list of language names and language codes improve over time? Enrichment of the custom list by HFT contributors (like here) has several issues:
- progress is likely to be slow (input required from reviewers, etc.)
- the more contributors, the less consistency can be expected among contributions. No need to elaborate on how much confusion is likely to ensue as datasets accumulate.
- there is no information on which language relates with which: no encoding of the special closeness between the languages of the Northwestern Germanic branch (English+Dutch+German etc.), for instance. Information on phylogenetic closeness can be relevant to run experiments on transfer of technology from one language to its close relatives.
A solution that seems desirable: Connecting to an established database that (i) aims at full coverage of the world's languages and (ii) has information on higher-level groupings, alternative names, etc. It takes a lot of hard work to build such databases. Two important initiatives are Ethnologue (ISO standard) and Glottolog. Both have pros and cons. Glottolog contains references to Ethnologue identifiers, so adopting Glottolog entails getting the advantages of both sets of language codes.
Both seem technically accessible & 'developer-friendly'. Glottolog has a GitHub repo. For Ethnologue, harvesting tools have been devised (see here; I did not try it out).
In case a conversation with linguists seemed in order here, I'd be happy to participate ('pro bono', of course), & to rustle up more colleagues as needed, to help this useful development happen. With appreciation of HFT,
Thanks for opening this discussion, @alexis-michaud.
As the language validation procedure is shared with other Hugging Face projects, I'm tagging them as well.
CC: @huggingface/moon-landing
On the Hub side, there is no fine-grained validation: we just check that `language` contains an array of lowercase strings between 2 and 3 characters long =) and that `language_bcp47` is just an array of strings.
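For concreteness, the check described above amounts to roughly the following (an illustrative Python sketch; the Hub's actual validation code is in Node.js and is not reproduced here):

```python
import re

def check_language_metadata(metadata: dict) -> bool:
    """Loose check mirroring the rule described above (illustration only)."""
    language = metadata.get("language", [])
    language_bcp47 = metadata.get("language_bcp47", [])
    # `language`: an array of lowercase strings 2 to 3 characters long
    language_ok = all(
        isinstance(tag, str) and re.fullmatch(r"[a-z]{2,3}", tag) for tag in language
    )
    # `language_bcp47`: just an array of strings
    bcp47_ok = all(isinstance(tag, str) for tag in language_bcp47)
    return language_ok and bcp47_ok

# Example: check_language_metadata({"language": ["jya", "nru"], "language_bcp47": ["x-japh1234"]}) -> True
```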
The only page where we have a hardcoded list of languages is https://huggingface.co/languages and I've been thinking of hooking that page up to an external database of languages (so any suggestion is super interesting), but it's not used for validation.
That being said, in datasets, this file https://github.com/huggingface/datasets/blob/main/src/datasets/utils/resources/languages.json is not really used, no? Or just in the tagging tool? What about just removing it?
also cc'ing @lbourdois who's been active and helpful on those subjects in the past!
PS @alexis-michaud is there a DB of language codes you would recommend? That would contain all ISO 639-1, 639-2 or 639-3 codes and be kept up to date, and ideally that would be accessible as a Node.js npm package?
cc @albertvillanova too
Many thanks for your answer!
The Glottolog database is kept up to date, and has information on the closest ISO code for each Glottocode. So providing a clean table with equivalences sounds (to me) like something perfectly reasonable to expect from their team. To what extent would pyglottolog fit the bill / do the job? (API documentation here) I'm reaching my technical limitations here: I can't assess the distance between what they offer and what the HF team needs. I have opened an Issue in their repo.
Very interested to see where it goes from there.
I just tried pyglottolog to generate a file with all the current IDs (first column), by running `glottolog languoids` inside the glottolog repository.
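For anyone who wants to reproduce this, here is a minimal sketch using the pyglottolog API rather than the command line (assuming pyglottolog is installed and `./glottolog` is a local clone of the glottolog/glottolog repository; the attribute names follow the pyglottolog API as I understand it):

```python
# Minimal sketch: dump all current Glottolog languoid IDs (glottocodes) with their names.
from pyglottolog import Glottolog

glottolog = Glottolog("./glottolog")  # path to a local clone of the glottolog repository
with open("glottocodes.tsv", "w", encoding="utf-8") as out:
    for languoid in glottolog.languoids():
        out.write(f"{languoid.id}\t{languoid.name}\n")
```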
Greetings @alexis-michaud and others, I think perhaps a standards-based approach here would help everyone out, at both the technical and social layers of this innovation.
Let me say a few things:
- there are multiple kinds of assets in AI that should have associated language codes.
- AI Training Data sets
- AI models
- AI outputs

These are all distinct components which should be tagged for the language and encoding methods they operate on or enhance. For example, an AI-based cross-language tool from French to English (UK variety) still needs to consider whether it is operating on oral language speech or written text. This is where IANA language sub-tags come in and are so important. I link to the official source. If one wants to use middleware such as a Python package or npm package to manage strings, then please make sure those packages keep their codes updated as the codes are revised. I see that @julien-c mentioned BCP-47. BCP-47 is the current standard for language tagging. Following it will make the resources you create more findable and let future users better understand or expect any biases which may have been introduced in the different AI-based products.
- BCP-47 is a technical read. However, you will notice that it identifies when to use an ISO 639-1, ISO 639-2, or ISO 639-3 code. This is important for interoperability with many systems. If you are using library systems then you should likely just stick with ISO 639-3 codes.
- If you are going to use Glottolog codes, use them after an `-x-` tag in the BCP-47 format to maintain BCP-47 validity.
- You should source ISO 639-3 codes directly from the ISO 639-3 registrar, as these codes are updated annually, usually in February or March. ISO 639-3 codes have multiple classes: `Active`, `Deprecated`, and `Unassigned`. This means that string length checking is not a sufficient strategy for validation (see the sketch at the end of this list).
- The names of smaller languages often change depending on the language used to describe them. The ISO 639-2 documentation has a list of language names, for languages with smaller populations, in the languages in which descriptions of these languages are often written: for example, ISO 639-2's documentation contains the names of languages as they are used in French, German, and English. ISO 639-2 is rarely updated, as it is now tied to ISO 639-3's evolution, and modern systems should just use ISO 639-3; but these additional names of languages in other languages may not appear in the ISO 639-3 tables.
- Glottolog codes are also updated at least annually. Usually sometime after ISO 639-3 updates.
- If the material is in a written mode, please indicate which script is used, unless the IANA registry has a `Suppress-Script` value for that language. Please use the script tag that BCP-47 calls for, from ISO 15924. This also updates at least annually.
- Another great place to look for language names is the Unicode CLDR database for locales. These ought to be congruent with ISO 639-3, but sometimes CLDR has additional references to languages (such as the French name for a language) which are not contained in ISO 639-2 or ISO 639-3.
- Wikidata for language names is not always a great source of authoritative information. Language names are asymmetrical. Many times they are contrived because there is no actual name for the language in the referring language: e.g. French doesn't have a name for every language in the world, so often they say something like "the language of the 'x' people"; English does the same. When a language name standard does not have the best name for a language, the best way to handle that is to make a change request with the standards registrar. Keeping track of the source list and the version of your source list for your language codes is very important.
- Finally, it would be a great service to technologists, minority language communities, and linguists if, for all resources of the three types mentioned in point 1 above, you added a record to OLAC. I can help you with that. OLAC is a search interface for language resources.
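As a concrete illustration of the validation point above, here is a hedged Python sketch that checks a candidate code against the ISO 639-3 code table itself (using the iso-639-3.tab download cited later in this thread; the column names `Id` and `Ref_Name` are as published by SIL):

```python
import csv
import urllib.request

ISO_639_3_TAB = "https://iso639-3.sil.org/sites/iso639-3/files/downloads/iso-639-3.tab"

def load_iso639_3_codes() -> dict:
    """Return a mapping {code: reference name} from the ISO 639-3 code table."""
    with urllib.request.urlopen(ISO_639_3_TAB) as response:
        lines = response.read().decode("utf-8").splitlines()
    reader = csv.DictReader(lines, delimiter="\t")
    # 'Id' is the ISO 639-3 code, 'Ref_Name' its reference name; retired codes are kept in a separate table.
    return {row["Id"]: row["Ref_Name"] for row in reader}

codes = load_iso639_3_codes()
print("heb" in codes)  # True: an assigned ISO 639-3 code
print("iw" in codes)   # False: a deprecated two-letter ISO 639-1 code, not an ISO 639-3 code
```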
Hi everybody!
About the point:
also cc'ing @lbourdois who's been active and helpful on those subjects in the past!
Discussions on the need to improve the Hub's tagging system (applying to both datasets and models) can be found in the following discussion: https://github.com/huggingface/hub-docs/issues/193. Once this system has been redone and satisfies the identified needs, a redesign of the Languages page would also be relevant: https://github.com/huggingface/hub-docs/issues/194. I invite you to read them. But as a quick summary, the exchanges were oriented towards the ISO standard (the first HF system was based on it, and it is generally the standard indicated in AI/DL papers), favouring ISO 639-1 if it exists and falling back to ISO 639-2 or ISO 639-3 if it doesn't. In addition, it is possible to add BCP-47 tags to cover existing varieties/regionalisms within a language (https://huggingface.co/datasets/AmazonScience/massive/discussions/1). If a language does not belong to any of these standards, then a request should be made to the HF team to add it manually.
To return to the present discussion, thank you for the various databases and methodologies you mention. It makes a big difference to have linguists in the loop.
I have a couple of questions where I think an expert perspective would be appreciated:
- Do you think it's possible to easily handle tags that have been deprecated, potentially for decades? For example (I'm taking the case of Hebrew, but this has happened for other languages), I tagged Google models with the "iw" tag because I based it on what the authors gave in their paper (see table 6, page 12). It turns out that this ISO tag has in fact been deprecated since 1989 in favour of the "he" tag. It would therefore be necessary to have a verification step that transforms old tags into the most recent ones.
- When you look up a language on Wikipedia, it usually shows, in addition to the ISO standard, the codes in the Glottolog (which you have already mentioned), ELP and Linguasphere databases. Would you have any opinion about these two other databases?
- On the Hub, there is the following dataset where French people speak in English: https://huggingface.co/datasets/Datatang/French_Speaking_English_Speech_Data_by_Mobile_Phone Is there a database to take this case into account? I have not found any code in the Glottolog database. If based on the IETF BCP-47 standard, I would tend to tag the dataset with "en-fr", but would this be something accepted by linguists? Based on the first post in this thread, that there are about 8,000 languages, if one considers that a given language can be pronounced by a speaker of the other 7,999, that would theoretically make about 64 million BCP-47 language1-language2 codes. And even many more if we consider regionalisms, with language1_regionalism_x-language2_regionalism_y. I guess there is no such database.
- Are there any databases that take into account all the existing sign languages in the world? It would be nice to have them included on the Hub.
- Is there an international classification of languages? A bit like the International Classification of Diseases in medicine, which is established by the WHO and used as a reference throughout the world. The idea would be to have a precise number of languages to which we would then have to assign a unique tag in order to find them later.
- Finally, for the CNRS team: when can we expect to see all the datasets of Pangloss on HF? And I don't know if you have a way to also help add the datasets of CoCoON.
I invite you to read them. But as a quick summary, the exchanges were oriented towards the ISO standard (the first HF system was based on it and it is generally the standard indicated in AI/DL papers) by favouring ISO 639-1 if it exists, and fallback to ISO 639-2 or ISO 639-3 if it doesn't. In addition, it is possible to add BCP-47 tags to consider existing varieties/regionalisms within a language (https://huggingface.co/datasets/AmazonScience/massive/discussions/1). If a language does not belong to either of these two standards, then a request should be made to the HF team to add it manually.
One comment on this fall-back system (which generally follows the BCP-47 process). ISO 639-2 has some codes which refer to a language ambiguously. For example, I believe the code `ara` is used for Arabic. In some contexts Arabic is considered a single language; however, Egyptian Arabic is quite different from Moroccan Arabic, and both are considered separate languages. These ambiguous codes are valid ISO 639-3 codes, but they have a special status: they are called macro codes. They exist inside the ISO 639-3 standard to provide absolute fallback compatibility between ISO 639-2 and ISO 639-3. However, when considering AI and MT applications with language data, given the unforeseen potential applications and the potential for bias, macro codes should be avoided for new applications of language tags to resources. For historical cases where it is not clear what resources were used to create the AI tools or datasets, I understand the use of ambiguous tags. So for clarity in language tagging I suggest:
- Strictly following BCP-47
- Whenever possible avoid the use of macro tags in the ISO 639-3 standard. These are BCP-47 valid, but could introduce biases in the application of their use in society. (Generally there are more specific tags available to use in the ISO 639-3 standard; see the sketch just below for expanding a macro code into its individual members.)
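To make the macro-code point concrete, here is a hedged sketch of how a macro code such as `ara` could be expanded into its more specific member codes. It assumes the macrolanguage mapping table from the ISO 639-3 download set; the file name iso-639-3-macrolanguages.tab and its columns M_Id, I_Id, I_Status are my recollection of the published format, so please verify against the registrar's site.

```python
import csv
from collections import defaultdict

def load_macrolanguage_members(path: str = "iso-639-3-macrolanguages.tab") -> dict:
    """Map each macrolanguage code (M_Id) to its active individual member codes (I_Id)."""
    members = defaultdict(list)
    with open(path, encoding="utf-8") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            if row["I_Status"] == "A":  # keep active individual codes only
                members[row["M_Id"]].append(row["I_Id"])
    return members

# Example: load_macrolanguage_members()["ara"] should include arb (Standard Arabic),
# arz (Egyptian Arabic) and ary (Moroccan Arabic), among others.
```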
- Are there any databases that take into account all the existing sign languages in the world? It would be nice to have them included on the Hub.
Sign languages present an interesting case. As I understand the situation, the identification of sign languages has been identified as a component of their endangerment. Some sign languages do exist in ISO 639-3. For further discussion on the issue I refer readers to the following publications:
- https://doi.org/10.3390/languages7010049
- https://www.academia.edu/35870983/The_ethics_of_of_language_identification_and_ISO_639
One way to be BCP-47 compliant and identify a sign language which is not identified in any of the BCP-47 referenced standards is to use the ISO 639-3 code for undetermined language, `und`, and then apply a custom suffix indicator (as explained in BCP-47), `-x-`, and a custom code, such as the ones used in https://doi.org/10.3390/languages7010049
- Is there an international classification of languages? A bit like the International Classification of Diseases in medicine, which is established by the WHO and used as a reference throughout the world. The idea would be to have a precise number of languages to which we would then have to assign a unique tag in order to find them later.
Yes that would be the function of ISO 639-3. It is the reference standard for languages. It includes a code and its name and the status of the code. Many technical metadata standards for file and computer interoperability reference it, many technical library metadata standards reference it. Some linguists use it. Many governments reference it.
Indexing diseases is different from indexing languages in several ways; one way is that diseases are the impact of a pathogen, not the pathogen itself. If we take COVID-19 as an example, there are many varieties of the pathogen, but broadly speaking there is only one disease, with many symptoms.
- When you look up a language on Wikipedia, it usually shows, in addition to the ISO standard, the codes in the Glottolog (which you have already mentioned), ELP and Linguasphere databases. Would you have any opinion about these two other databases?
While these do appear on Wikipedia, I don't know of any information system which uses these codes. I do know that Glottolog did import ELP data at one time, and its database does contain ELP data; I'm not sure if Glottolog regularly ingests new versions of ELP data. I suspect that the use of Linguasphere data may be relevant to users of Wikidata as a linked-data attribute, but I haven't heard of any linked-data projects using Linguasphere data for analysis or product development. My impression is that it is fairly unused.
- Do you think it's possible to easily handle tags that have been deprecated potentially for decades? For example (I'm taking the case of Hebrew but this has happened for other languages) I tagged Google models with the "iw" tag because I based it on what the authors gave in their paper see table 6 page 12). It turns out that this ISO tag has in fact been deprecated since 1989 in favour of the "he" tag. It would therefore be necessary to have a verification that transforms the old tags into the most recent ones.
Yes. You can parse the IANA file linked to above (it is regularly updated). All deprecated tags are marked as such in that file. The new preferred tag, if there is one, is indicated. ISO 639-3 also indicates a code's status, but their list is relevant only to codes within their domain (ISO 639-3).
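For illustration, here is a hedged Python sketch of that parsing step, building a map from deprecated tags to their preferred replacements out of the IANA language subtag registry (the registry is a plain-text file of records separated by "%%", with fields such as Subtag, Deprecated and Preferred-Value):

```python
import urllib.request

REGISTRY_URL = "https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry"

def deprecated_to_preferred() -> dict:
    """Map deprecated subtags to their Preferred-Value, per the IANA registry."""
    text = urllib.request.urlopen(REGISTRY_URL).read().decode("utf-8")
    mapping = {}
    for record in text.split("%%"):
        fields = dict(
            line.split(": ", 1) for line in record.strip().splitlines() if ": " in line
        )
        # Grandfathered full tags use a "Tag:" field instead of "Subtag:" and are ignored here.
        if "Subtag" in fields and "Deprecated" in fields and "Preferred-Value" in fields:
            mapping[fields["Subtag"]] = fields["Preferred-Value"]
    return mapping

# Example: deprecated_to_preferred().get("iw") should give "he".
```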
- On the Hub, there is the following dataset where French people speak in English: https://huggingface.co/datasets/Datatang/French_Speaking_English_Speech_Data_by_Mobile_Phone Is there a database to take this case into account? I have not found any code in the Glottolog database. If based on an IETF BCP-47 standard, I would tend to tag the dataset with "en-fr" but would this be something accepted by linguists?
I would interpret `en-fr` as English as spoken in France. `fr` in this position refers to the geo-political entity, not a second language. I see no reason that other linguists should have a different opinion after having read BCP-47 and understood how it works.
The functional goal here is to tag a language resource as being produced by non-native speakers, while tagging both languages. There are several problems here. The first is that BCP-47 has no explicit way to do this. One could use the sub-code `x-` with a private-use code to indicate a second language and infer some meaning as to that language's role. However, there is another problem here which complexifies the situation greatly: how do we know that those English speakers (in France, or from France, or who were native French speakers) were not speaking their third or fourth language rather than their second language? So a sub-tag which indicates the first language of a speech act for speakers in a second (or other) language would need to be carefully crafted. It might then be proposed to the appropriate authorities.
There are three registered sub-tags out of the 35 that BCP-47 allows. These are `x-`, `u-`, and `t-`. `u-` and `t-` are defined in RFC 6067 and RFC 6497. For more information see the Unicode CLDR documentation, where it says:
IETF BCP 47 Tags for Identifying Languages defines the language identifiers (tags) used on the Internet and in many standards. It has an extension mechanism that allows additional information to be included. The Unicode Consortium is the maintainer of the extension 'u' for Locale Extensions, as described in rfc6067, and the extension 't' for Transformed Content, as described in rfc6497.
The subtags available for use in the 'u' extension provide language tag extensions that provide for additional information needed for identifying locales. The 'u' subtags consist of a set of keys and associated values (types). For example, a locale identifier for British English with numeric collation has the following form: en-GB-u-kn-true
The subtags available for use in the 't' extension provide language tag extensions that provide for additional information needed for identifying transformed content, or a request to transform content in a certain way. For example, the language tag "ja-Kana-t-it" can be used as a content tag that indicates Japanese Katakana transformed from Italian. It can also be used as a request for a given transformation.
For more details on the valid subtags for these extensions, their syntax, and their meanings, see LDML Section 3.7 Unicode BCP 47 Extension Data.
Hi @lbourdois ! Many thanks for the detailed information.
Discussions on the need to improve the Hub's tagging system (applying to both datasets and models) can be found in the following discussion: huggingface/hub-docs#193

Fascinating topic! To me, the following suggestion has a lot of appeal: "if we consider that it was necessary to create ISO 639-3 because ISO 639-1 was deficient, the logical move would be to do the reverse and thus convert the tags from ISO 639-1 to ISO 639-2 or 3 (https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes or https://iso639-3.sil.org/code_tables/639/data)".
Yes, ISO 639-1 is unsuitable because it has so few codes: less than 200. To address linguistic diversity in 'unrestricted mode', a list of all languages is wanted.
The idea of letting people use their favourite nomenclature and automatically adding the ISO 639-3 three-letter code as a tag is appealing. Thus all the HF datasets would have three-letter language tags (handy for basic search), alongside the authors' preferred tags and language names (including Glottolog tags as well as ISO 639-{1, 2}, and all other options allowed by BCP-47).
Retaining the authors' original tags and language names would be best.
- For language names: some people favour one name over another and it is important to respect their choice. In the case of Yongning Na: alternative names include 'Mosuo', 'Narua', 'Eastern Naxi'... and the names carry implications: people have been reported to come to blows about the use of the term 'Mosuo'.
- For language tags: Glottocodes can be more fine-grained than Ethnologue (ISO 639-3), and some colleagues feel strongly about those.
Thus there would be a BCP-47 tag (sounds like a solid technical choice, though not 'passer-by-friendly': requiring some expertise to interpret) plus an ISO 639-3 tag that could be grabbed easily, and (last but not least) language names spelled out in full. Searches would be easier. No information would be lost.
Are industry practices so conservative that many people are happy with two-letter codes, and consider ISO 639-3 three-letter codes an unnecessary complication? That would be a pity, since there are so many advantages to using longer lists. (Somewhat like the transition to Unicode: sooo much better!) But maybe that conservative attitude is widespread, and it would then need to be taken into account. In which case, one could consider offering two-letter codes as a search option. Internally, the search engine would look up the corresponding 3-letter codes, and produce the search results accordingly.
Now to the other questions:
- Do you think it's possible to easily handle tags that have been deprecated potentially for decades? For example (I'm taking the case of Hebrew but this has happened for other languages) I tagged Google models with the "iw" tag because I based it on what the authors gave in their paper (see table 6, page 12). It turns out that this ISO tag has in fact been deprecated since 1989 in favour of the "he" tag. It would therefore be necessary to have a verification that transforms the old tags into the most recent ones.

I guess that the above suggestion takes care of this case. The original tag (in this example, "iw") is retained (facilitating cross-reference with the published paper, and respecting the way the dataset was originally tagged). This old tag goes into the `BCP-47` field of the dataset, which can handle quirks & oddities like this one. And a new tag is added in the `ISO 639-3` field: the 3-letter code "heb".
- When you look up a language on Wikipedia, it usually shows, in addition to the ISO standard, the codes in the Glottolog (which you have already mentioned), ELP and Linguasphere databases. Would you have any opinion about these two other databases?
I'm afraid I never heard about Linguasphere. The online register for Linguasphere (PDF) seems to be from 1999-2000. It seems that the level of interoperability is not very high right now. (By contrast, Glottolog has pyglottolog and in my experience contacts flow well.)
The Endangered Languages Project is something Google started but initially did not 'push' very strongly, it seems. Just airing an opinion on the public Internet: it seems that the project is now solidly rooted at the University of Hawaiʻi at Mānoa. It seems that they do not generate codes of their own. They refer to ISO 639-3 (Ethnologue) as a code authority when applicable, and otherwise provide comments in so many words, such as that language L currently lacks an Ethnologue code of its own (example here).
- On the Hub, there is the following dataset where French people speak in English: https://huggingface.co/datasets/Datatang/French_Speaking_English_Speech_Data_by_Mobile_Phone Is there a database to take this case into account? I have not found any code in the Glottolog database. If based on an IETF BCP-47 standard, I would tend to tag the dataset with "en-fr" but would this be something accepted by linguists? Based on the first post in this thread that there are about 8000 languages, if one considers that a given language can be pronounced by a speaker of the other 7999, that would theoretically make about 64 million BCP-47 language1-language2 codes existing. And even much more if we consider regionalists with language1_regionalism_x-language2_regionalism_y. I guess there is no such database.
Yes, you noted the difficulty here: that there are so many possible situations. Eventually, each dataset would require descriptors of its own. @BenjaminGalliot points out that, in addition to specifying the speakers' native languages, the degree of language proficiency would also be relevant. How many years did the speakers spend in which area? Talking which languages? In what chronological order? Etc. The complexity defies encoding. The purpose of language codes is to allow for searches that group resources into sets that make sense. Additional information is very important, but would seem to be a matter for 'comments' fields.
- Is there an international classification of languages? A bit like the International Classification of Diseases in medicine, which is established by the WHO and used as a reference throughout the world. The idea would be to have a precise number of languages to which we would then have to assign a unique tag in order to find them later.
As I understand, Ethnologue and Glottolog both try to do that, each in its own way. The simile with diseases seems interesting, to some extent: in both cases it's about human classification of phenomena that have complexity (though some diseases are simpler than others, whereas all languages have much complexity, in different ways).
Three concerns: (i) Technical specifications: we have not yet received feedback on the Japhug and Na datasets in HF. There may be technical considerations that we have not yet thought of and that would need to be taken into account before a 'bulk upload'. (ii) Would there be a way to automate the process? The way @BenjaminGalliot did it for Japhug and Na, there was a manual component involved, and doing it by hand for all 200 datasets would not be an ideal workflow, given that the metadata are all clearly arranged. (iii) Some datasets are currently under a 'No derivatives' Creative Commons license. We could go back to the depositors and argue that the 'No derivatives' clause would best be omitted (see here a similar argument about publications): again, we'd want to be sure about the way forward before we set the process in motion.
Our hope would be that some colleagues try out the OutilsPangloss download tool, assemble datasets from Pangloss/Cocoon as they wish, then deposit them to HF.
The idea of letting people use their favourite nomenclature and automatically adding the ISO 639-3 three-letter code as a tag is appealing. Thus all the HF datasets would have three-letter language tags (handy for basic search), alongside the authors' preferred tags and language names (including Glottolog tags as well as ISO 639-{1, 2}, and all other options allowed by BCP-47).
Retaining the authors' original tags and language names would be best.
- For language names: some people favour one name over another and it is important to respect their choice. In the case of Yongning Na: alternative names include 'Mosuo', 'Narua', 'Eastern Naxi'... and the names carry implications: people have been reported to come to blows about the use of the term 'Mosuo'.
- For language tags: Glottocodes can be more fine-grained than Ethnologue (ISO 639-3), and some colleagues feel strongly about those.
Thus there would be a BCP-47 tag (sounds like a solid technical choice, though not 'passer-by-friendly': requiring some expertise to interpret) plus an ISO 639-3 tag that could be grabbed easily, and (last but not least) language names spelled out in full. Searches would be easier. No information would be lost.
@alexis-michaud raises an excellent point. Language Resource users have varying search habits (or approaches). This includes cases where two or more language names refer to a single language. A search utility/interface needs to be flexible and able to present results from various kinds of input in the search process. This could be like how the terms French/Français/Französisch (en/fr/de) are names for the same language, or it could be a variety of the following: autoglottonyms (how the speakers of the language refer to their language), or exoglottonyms (how others refer to the language). Additionally, in web-based searches I have also needed to implement diacritic-sensitive and -insensitive logic so that users can type with or without diacritics and not have results unnecessarily excluded.
Depending on how detailed a search problem HF seeks to solve, it may be better to offload complex search to search engines like OLAC, which aggregate a lot of language resources. As I mentioned above, I can assist with the informatics of creating an OLAC feed.
Abstracting search logic from actual metadata may prove a useful way to lower the technical debt overhead. Technical tools and library standards use ISO and BCP-47 Standards. So, from a bibliographic metadata perspective this seems to be the way forward with the widest set of use cases.
To get a visual idea of these first exchanges, I coded a Streamlit app that I put online on Spaces: https://huggingface.co/spaces/lbourdois/Language-tags-demo. The code is in Python, so I don't know if it can be used by HF, which seems to need something in Node.js, but it serves as a proof of concept. The advantage is also that you can directly test ideas by entering things in a search bar and seeing what comes up.
This application is divided into 3 points:
- The first is to enter a language in natural language to get its code, which can then be filled in the YAML of the README.md files of HF datasets or models so that they can be referenced and found by everyone. In practice, enter the language you are interested in (e.g. `English`) to get its associated tag (e.g. `en`). You can enter several languages by separating them with a comma (e.g. `French,English,German`). Priority is given to the ISO 639-3 code if it exists, otherwise the Glottocode or the BCP-47 code (for varieties in particular). If none of these codes are available, it links to a page where the user can contact HF to request that this tag be added. If you enter a BCP-47 code, it must be entered as follows: `Language(Territory)`, for example `French(Canada)`. Attention! If you enter a BCP-47 language, it must be entered first, otherwise a wrong code will be displayed. I have to fix this problem, but I am moving to a new place, I don't always have an internet connection, and I prefer to push this first version so that you can already test things now and not have to wait days or weeks. This point is intended to simulate the user's side of the equation: which tag should I fill in for my language?
- The second is to enter a language code to obtain the name of the language in natural language. In practice, enter the tag you are interested in (ISO 639-1/2/3, Glottolog or BCP-47; e.g. `fra`) to get its associated language (e.g. French). You can enter several codes by separating them with a comma (e.g. `fra,eng,deu`). Attention! If you enter a BCP-47 code, it must be entered first, otherwise a wrong code will be displayed; same as the other bug above (it's actually the same one). This point is intended to simulate HF's side: for a given tag, return the correct language.
To code these two points, I tested two approaches.
- The first one (internal DB in the app) consists in querying a database that HF would host locally. To create this database, I merged the ISO 639 database (https://iso639-3.sil.org/sites/iso639-3/files/downloads/iso-639-3.tab) and the Glottolog database (https://glottolog.org/meta/downloads). The result of this merge is visible in the 3rd point of the application, which is an overview of the database.
In the image below, on line 1 of the database, we can see that the Glottocode database gives an ISO 639-3 code (column ISO639P3code) but not the ISO 639 database (column 639-3). Do you have an explanation for this phenomenon?
For BCP-47 codes of the type `fr-CA`, I have retrieved the ISO 3166-1 alpha-2 codes of the territories (https://www.iso.org/iso-3166-country-codes.html).
In practice, what I do when `fr-CA` is entered is this: the letters before the `-` refer to a language in the `Name` column for a `639-1` == `fr` (or `639-3` for `fra` or `fre`) in the base shown in my image above. Then I look at the letters after the `-`, which refer to a territory. It comes out as `French (Canada)`. I used https://cldr.unicode.org/translation/displaynames/languagelocale-name-patterns for the pattern.
- The second approach (with langcodes lib in the app) consists in using the Python `langcodes` library (https://github.com/rspeer/langcodes), which offers a lot of features in ready-made functions. It handles, for example, deprecated codes and the validity of an entered code, and gives language names from a code in the language of your choice (by default in English, but also autoglottonyms), etc. I invite you to read the README of the library. The only negative point is that it hasn't been updated for 10 months, so basing your tag system on an external tool that isn't necessarily up to date can cause problems in the long run. But it is certainly an interesting source.
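For a quick feel of what the library offers, here is a small sketch (assuming `pip install langcodes language_data`; exact outputs depend on the installed versions):

```python
import langcodes

# Deprecated tags are normalized to their preferred values:
print(langcodes.standardize_tag("iw"))                   # 'he'
# Tags are parsed and rendered as human-readable names:
print(langcodes.Language.get("fr-CA").display_name())    # 'French (Canada)'
# Names can also be requested in another language (an autoglottonym when asking in that language):
print(langcodes.Language.get("fra").display_name("fr"))  # 'français'
```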
Finally, I have added some information on the number of people speaking/reading the language(s) searched (figures provided by langcodes, which are based on those given by ISO). This is not relevant for our topic, but these figures could be added as information on the https://huggingface.co/languages page.
What could be done to improve the app if I have time:
- Write the text for the app's homepage to describe what it does. This could serve as a basis for a documentation that I think will be necessary to add somewhere on the HF website to explain how the language tagging system works.
- Deal with the bug mentioned above
- Integrate ISO 3166-2 territories (https://www.iso.org/obp/ui#iso:pub:PUB500001:en)? They offer a finer granularity than ISO 3166-1, which is limited to the country level, but they are very administrative (for France, ISO 3166-2 gives us the "départements", for example).
- Add autoglottonyms? (I only handle English language names for the moment)
- For each language, indicate to which family it belongs; in practice this could help with data augmentation, but especially with classifying the languages and finding them more easily on the page https://huggingface.co/languages.
Very impressive! Using the prompt 'Japhug' (a language name), the app finds the intended language.
A first question: based on the Glottocode, would it be possible to grab the closest ISO 639-3 code? In case there is no match for the exact language variety, one needs to explore the higher-level groupings, level by level. For this language (Japhug), the information provided in the extracted CSV file (`glottolog-languoids-v4.6.csv`) is:
`sino1245/burm1265/naqi1236/qian1263/rgya1241/core1262/jiar1240`
One need not look further than the first higher-level grouping, `jiar1240`, to get an ISO 639-3 code, namely `jya`.
Thus users searching by language names would get ISO 639-3 codes (often less fine-grained than Glottolog) as a bonus. It might be possible to ask the Glottolog team to provide this piece of information as part of an export from their database.
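In the meantime, here is a hedged sketch of that 'walk up the classification' idea with pyglottolog (assuming a local clone of the glottolog repository; the attribute names follow the pyglottolog API as I understand it):

```python
from pyglottolog import Glottolog

def nearest_iso639_3(repo_path: str, glottocode: str):
    """Return the ISO 639-3 code of the languoid itself, or of its closest ancestor that has one."""
    glottolog = Glottolog(repo_path)
    languoid = glottolog.languoid(glottocode)
    if languoid is None:
        return None
    # `ancestors` is ordered from the top-level family down to the parent,
    # so reverse it to climb from the parent upwards.
    for node in [languoid] + list(reversed(languoid.ancestors)):
        if node.iso:
            return node.iso
    return None

# Example: nearest_iso639_3("./glottolog", "japh1234") should return 'jya'
# (no ISO code on japh1234 itself, but its parent grouping jiar1240 carries one, per the discussion above).
```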
on line 1 of the database, we can see that the Glottocode database gives an ISO 639-3 code (column ISO639P3code) but not the ISO 639 database (column 639-3). Do you have an explanation for this phenomenon?
That is because the language name 'Aewa' is not found in the Ethnologue (ISO 639-3) export that you are using. This export in table form only has one reference name (`Ref_Name`). For the language at issue, it is not 'Aewa' but 'Awishira'.
By contrast, the language on line 0 of the database is called 'Abinomn' by both Ethnologue and Glottolog, and accordingly, columns `ISO639P3code` and `639-3` both contain the ISO 639-3 code, `bsa`.
The full Ethnologue database records alternate names for each language, and I'd bet that 'Aewa' is recorded among alternate names for the 'Awishira' language. I can't check because the full Ethnologue database is paywalled.
Glottolog does provide the corresponding ISO 639-3 code for 'Aewa', `ash`, which is an exact match (it refers to the same variety as Glottolog `abis1238`).
In this specific case, Glottolog provides all the relevant information. I'd say that Glottolog can be trusted for all the codes they provide, including ISO 639-3 codes: they only include them when the match is good.
See previous comment about the cases where there is no exact match between Glottolog and ISO 639-3 (suggested workaround: look at a higher-level grouping to get an ISO 639-3 code).
I will add these two points to my TODO list.
- Since Glottolog can be trusted, I will add a condition to the code: if there is no ISO 639-3 code in the "official" database (https://iso639-3.sil.org/sites/iso639-3/files/downloads/iso-639-3.tab), look for it in the "ISO639P3code" column of Glottolog.
- For the point of adding the closest ISO 639-3 code for a Glottolog code, what convention should be adopted for the output? Just the ISO 639-3 code, or the ISO 639-3 code - Glottolog code, or the ISO 639-3 code - language name? To use the example of Japhug, should it be just `jya`, or `jya-japh1234`, or `jya-Japhug`?
- Integrate ISO 3166-2 territories (https://www.iso.org/obp/ui#iso:pub:PUB500001:en)? They offer a finer granularity than ISO 3166-1, which is limited to the country level, but they are very administrative (for France, ISO 3166-2 gives us the "départements", for example).
I'm concerned with this sort of exploration. Not because I am against innovation; in fact this is an interesting thought exercise. However, to explore this thought further creates cognitive dissonance between BCP-47 authorized codes and other code sets which are not BCP-47 compliant. For that reason, I think adding additional codes is a waste of time both for HF devs and for future users who get a confusing idea about language tagging.
Good job for the application!
On the Hub, there is the following dataset where French people speak in English: https://huggingface.co/datasets/Datatang/French_Speaking_English_Speech_Data_by_Mobile_Phone Is there a database to take this case into account? I have not found any code in the Glottolog database. If based on an IETF BCP-47 standard, I would tend to tag the dataset with "en-fr" but would this be something accepted by linguists? Based on the first post in this thread that there are about 8000 languages, if one considers that a given language can be pronounced by a speaker of the other 7999, that would theoretically make about 64 million BCP-47 language1-language2 codes existing. And even much more if we consider regionalists with language1_regionalism_x-language2_regionalism_y. I guess there is no such database.
Yes, you noted the difficulty here: that there are so many possible situations. Eventually, each dataset would require descriptors of its own. @BenjaminGalliot points out that, in addition to specifying the speakers' native languages, the degree of language proficiency would also be relevant. How many years did the speakers spend in which area? Talking which languages? In what chronological order? Etc. The complexity defies encoding. The purpose of language codes is to allow for searches that group resources into sets that make sense. Additional information is very important, but would seem to be a matter for 'comments' fields.
To briefly complete what I said on this subject in a private discussion group: there is a lot of (meta)data associated with each element of a corpus (which language level, according to which criteria, knowing that even among native speakers there are differences, some of which may go beyond what seems obvious to us from a linguistic point of view, such as socio-professional category, life history, environment in the broad sense, etc.), which can be placed in ad-hoc columns, or more freely in a comment/note column. And it is the role of the researcher (in this case a linguist, in all likelihood) to do analyses (statistics...) to determine the relevant data, including criteria that may justify separating different languages (in the broad sense), making separate corpora, etc. Putting this information in the language code is, in my opinion, doing the job in the opposite and wrong direction, as well as bringing other problems, like where to stop in the list of multidimensional criteria to be integrated. So in my opinion, here, the minimum is the best (the important thing is to have well-documented data, globally, by sub-corpus or by line)...
If you are going to use Glottolog codes use them after an -x- tag in the BCP-47 format to maintain BCP-47 validity.
Yes, for the current corpora, I have written:
language:
- jya
- nru
language_bcp47:
- x-japh1234
- x-yong1288
- Add autoglottonyms? (I only handle English language names for the moment)
Autoglossonyms are useful (I use them in preference to other glossonyms), but I'm not sure there is an easy way to retrieve them. We can find some of them in the "Alternative Names" panel of Glottolog, but even if we had an API to retrieve them easily, their associated language code will often not be the one we are working on (hence the need to do several cycles to find one, which might not be the right one...). Maybe this problem needs more investigation...
For the point of adding the closest ISO 639-3 code for a Glottolog code, what convention should be adopted for the output? Just the ISO 639-3 code, or the ISO 639-3 code - Glottolog code, or the ISO 639-3 code - language name? To use the example of Japhug , should it be just jya, or jya-japh1234 or jya-Japhug?
I strongly insist not to add a language name after the code, it would restart a spiral of problems, notably the choice of the language in question:
- the autoglossonym: in my opinion the best choice, but you have to know it…
- the English name: iniquitous,
- the name in the administratively/politically dominant language of the target language if it is relevant (strictly localized without overlapping, for example): iniquitous and tendentious (and in a way a special case of the previous one)...
- etc.
To get a visual idea of these first exchanges, I coded a Streamlit app that I put online on Spaces: https://huggingface.co/spaces/lbourdois/Language-tags-demo. The code is in Python so I don't know if it can be used by HF who seems to need something in Node.js but it serves as a proof of concept. The advantage is also that you can directly test ideas by enter things in a search bar and see what comes up.
This is really great. You're doing a fantastic job. I love watching the creative process evolve. It is exciting. Let me provide some links to some search interfaces for further inspiration. I always find it helpful to know how others have approached a problem when figuring out my approach. I will link to three examples: Glottolog, r12a's language sub-tag chooser, and the FLEx project builder wizard. The first two are online, but the last one is an application which must be downloaded and works only on Windows or Linux. I have placed some notes on each of the screenshots.
- FLEx Language Chooser | application page
In practice, what I do when `fr-CA` is entered is this: the letters before the `-` refer to a language in the `Name` column for a `639-1` == `fr` (or `639-3` for `fra` or `fre`) in the base shown in my image above. Then I look at the letters after the `-`, which refer to a territory. It comes out as `French (Canada)`. I used https://cldr.unicode.org/translation/displaynames/languagelocale-name-patterns for the pattern.
What you are doing is looking at the algorithm for locale generation rather than BCP-47's original documentation. I'm not sure there are differences; there might be. I know that locale IDs generally follow BCP-47, but I think there are some differences, such as the use of `_` vs. `-`.
A first question: based on the Glottocode, would it be possible to grab the closest ISO 639-3 code? In case there is no match for the exact language variety, one needs to explore the higher-level groupings, level by level. For this language (Japhug), the information provided in the extracted CSV file (`glottolog-languoids-v4.6.csv`) is: `sino1245/burm1265/naqi1236/qian1263/rgya1241/core1262/jiar1240`. One need not look further than the first higher-level grouping, `jiar1240`, to get an ISO 639-3 code, namely `jya`. Thus users searching by language names would get ISO 639-3 codes (often less fine-grained than Glottolog) as a bonus. It might be possible to ask the Glottolog team to provide this piece of information as part of an export from their database.
This is logical, but the fine-grained assertions are not the same. That is, just because they are in a hierarchical structure today doesn't mean they will be tomorrow. In some cases Glottolog is clearly referring to sub-language variants which will never receive full language status, whereas in other cases Glottolog is referring to unequal entities, one or more of which should be a language. Many of the codes in Glottolog have no associated documentation indicating what sort of speech variety they are.
@lbourdois
- Since Glottolog can be trusted, I will add a condition to the code: if there is no ISO 639-3 code in the "official" database (https://iso639-3.sil.org/sites/iso639-3/files/downloads/iso-639-3.tab), look for it in the "ISO639P3code" column of Glottolog.
I'm confused here... if there is no ISO639-3 code in the official database from the registrar, why would you look for an "unofficial" code from someone else? What is the use case here?
For the point of adding the closest ISO 639-3 code for a Glottolog code, what convention should be adopted for the output? Just the ISO 639-3 code, or the ISO 639-3 code - Glottolog code, or the ISO 639-3 code - language name? To use the example of Japhug , should it be just jya, or jya-japh1234 or jya-Japhug?
(answer edited in view of Benjamin Galliot's comment)
Easy part of the answer first: `jya-Japhug` is out because, as @BenjaminGalliot pointed out above, mixing language names with language codes will make trouble. For Japhug, `jya-Japhug` looks rather good: the pair looks nice, the one (`jya`) packed together, the other (`Japhug`) good and complete while still pretty compact. But think about languages like 'Yongning Na' or 'Yucatán Maya': a code with a space in the middle, like `nru-Yongning Na`, is really unsightly and unwieldy, no?
Some principles for language naming in English have been put forward, with some linguistic arguments; but even supposing that such standardization is desirable, actual standardization of language names in English may well never happen.
As for `jya-japh1234`: again, at first sight it seems cute, combining two fierce competitors (Ethnologue and Glottolog) into something that gets the best of both worlds.
But @HughP has a point: "adding additional codes is a waste of time both for HF devs and for future users who get a confusing idea about language tagging". Strong wording, for an important comment: better stick with BCP 47.
So the solution pointed out by Benjamin, from Frances Gillis-Webber and Sabine Tittel, looks attractive: jya-x-japh1234
On the other hand, if the idea for HF Datasets is simply to add the closest ISO 639-3 code for a Glottolog code, maybe it could be provided simply in three letters: providing the 'raw' ISO 639-3 code `jya`. Availability of 'straight' ISO 639-3 codes could save trouble for some users, and those who want more detail could look at the rest of the metadata and general information associated with datasets.
The problem seems to have already been raised here: https://drops.dagstuhl.de/opus/volltexte/2019/10368/pdf/OASIcs-LDK-2019-4.pdf
An example can be seen here:
3.1.2 The use of privateuse sub-tag: In light of unambiguous language codes being available for the two Khoisan varieties, we propose to combine the ISO 639-3 code for the parent language Nǁng, i.e., 'ngh', with the privateuse sub-tag 'x-' and the respective Glottocodes stated above. The language tags for N|uu and ǁ'Au can then be defined accordingly: N|uu: ngh-x-nuuu1242; ǁ'Au: ngh-x-auni1243
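A tiny sketch of that tag-building convention, for illustration (the helper below is hypothetical; it simply combines the nearest ISO 639-3 code, or `und` when none exists, with a Glottocode after the private-use marker `-x-`):

```python
def bcp47_with_glottocode(iso_code, glottocode: str) -> str:
    """Build a BCP-47-valid tag carrying a Glottocode as a private-use subtag."""
    base = iso_code if iso_code else "und"  # 'und' = undetermined, per ISO 639-3 / BCP-47
    return f"{base}-x-{glottocode}"

print(bcp47_with_glottocode("jya", "japh1234"))   # 'jya-x-japh1234'
print(bcp47_with_glottocode("ngh", "nuuu1242"))   # 'ngh-x-nuuu1242', as in the quoted proposal
print(bcp47_with_glottocode(None, "abcd1234"))    # 'und-x-abcd1234' when no ISO 639-3 code exists
```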
By the way, while searching for this, I came across this application: https://huggingface.co/spaces/cdleong/langcode-search
- Since Glottolog can be trusted, I will add a condition to the code: if there is no ISO 639-3 code in the "official" database (https://iso639-3.sil.org/sites/iso639-3/files/downloads/iso-639-3.tab), look for it in the "ISO639P3code" column of Glottolog.
I'm confused here... if there is no ISO639-3 code in the official database from the registrar, why would you look for an "unofficial" code from someone else? What is the use case here?
Hi @HughP, I'm happy to clear up whatever confusion may exist here :innocent: Here is the use case. Guillaume Jacques (@rgyalrong) put together a sizeable corpus of the Japhug language. It is up on HF Datasets (here) as well as on Zenodo.
Zenodo is an all-purpose repository without adequate domain-specific metadata ("métadonnées métier"), and the deposits in there are not easy to locate. The Zenodo deposit is intended for a highly specific use case: someone reads about the dataset in a paper, goes to the address on Zenodo and grabs the dataset in one go.
HF Datasets, on the other hand, allows users to look around among corpora. The Japhug corpus needs proper tagging so that HF Datasets users can find out about it.
Japhug has an entry of its own in Glottolog, whereas it lacks an entry of its own in Ethnologue. Hence the practical usefulness of Glottolog. Ethnologue pools together, under the code `jya`, three different languages (Japhug, Tshobdun `tsho1240` and Zbu `zbua1234`).
I hope that this helps.
By the way, while searching for this, I came across this application: https://huggingface.co/spaces/cdleong/langcode-search
Really relevant Space, so tagging its author @cdleong, just in case!
@cdleong A one-stop shop for language codes: terrific! How do you feel about the use of Glottocodes? When searching the language names 'Japhug' and 'Yongning Na' (real examples, related to a HF Datasets deposit & various research projects), the relevant Glottocodes are retrieved, and that is great (and not that easy, notably with the space in the middle of 'Yongning Na'). But this positive result is 'hidden' in the results page. Specifically:
- for Japhug: when searching by language name ('Japhug'), the result in big print is 'Failure', even though there is an available Glottocode (at bottom).
When searching by Glottocode (japh1234), same outcome: 'Result: failure!' (even though this is the right Glottocode). When searching by x-japh1234 (the Glottocode encapsulated in BCP 47 syntax), one gets the message "'x-japh1234' parses meaningfully as a language tag according to IANA", but there is paradoxically no link provided to Glottolog: the 'Glottolog' part of the results page is empty.
- Yongning Na: the correct code is identified (yong1288) but instead of foregrounding this exact match, the first result that comes up is a completely different language, called 'Yong'.
Trying to formulate a conclusion (admittedly, this note is not based on intensive testing, it is just feedback on initial contact): from a user perspective, it seems that the tool could make more extensive use of Glottolog. `langcode-search` does a great job querying Glottolog, so why not make more extensive use of that information? (including: to arrive at the nearest ISO 639-3 code)