pkp-lib Align UI locales with Weblate locales

In the CRAFT OA project (see https://github.com/pkp/pkp-lib/issues/9425), the submission locales will be separated from the UI locales. The decision was made to use the Weblate locales (see https://github.com/WeblateOrg/language-data/blob/main/languages.csv) for submission locales and to align the UI locales with them as well.

The CRAFT OA project has identified the following mapping between the current UI locales and Weblate locales: 'be@cyrillic' => 'be', 'bs' => 'bs_Latn', 'fr_FR' => 'fr', 'nb' => 'nb_NO', 'sr@cyrillic' => 'sr_Cyrl', 'sr@latin' => 'sr_Latn', 'uz@cyrillic' => 'uz', 'uz@latin' => 'uz_Latn', 'zh_CN' => 'zh_Hans'.
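As a rough illustration, the mapping above could be applied with a simple lookup table. This is only a sketch in Python, not the pkp-lib implementation; the names `LEGACY_TO_WEBLATE` and `to_weblate_locale` are hypothetical, and passing unmapped codes through unchanged is an assumption.

```python
# Legacy UI locale codes and their Weblate equivalents, copied from the list above.
LEGACY_TO_WEBLATE = {
    "be@cyrillic": "be",
    "bs": "bs_Latn",
    "fr_FR": "fr",
    "nb": "nb_NO",
    "sr@cyrillic": "sr_Cyrl",
    "sr@latin": "sr_Latn",
    "uz@cyrillic": "uz",
    "uz@latin": "uz_Latn",
    "zh_CN": "zh_Hans",
}


def to_weblate_locale(locale: str) -> str:
    """Return the Weblate code for a legacy locale.

    Assumption: codes that are not in the mapping are already valid
    Weblate codes and are returned unchanged.
    """
    return LEGACY_TO_WEBLATE.get(locale, locale)
```

For example, `to_weblate_locale("sr@latin")` gives `"sr_Latn"`, while a code such as `"de"` is left as it is.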

Feb 07 '24 11:02 bozana

@bozana I think the languages.csv might be the same that you can extract from ResourceBundle::getLocales('').

I think it makes sense, I raised this concern when the locales were merged, because we would lose the "country" of the submission (@marcbria).

Feb 07 '24 12:02 jonasraoni

@jonasraoni I do not think it is the same, see this comparison: https://docs.google.com/spreadsheets/d/1EFs2cr7Tw2lwR_JVIQHcnqXg91tVdJna8NMSBdLVW_Q/edit?usp=sharing

For example, Belarusian is listed with script variants in Weblate, whereas the script variants are missing from ResourceBundle::getLocales(''). On the other hand, the Weblate list is missing things like es_ES (only es is mentioned), which the ResourceBundle::getLocales('') list does include.

Feb 09 '24 07:02 ajnyga

The spreadsheet is a good basis for making a decision! :) I personally support using an official variant, even if it's missing one thing or another, as it's more likely to fit external systems.

Feb 09 '24 11:02 jonasraoni

Here is a comparison of ResourceBundle::getLocales('') and Weblate languages.csv. The differences were far bigger than I expected. https://craft-test.online/languageComparison/

Also if this is easier to read https://www.diffchecker.com/0HTZe7UH/

Mar 01 '24 14:03 ajnyga

I think this comparison https://craft-test.online/languageComparison/comparison3.php gives a fairly good idea of the differences between ResourceBundle::getLocales('') and Weblate languages.csv:

The locales they have in common (277) are probably the ones that are most used.
ResourceBundle::getLocales('') is missing a lot of locales (462) which do not even have a close alternative. This applies especially to three-letter locales.
languages.csv is missing even more locales (528), BUT most of these have an alternative.
- The main reason for the missing locales is that languages.csv does not have that many country-specific locales, so most of the smaller languages only have a 2-letter or a 3-letter code. For example, fi exists but fi_FI does not.
- There are only a handful of locales missing entirely, like dav, agq, sbp, yav.
- Some bigger languages like es and fr lack the country-specific locales es_ES and fr_FR, and these are maybe cases where it would make sense to have the ability to specify a country variant.
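The kind of comparison summarised above can be sketched with plain set operations. This is not the script behind the linked comparison pages; the function name, the result keys, and the sample codes are illustrative assumptions, and "has an alternative" is taken here to mean only that the bare language code exists in the Weblate list.

```python
def compare_locale_lists(icu: set[str], weblate: set[str]) -> dict[str, set[str]]:
    """Split two locale lists into common, one-sided, and 'has alternative' groups."""
    only_icu = icu - weblate
    # An ICU-only locale counts as having an alternative when its language
    # part (the text before the first underscore) is present in the Weblate list.
    with_alternative = {loc for loc in only_icu if loc.split("_")[0] in weblate}
    return {
        "common": icu & weblate,
        "missing_from_icu": weblate - icu,
        "missing_from_weblate": only_icu,
        "missing_with_alternative": with_alternative,
        "missing_entirely": only_icu - with_alternative,
    }


# Tiny illustrative sample, not the real lists.
icu_sample = {"fi", "fi_FI", "es", "es_ES", "be", "dav"}
weblate_sample = {"fi", "es", "be", "be_Latn"}
result = compare_locale_lists(icu_sample, weblate_sample)
```

With this sample, fi_FI and es_ES end up in `missing_with_alternative`, and dav is the only entry in `missing_entirely`.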

For me this is a clear indication that the Weblate list would work better here, although I do understand @jonasraoni's comment about using an official variant. The important thing here is that the Weblate list locales are formed according to a standard.

Ideally we could try to include the missing country-specific locales OR consider hosting our own languages.csv list.

Mar 02 '24 08:03 ajnyga

First, thank you AJ for taking the time to go through all this and give us easy to digest summaries. Your patience and generosity with your time is commendable.

We've discussed it in various places, but I'll put my position on record in this thread. In short, I am convinced that whatever standard is chosen, we must guarantee three things:

The encoding must allow a specific code for each dialect (existing or potential).
The coding does NOT force us to define hierarchies between languages ("es vs es_MX").
We can modify this list as and when we need to (without relying on third parties).

The reasons? To promote equality between the different languages, to avoid representing them from a colonialist point of view, to encourage and facilitate the task of translators and to have total autonomy to decide, as a project, on a topic as relevant as the localisation of PKP applications.

That said, we could use the weblate list as a starting point and create our own with the changes we consider appropriate?

In this sense, I suggest eliminating any reference to codes without region and (at least in the interface) I would always use the regionalised code (es -> es_ES) instead.

The proposal I am making should be accompanied by developments in line with this:

That this code is set at the time of installation (allowing, for example, the administrator to make an informed decision to choose "Spanish from Spain" if the rest of the Spanish translations are not sufficiently complete), but that end users are shown only the language ("Spanish"), not the dialect.
That the default translations plugin is activated by default and allows the user to define the "fallbacks" used to complete the translations (for example es_ES > es_MX > es_US).
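To make the fallback idea concrete, here is a minimal sketch of resolving a translation key through an ordered chain such as es_ES > es_MX > es_US. It does not describe how the default translations plugin works; the function name, the catalog structure, and the sample keys are all hypothetical.

```python
def translate(key: str, fallback_chain: list[str], catalogs: dict[str, dict[str, str]]) -> str:
    """Return the first non-empty translation found along the fallback chain."""
    for locale in fallback_chain:
        message = catalogs.get(locale, {}).get(key)
        if message:
            return message
    # Assumption: when no locale in the chain has the key, return the key itself.
    return key


# Hypothetical catalogs: es_ES is incomplete, es_MX fills one gap.
catalogs = {
    "es_ES": {"common.save": "Guardar"},
    "es_MX": {"common.save": "Guardar", "common.cancel": "Cancelar"},
    "es_US": {},
}
chain = ["es_ES", "es_MX", "es_US"]
```

Here `translate("common.cancel", chain, catalogs)` is answered by es_MX because es_ES lacks the key, and the order of the chain stays entirely under the administrator's control.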

Mar 02 '24 11:03 marcbria

Just to underline:

My comparison is there to answer the question of which list/source we should use for providing the options for the new Submission language/locale selection. Here I think it is important that we allow journals to choose whether they want to use just a two-letter language code like "fr" or to specify a dialect like "fr_CA" for their metadata. In any case, most of the places where the metadata ends up do not support the dialect, but of course might in the future.
What UI languages we provide and how we define them is another question which can of course be discussed here.

Mar 04 '24 07:03 ajnyga

Apologies. I was catching up on this thread (whose title talks about UI) and I forgot to go into the metadata issue.

Although I don't really have a clear opinion on this part. Short answer: In metadata we should allow both?

I'm reasoning out loud here, so if I say something stupid, let me know.

As most upstreams do not take regionality into account, I suspect that for metadata it is not so important to define it and, if the admin so wishes (I think it is something the Editor should not be able to change), we should allow languages (i.e. "fr" without region code).

But I understand that if some admin considers that it is relevant for the journal to indicate the region, from a perspective respectful of linguistic diversity, the tool should allow the region to be indicated?

In any case, I wouldn't ask about this with every submission and it should be a global parameter, to be defined once during the installation (or to be modified later by the admin... but VERY carefully).

In this sense, the code-lang selector demo you made some months ago (accompanied by a short explanation of the real impact of the decision they are making) sounds like a great solution to me, as long as it lets you stop at the level of detail you require.

Does it make sense to you?

Mar 04 '24 09:03 marcbria

I am now starting to work on this issue. We decided to use Weblate locales for the submission and metadata locales (see issue ...) as well as for the rest of the system. As far as I can see, we have used the sokil library to get the translated locale display names as well as to convert between different ISO codes. I think we can now use the PHP intl functions (e.g. locale_get_display_name) to get the translated locale display names, so there is no need to use sokil for this any more. However, we will still use the sokil library to convert locales into different ISO codes (mostly used in third-party services). Tagging @jonasraoni for his opinion, because he worked on the current Locale* implementation and may see/know something I haven't seen yet :-)

Sep 06 '24 08:09 bozana

@jonasraoni, could you please review the pkp-lib and ojs/omp/ops PRs above? The other PRs are mostly just the renaming of the locale folders. I am not sure about the changes in the ui-library -- I adapted the code, but I am not sure where those parts of the code are used, or whether some of them are needed at all (maybe I can double-check with Jarda when he is back). Thanks a lot!

EDIT: I now use the functions from the PHP intl library, for example to get the language names. Lots of the language names there start with a lowercase letter. Maybe we can discuss whether we would like to leave it this way once you have taken a look at the changes/code here...

Oct 29 '24 13:10 bozana

Additional commits after the code review:

pkp-lib (4 commits)
ojs (3 commits)
omp (2 commits)
tinymce (1 commit)
citationStyleLanguage (1 commit)
googleScholar (1 commit)
crossref-ojs (1 commit)
crossref-ops (1 commit)
jatsTemplate (1 commit)
oaiJats (1 commit)
plagiarism (1 commit)

pt locale (asked Emma about it):

./plugins/generic/defaultTranslation/pt/locale.po - only 2 keys, both of which exist in pt_PT and pt_BR
./plugins/themes/immersion/locale/pt/locale.po (pt = pt_PT) - pt_PT contains more keys and is newer
./plugins/generic/plagiarism/locale/pt/locale.po - the same count of keys, but pt contains some English words that are translated in the others
./plugins/generic/customLocale/locale/pt/locale.po - pt contains words in upper case that differ in the others
./plugins/themes/classic/locale/pt/locale.po - copied pt/locale.po to pt_PT/locale.po -- it seems to be more accurate

Dec 05 '24 13:12 bozana

I don't intend to reopen old debates, but I would like to be clear about some things we talked about, because after reading the full thread they are still not entirely clear to me:

Is the list of languages directly from weblate (to avoid management burden) or do we keep a replica in our repo (to have autonomy if we want to define our own codes)?
Do we have a universal standard for languages and dialects? For example, when we use FR, PT, ES and EN, is the 2-character code considered to be an alias for a specific dialect (fr_CA, pt_BR, es_ES and en_US) or does it work differently depending on whether it is FR, PT, ES and EN?

I suspect we are taking it from Weblate, and that the 2-letter language codes are always an alias for the most translated dialect, but I'd like to be sure.

Thank you for all the work.

Dec 05 '24 15:12 marcbria

Thanks @marcbria!

Regarding the first: yes, we are taking it from Weblate. Regarding the second: here, too, we align with Weblate, but it is also what our translators are using. So we have fr, es and en, as in Weblate, and our fr translations are effectively fr_FR, es is es_ES and en is en_US. We have pt_PT and pt_BR. Maybe @asmecher would know better.

Dec 05 '24 18:12 bozana

Hi @jonasraoni, I added a few more changes: I considered your comments, the needed language mapping for different services, and adapted the language distribution properly (now that we assume our locale could be any of the Weblate locales). If you would like to take a look at the most recent commits after your first review, please see the notes in this message, to see what I have changed: https://github.com/pkp/pkp-lib/issues/9707#issuecomment-2520304689. Else, I would ask Alec to once again copy the translations, will then rebase and merge everything... Thanks a lot!

Dec 05 '24 18:12 bozana

@marcbria, we're following BCP47 (and therefore RFC 4647 and RFC 5646), as does Weblate, xml:lang (HTML and JATS), etc.

From https://www.w3.org/International/articles/language-tags/:

The golden rule when creating language tags is to keep the tag as short as possible. Avoid region, script or other subtags except where they add useful distinguishing information. For instance, use ja for Japanese and not ja-JP, unless there is a particular reason that you need to say that this is Japanese as spoken in Japan, rather than elsewhere.

I think the only place where we're breaking this rule is pt vs. pt-PT. We're still using pt-PT because that locale at the time we made the switch to shorter forms was badly incomplete and we were concerned about encouraging its broader adoption as pt before it was ready. In retrospect I think we should have just made it like the others at the same time for consistency's sake. I expect we'll standardize Portuguese into pt (Portugal) and pt_BR (Brazil) sooner or later.

I'm confident that this is the right approach, and not following it will increasingly be swimming against the tide.

Dec 05 '24 21:12 asmecher

Funny that Weblate also has pt, pt_PT and pt_BR...

Dec 06 '24 16:12 bozana

pt_PT is not an invalid locale code, but it is overspecific for our purposes and will not behave well with RFC4647 matching algorithms. A web browser in Angola or Mozambique (where Portuguese is an official language) will fail to match pt-MZ or pt-AO as we have no translations for those, so the matching algorithm will fall back to pt and find Portuguese there.

Because pt_BR is translated and meaningfully distinct, a Brazilian Portuguese speaker will find support for their variant.
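The fallback behaviour described here can be sketched as a simplified version of the RFC 4647 "lookup" scheme, which drops subtags from the end of the requested tag until something matches. This is only an illustration, not the code browsers or PKP applications use; the function name and the default-locale handling are assumptions, and wildcard ranges are not handled.

```python
def lookup(language_range: str, available: list[str], default: str) -> str:
    """Simplified RFC 4647 lookup: truncate the range until a tag matches."""
    # Compare case-insensitively and treat "_" and "-" as equivalent.
    tags = {tag.replace("_", "-").lower(): tag for tag in available}
    candidate = language_range.replace("_", "-").lower()
    while candidate:
        if candidate in tags:
            return tags[candidate]
        parts = candidate.split("-")[:-1]
        # RFC 4647 section 3.4: a single-character subtag left at the end
        # after truncation is removed as well.
        while parts and len(parts[-1]) == 1:
            parts.pop()
        candidate = "-".join(parts)
    return default


with_pt = ["en", "pt", "pt_BR"]
with_pt_pt = ["en", "pt_PT", "pt_BR"]
```

With `with_pt`, a request for pt-MZ falls back to pt and pt-BR still finds pt_BR. With `with_pt_pt`, the same pt-MZ request finds no Portuguese at all and ends up at the default, which is the mismatch described above.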

(It's not just the language subtags that are treated this way -- all subtags are, including character sets. We wouldn't want to specify Latin as the character set for English, for example.)

Our Weblate install probably has both pt and pt_PT because it prefers pt and guides translators more smoothly to create it, despite the fact that we're still using pt_PT, unfortunately.

Dec 06 '24 17:12 asmecher

So how about we change pt_PT to pt now, while we are already making changes to the locales?

Dec 09 '24 09:12 bozana

Yes please, I think this is a good time to standardize with pt instead of pt_PT!

Dec 09 '24 16:12 asmecher