[Meta] Localizations / internationalization overhaul

Open SamantazFox opened this issue 4 years ago • 17 comments

As discussed in #1920, there are many issues regarding localization (l10n, i.e. different locales/translations) and internationalization (i18n, i.e. making sure content can be translated).

This issue will be used to group those many different problems and keep track of what has been fixed. Feel free to report and discuss l10n/i18n issues here.


Open questions

Which format for key strings (strings identifiers)?

This question comes from PR #1629, where new key strings were added ("year", "hour", "video", ...) that do not correspond to the original string ("This year", "Last hour", ...), as was the case for all previous strings.

Two options are available:

  • keep the current format (where the key string is the original US English string)
  • switch to short identifiers (like "last_video", "error_invalid_playlist") to avoid confusion and make strings easier to maintain.
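To make the second option concrete, here is a minimal sketch of a lookup by short identifier with a fallback to en-US. The `locales` table, the `translate` helper and the French string are hypothetical and only for illustration; this is not how Invidious (which is written in Crystal) implements it.

```javascript
// Hypothetical locale tables keyed by short identifiers (illustrative entries only).
const locales = {
  'en-US': {
    playlist_edit: 'Editing playlist `x`',
    error_invalid_playlist: 'Invalid playlist'
  },
  fr: {
    playlist_edit: 'Modification de la playlist `x`'
  }
}

// Look up a key in the requested locale, fall back to en-US, then to the key itself.
function translate(locale, key, x) {
  const table = locales[locale] ?? {}
  const template = table[key] ?? locales['en-US'][key] ?? key
  // Substitute the `x` placeholder only when a value was provided.
  return x === undefined ? template : template.split('`x`').join(String(x))
}

console.log(translate('fr', 'playlist_edit', 'Mix'))
console.log(translate('fr', 'error_invalid_playlist'))
```

With short identifiers, a missing translation falls back to readable English text instead of showing the raw key, and the English wording can change without touching every locale file.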

Planned work

  • [x] Solve discrepancies between locales, and add tests to ensure it won't happen again (Done in #1920)
  • [ ] Use non-translated locale names instead of ISO codes in settings (See #1916)
  • [ ] Find strings that are supposed to be translated but aren't in the locale files (https://github.com/iv-org/invidious/pull/1920#issuecomment-809055543, #1497)
  • [ ] Move to a proper standard format for translations (See https://github.com/iv-org/invidious/issues/571#issuecomment-497980577, WIP in #2285)
  • [x] Use the locales' full names instead of ISO codes (i.e. "Arabic" instead of "ar") (See #571) (Done in #2576)
  • [ ] Improve bi-directional text support (Partially done in #2196, requires UI rewrite)
  • [ ] Properly translate dates/numbers (See comments below)

Solved

How to handle singular/plural and other counting forms?

~~The current system is bad, as we don't want a regex in the translation files. We have also to take into account that in some languages, the form for zero, one, two, and more of something can be different.~~

~~The Qt docs have a good article on the subject: https://doc.qt.io/archives/qq/qq19-plurals.html~~

  • We're switching to GNU gettext PO files.
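As an illustration of why a regex is not enough, here is a rough sketch of selecting a plural form by CLDR plural category. It uses JavaScript's built-in `Intl.PluralRules` rather than gettext (which encodes the rule as an expression in the PO header), so treat it as a stand-in for the idea; the `hourForms` table and `pluralize` helper are made up for the example.

```javascript
// Hypothetical message table keyed by CLDR plural category.
const hourForms = {
  en: { one: '`x` hour', other: '`x` hours' },
  pl: { one: '`x` godzina', few: '`x` godziny', many: '`x` godzin', other: '`x` godziny' }
}

// Pick the form matching the plural category of `count` in the given locale.
function pluralize(locale, forms, count) {
  const category = new Intl.PluralRules(locale).select(count)
  const template = forms[category] ?? forms.other
  return template.split('`x`').join(String(count))
}

console.log(pluralize('en', hourForms.en, 1)) // 1 hour
console.log(pluralize('en', hourForms.en, 2)) // 2 hours
console.log(pluralize('pl', hourForms.pl, 2)) // 2 godziny
console.log(pluralize('pl', hourForms.pl, 5)) // 5 godzin
```

English only needs two forms, while Polish needs at least three for whole numbers, and Arabic has separate categories for zero and two.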

What to do with orphaned translation files?

~~The following locales still exist in the source tree, but are either very outdated or poorly translated. Plus, they're not even loaded by invidious: bn_BD, eu, hu-HU, sk, sr_Cyrl.~~

  • Those have been added back to the project.

Mar 29 '21 22:03 SamantazFox

For that last point, the strings "short" and "long" for the filter UI aren't in the locale files.

Apr 07 '21 12:04 syeopite

@syeopite thanks :)

Apr 08 '21 00:04 SamantazFox

For anyone who's not in the Matrix server, it was discussed a while back to move everything to GNU gettext, which should allow us to solve all of these issues.

Jun 17 '21 02:06 syeopite

The way Invidious currently translates dates and numbers is really less than ideal because, of course, languages often have different ways of representing them. For example:

  • Many languages group their numbers by 1,000 instead of 100
  • Symbols used for grouping and decimals differ
  • Counting systems are different
  • Datetime patterns are also not the same
  • Etc
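To illustrate the point about datetime patterns, here is a small sketch that prints the order in which a few locales place the day, month and year. It relies on JavaScript's built-in `Intl.DateTimeFormat`, which is backed by CLDR data; the `datePartOrder` helper is a made-up name, and nothing here reflects what Invidious does.

```javascript
// Return the order of the date fields (day, month, year) used by a locale.
function datePartOrder(locale) {
  const formatter = new Intl.DateTimeFormat(locale, {
    year: 'numeric',
    month: 'numeric',
    day: 'numeric',
    timeZone: 'UTC'
  })
  return formatter
    .formatToParts(new Date(Date.UTC(2021, 7, 1)))
    .filter(part => part.type !== 'literal') // drop separators such as "/" or "."
    .map(part => part.type)
}

for (const locale of ['en-US', 'de-DE', 'ja-JP']) {
  console.log(locale, datePartOrder(locale).join(', '))
}
```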

I'll have to write another library (similar to twitter-cldr-rb) to take care of this before the gettext migration can continue.

Aug 01 '21 06:08 syeopite

On an unrelated note, should the name of Invidious be transliterated to other languages?

Aug 01 '21 06:08 syeopite

I'll have to write another library (similar to twitter-cldr-rb) to take care of this before the gettext migration can continue.

This is, imho, not important. We have no date/currencies to handle, and the few numbers we have are pretty simple numbers (video count in playlist, video shared x minutes/hours/days/... ago, like/view count).

Such a library could be nice in Crystal in general, but it is probably overkill for Invidious. Imho, we should first focus on important things like the backlog of problems (like URL parameter handling, accessibility, various bugs, the database, the API, general code cleaning) and do the (really) minor i18n/l10n changes later!

On an unrelated note, should the name of Invidious be transliterated to other languages?

This is something that we, as developers, shouldn't worry about. The translators will transliterate if required. Though, we should provide the translation string(s) for it!

Aug 02 '21 09:08 SamantazFox

We have no date/currencies to handle

We do actually have dates to handle. In fact, it's actually what prompted me to make my comment above:

https://github.com/iv-org/invidious/blob/3de06174bf178f9a8298119eeb506d8376e24980/locales/en-US.json#L343-L370

And also the description dates (though it currently isn't localized in Invidious).

...the few numbers we have are pretty simple numbers (video count in playlist, video shared x minutes/hours/days/... ago, like/view count).

They're not "simple". They differ from language to language. Here's a rough list of the issues they have:

  • The expanded view count is something along the lines of 605,754, which doesn't apply to all languages. For instance, in Hindi this would actually be 6,05,754
  • The abbreviated view counts are shortened to a syntax in the form of Number (hundred) + Suffix. This also doesn't apply to all languages. Like in Chinese, this would actually be Number (thousand) + Suffix. Thus 680k in English would be 68万
  • Playlist video counts are often in the thousands. We'd obviously group it like 3,141 in English but in languages such as Polish, it'll be 3141 as grouping begins later on.
  • Etc
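The differences listed above can be reproduced with JavaScript's built-in `Intl.NumberFormat`, which draws on CLDR data. This is only a sketch to show the behaviour a Crystal library would need to replicate; `formatCount` is a hypothetical helper name.

```javascript
// Format a count for a locale, either in full or in abbreviated ("compact") form.
function formatCount(locale, count, compact = false) {
  const options = compact ? { notation: 'compact' } : {}
  return new Intl.NumberFormat(locale, options).format(count)
}

// Full view count, playlist video count and abbreviated view count per locale.
for (const locale of ['en-US', 'hi-IN', 'pl-PL', 'zh-CN']) {
  console.log(
    locale,
    formatCount(locale, 605754),
    formatCount(locale, 3141),
    formatCount(locale, 680000, true)
  )
}
```

The grouping size, the separator characters, the point at which grouping starts and the abbreviation step all come from per-locale data rather than from code, which is why a plain string substitution cannot cover them.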

All of these are major issues with internationalization that we'll have to resolve.

Aug 02 '21 16:08 syeopite

We do actually have dates to handle. In fact, it's actually what prompted me to make my comment above:

https://github.com/iv-org/invidious/blob/3de06174bf178f9a8298119eeb506d8376e24980/locales/en-US.json#L343-L370

Those are combined with the 'x' ago string, and afaict, they're properly translated (those are only absolute values and never exceed 12; well, except for years, which can go up to 13 when posted in 2008).

And also the description dates (though it currently isn't localized in Invidious).

We have this: https://github.com/iv-org/invidious/blob/master/locales/en-US.json#L386

But many translators aren't aware of how this should be "translated".

And, uhhhh, it seems that Crystal doesn't take into account the system's locales for days/months names :c https://github.com/crystal-lang/crystal/blob/af095d72d/src/time/format.cr#L63

That's a bigger issue than what I was expecting/aware of, tbh :/

All of these are major issues with internationalization that we'll have to resolve.

mmh, right.


PS: given that the library you mentioned earlier is coded in Ruby, do you think that it's feasible to port it to Crystal? The languages are similar, and it uses lots of YAML, so maybe?

Aug 02 '21 21:08 SamantazFox

Those are combined with the 'x' ago string, and afaict, they're properly translated (those are only absolute values and never exceed 12; well, except for years, which can go up to 13 when posted in 2008).

I personally don't know any but knowing languages, the form of different words describing time likely changes based on the numbers, or whatever other words show time in that sentence. If a language is like that, we wouldn't be able to handle it easily.

And, uhhhh, it seems that Crystal doesn't take into account the system's locales for days/months names :c

Yep afaik Crystal's stdlib has almost nothing for localization.

As for Time::Format, even if it becomes locale-aware one day, it'll likely be based off of the system locale so we wouldn't be able to use it for Invidious.

PS: given that the library you mentioned earlier is coded in Ruby, do you think that it's feasible to port it to Crystal? The languages are similar, and it uses lots of YAML, so maybe?

That was my original plan but I decided against it since:

  1. That repo is ~200mb (142mb without history) and I'd rather not pull something that massive into Invidious or any other project that wishes to use a Crystal implementation
    • And that's with an already reduced dataset. The full CLDR data is ~329mb
  2. It depends on reading a lot of YAML files from disk meaning lengthy IO times on startup
  3. Plural forms depend on Ruby's eval function, which really isn't ideal, especially for Crystal
  4. Nothing seems to be documented internally and I really don't want to spend the time figuring out how everything works

TLDR: it's better and faster to just re-implement everything rather than porting it.

Points 1 and 2 are likely going to be present in any internationalization lib using CLDR data, but they should be reducible via modularization, so I think I'll structure my family of i18n shards like this:

  1. Lens
    • "Main" one
    • Supports reading and using i18n formats
    • Native plural forms for specified format along with a basic CLDR plural form implementation for any format that requires it. (Ex: crystali-i18n)
  2. Lens-Cldr
    • Extension shard to lens with full localization support via CLDR data similar to twitter-cldr-rb
    • Only contains the 20 most spoken languages to reduce size
  3. Optional language pack shards the user can install to extend Lens-Cldr's language support

I'll definitely have to think more about this but either way, I'll have to create something that can utilize CLDR data for internationalization.

Once completed, I think we'd actually be one of the first alternative frontend projects with internationalization that approaches proprietary software!

Aug 03 '21 02:08 syeopite

We're switching to GNU gettext PO files.

We should use MO files; parsing and reading binary files should be much faster than parsing text.

Aug 03 '21 03:08 syeopite

I personally don't know any but knowing languages, the form of different words describing time likely changes based on the numbers, or whatever other words show time in that sentence. If a language is like that, we wouldn't be able to handle it easily.

That's why switching to gettext was important, to support plural forms (1 hour / 2 hours).

And that's with an already reduced dataset. The full CLDR data is ~329mb

Ooofff D:

  2. Lens-Cldr
    • Extension shard to lens with full localization support via CLDR data similar to twitter-cldr-rb
    • Only contains the 20 most spoken languages to reduce size
  3. Optional language pack shards the user can install to extend Lens-Cldr's language support

Imho, if you plan to provide a CLDR shard, from a maintenance and consistency perspective, it would be better to have everything in a single repo (even if it means a huge one) so PRs and issues are all in the same place. Otherwise, I'm pretty sure that lesser-known languages will often be broken or outdated.

The idea would be to manage the "splitting" in the releases: i.e. provide the base program, without data, and then provide different types of extensions (for instance "All", "10-most", "20-most", "Europe", "Asia", "Africa", ...) as zip files, so smaller projects may use a reduced set, and larger ones the full set.

Aug 03 '21 11:08 SamantazFox

Also, having many files on disk is not a real problem, as you will never load the hundreds+ language files at once on startup, and instead load them only if someone is requesting that language (Though, as you said, you can, by default, always load the 10 most spoken languages). Some lesser used languages will probably never be loaded at all.
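A rough sketch of that load-on-request idea, assuming a caller-supplied `loadLocale` function that reads and parses one language file (both `createLocaleStore` and `loadLocale` are hypothetical names, and the in-memory loader below stands in for real file IO):

```javascript
// Cache locale data so each language file is read at most once, and only when requested.
function createLocaleStore(loadLocale, preload = []) {
  const cache = new Map()
  const get = locale => {
    if (!cache.has(locale)) {
      cache.set(locale, loadLocale(locale)) // first request: hit the disk (or any other source)
    }
    return cache.get(locale)
  }
  preload.forEach(get) // optionally warm up the most requested languages at startup
  return { get, loadedLocales: () => [...cache.keys()] }
}

// Stand-in loader that records how many times it is called.
let reads = 0
const store = createLocaleStore(locale => {
  reads += 1
  return { locale }
}, ['en-US'])

store.get('fr')
store.get('fr') // served from the cache, no second read
console.log(reads, store.loadedLocales())
```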

Aug 03 '21 11:08 SamantazFox

That's why switching to gettext was important, to support plural forms (1 hour / 2 hours).

I was actually thinking of some other language feature, unique to time expressions and different from plural forms. Though having slept on it, I don't think it's actually a feature present in languages. So, just ignore everything I said here.

Imho, if you plan to provide...so smaller projects may use a reduced set, and larger ones the full set.

Huh, good idea. Thanks! Though, I'm most likely going to be typing everything out in Crystal to reduce parsing time and to support inheritance.

Aug 03 '21 17:08 syeopite

Though, I'm most likely going to be typing everything out in Crystal to reduce parsing time and to support inheritance.

I've been thinking about that for a bit: wouldn't it be better to use a very generic dataset (maybe YAML, with converters to JSON, TOML, XML and others) that is completely separate from the code itself? We could then provide those datasets independently (under an extremely open license, like CC-BY-SA, or even CC0), allowing developers using other programming languages to re-use them. That removes the painful requirement (for them) of gathering data on the hundreds of human languages and trying to adapt it for their use.

And if such a dataset is already available (it should definitely exist), maybe extract it and make it generic (if the license allows it, and that isn't already the case), or simply use it.

Aug 05 '21 00:08 SamantazFox

...wouldn't it be better to use a very generic dataset (maybe YAML, with converters to JSON, TOML, XML and others) that...

You literally just defined CLDR lol. It's a project from the Unicode Consortium too, so it's fairly complete. All of the data within it are provided through XML with additional JSON bindings.

But as I mentioned above I'd rather not parse them due to lengthy IO time — though perhaps compiling the XML down to a binary format can help — and to support things like inheritance for reducing file sizes.
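For readers unfamiliar with inheritance in this context, here is a simplified sketch: a regional locale only stores what differs from its parent, and lookups walk up a fallback chain. Real CLDR inheritance has extra rules (explicit parent locales, for example) that this ignores, and the `data` table is entirely made up.

```javascript
// Build the fallback chain by truncating subtags from the end, e.g. pt-PT -> pt -> root.
function fallbackChain(locale) {
  const parts = locale.split('-')
  const chain = []
  while (parts.length > 0) {
    chain.push(parts.join('-'))
    parts.pop()
  }
  chain.push('root')
  return chain
}

// Return the first value found along the chain, or undefined when no locale defines the key.
function resolve(data, locale, key) {
  for (const candidate of fallbackChain(locale)) {
    const value = data[candidate]?.[key]
    if (value !== undefined) return value
  }
  return undefined
}

// Hypothetical data: only the differences are stored at each level.
const data = {
  root: { decimal: '.', group: ',' },
  pt: { decimal: ',', group: '.' },
  'pt-PT': { group: ' ' }
}

console.log(fallbackChain('pt-PT'))
console.log(resolve(data, 'pt-PT', 'decimal'), resolve(data, 'pt-PT', 'group'))
```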

Aug 05 '21 00:08 syeopite

You literally just defined CLDR lol. It's a project from the Unicode Consortium too, so it's fairly complete. All of the data within it are provided through XML with additional JSON bindings.

Oooooh, then you could use that?

But as I mentioned above I'd rather not parse them due to lengthy IO time — though perhaps compiling the XML down to a binary format can help — and to support things like inheritance for reducing file sizes.

Well, you could provide two types of releases:

  • One to use with the raw XML files directly from the CLDR project
  • One with data embedded in Crystal code, generated from the raw XML

In both cases, you don't have to keep the data up to date, and users (i.e. other developers) can choose the version that fits their needs :)

Aug 05 '21 16:08 SamantazFox

Out of curiosity, I created some JavaScript code to analyze the locale files and find translated values that are missing an interpolation placeholder that is present in the en-US file. It seems like one of the common issues is that some translations use single quotes instead of backticks.

Here's the data and the code:

JSON data
[
  {
    "fileName": "ar.json",
    "error": "value is missing interpolation",
    "key": "Authorize token for `x`?"
  },
  {
    "fileName": "ar.json",
    "error": "value is missing interpolation",
    "key": "Invidious Private Feed for `x`"
  },
  {
    "fileName": "ar.json",
    "error": "value is missing interpolation",
    "key": "user_created_playlists"
  },
  {
    "fileName": "ar.json",
    "error": "value is missing interpolation",
    "key": "user_saved_playlists"
  },
  {
    "fileName": "ar.json",
    "error": "value is missing interpolation",
    "key": "download_subtitles"
  },
  {
    "fileName": "da.json",
    "error": "value is missing interpolation",
    "key": "channel:`x`"
  },
  {
    "fileName": "da.json",
    "error": "value is missing interpolation",
    "key": "`x` ago"
  },
  {
    "fileName": "da.json",
    "error": "value is missing interpolation",
    "key": "user_saved_playlists"
  },
  {
    "fileName": "es.json",
    "error": "value is missing interpolation",
    "key": "Editing playlist `x`"
  },
  {
    "fileName": "eu.json",
    "error": "value is missing interpolation",
    "key": "Premieres `x`"
  },
  {
    "fileName": "eu.json",
    "error": "value is missing interpolation",
    "key": "`x` ago"
  },
  {
    "fileName": "eu.json",
    "error": "value is missing interpolation",
    "key": "`x` uploaded a video"
  },
  {
    "fileName": "eu.json",
    "error": "value is missing interpolation",
    "key": "Updated `x` ago"
  },
  {
    "fileName": "eu.json",
    "error": "value is missing interpolation",
    "key": "Premieres in `x`"
  },
  {
    "fileName": "eu.json",
    "error": "value is missing interpolation",
    "key": "Delete playlist `x`?"
  },
  {
    "fileName": "eu.json",
    "error": "value is missing interpolation",
    "key": "channel:`x`"
  },
  {
    "fileName": "eu.json",
    "error": "value is missing interpolation",
    "key": "([^.,0-9]|^)1([^.,0-9]|$)"
  },
  {
    "fileName": "eu.json",
    "error": "value is missing interpolation",
    "key": ""
  },
  {
    "fileName": "eu.json",
    "error": "value is missing interpolation",
    "key": "`x` is live"
  },
  {
    "fileName": "eu.json",
    "error": "value is missing interpolation",
    "key": "Authorize token for `x`?"
  },
  {
    "fileName": "eu.json",
    "error": "value is missing interpolation",
    "key": "Editing playlist `x`"
  },
  {
    "fileName": "fr.json",
    "error": "value is missing interpolation",
    "key": "Delete playlist `x`?"
  },
  {
    "fileName": "nb-NO.json",
    "error": "value is missing interpolation",
    "key": "Delete playlist `x`?"
  },
  {
    "fileName": "nb-NO.json",
    "error": "value is missing interpolation",
    "key": "Editing playlist `x`"
  },
  {
    "fileName": "pl.json",
    "error": "value is missing interpolation",
    "key": "`x` is live"
  },
  {
    "fileName": "pl.json",
    "error": "value is missing interpolation",
    "key": "Delete playlist `x`?"
  },
  {
    "fileName": "pl.json",
    "error": "value is missing interpolation",
    "key": "channel:`x`"
  },
  {
    "fileName": "pt-PT.json",
    "error": "value is missing interpolation",
    "key": "Delete playlist `x`?"
  },
  {
    "fileName": "pt-PT.json",
    "error": "value is missing interpolation",
    "key": "Editing playlist `x`"
  },
  {
    "fileName": "pt-PT.json",
    "error": "value is missing interpolation",
    "key": "Premieres in `x`"
  },
  {
    "fileName": "pt-PT.json",
    "error": "value is missing interpolation",
    "key": "Premieres `x`"
  },
  {
    "fileName": "pt-PT.json",
    "error": "value is missing interpolation",
    "key": "channel:`x`"
  },
  {
    "fileName": "pt.json",
    "error": "value is missing interpolation",
    "key": "Delete playlist `x`?"
  },
  {
    "fileName": "pt.json",
    "error": "value is missing interpolation",
    "key": "channel:`x`"
  },
  {
    "fileName": "pt.json",
    "error": "value is missing interpolation",
    "key": "Premieres `x`"
  },
  {
    "fileName": "pt.json",
    "error": "value is missing interpolation",
    "key": "Premieres in `x`"
  },
  {
    "fileName": "pt.json",
    "error": "value is missing interpolation",
    "key": "Editing playlist `x`"
  },
  {
    "fileName": "ro.json",
    "error": "value is missing interpolation",
    "key": "Delete playlist `x`?"
  },
  {
    "fileName": "uk.json",
    "error": "value is missing interpolation",
    "key": "Editing playlist `x`"
  },
  {
    "fileName": "vi.json",
    "error": "value is missing interpolation",
    "key": "Authorize token for `x`?"
  },
  {
    "fileName": "vi.json",
    "error": "value is missing interpolation",
    "key": "`x` uploaded a video"
  },
  {
    "fileName": "vi.json",
    "error": "value is missing interpolation",
    "key": "`x` is live"
  },
  {
    "fileName": "vi.json",
    "error": "value is missing interpolation",
    "key": "Updated `x` ago"
  },
  {
    "fileName": "vi.json",
    "error": "value is missing interpolation",
    "key": "Delete playlist `x`?"
  },
  {
    "fileName": "vi.json",
    "error": "value is missing interpolation",
    "key": "Editing playlist `x`"
  },
  {
    "fileName": "vi.json",
    "error": "value is missing interpolation",
    "key": "Shared `x`"
  },
  {
    "fileName": "vi.json",
    "error": "value is missing interpolation",
    "key": "Invidious Private Feed for `x`"
  },
  {
    "fileName": "vi.json",
    "error": "value is missing interpolation",
    "key": "channel:`x`"
  },
  {
    "fileName": "vi.json",
    "error": "value is missing interpolation",
    "key": "`x` marked it with a ❤"
  }
]
JavaScript code to find missing interpolations
import { readdir, readFile, writeFile } from 'node:fs/promises'

const localesPath = './locales'
const defaultLocale = 'en-US.json'

// Collected problems, written to disk once every locale file has been checked
const errors = []

const defaultData = JSON.parse(await readFile(`${localesPath}/${defaultLocale}`, { encoding: 'utf-8' }))
const defaultKeys = Object.keys(defaultData)

const filesInLocaleDir = await readdir(localesPath)

for (const file of filesInLocaleDir) {
    if (file !== defaultLocale) {
        const fileData = JSON.parse(await readFile(`${localesPath}/${file}`, { encoding: 'utf-8' }))
        const fileDataKeys = Object.keys(fileData)
        addErrors(defaultData, fileData, defaultKeys, fileDataKeys, file)
    }
}

// Await the write so a failure surfaces as an error instead of an unhandled rejection
await writeFile('./node-scripts/locale-errors.json', JSON.stringify(errors, null, 2))

function addErrors(originalData, newData, originalKeys, newKeys, file) {
    newKeys.forEach(fdk => {
        if (!originalKeys.includes(fdk)) {
            // will need to properly handle determining if it's a valid plural first!
            // errors.push({fileName: file, error: 'extra key found', key: fdk})
        } else {
            if (typeof originalData[fdk] === 'object') {
                addErrors(originalData[fdk], newData[fdk], Object.keys(originalData[fdk]), Object.keys(newData[fdk]), file)
            } else if (isMissingInterpolation(originalData[fdk], newData[fdk])) {
                errors.push({fileName: file, error: 'value is missing interpolation', key: fdk })
            }
        }
    })
}

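/**
 * Returns true when a placeholder (`x` or `y`) used in the default string
 * is absent from the translated string.
 * @param {String} defaultValue
 * @param {String} otherValue
 * @returns {Boolean}
 */
function isMissingInterpolation(defaultValue, otherValue) {
    // Skip values that are not plain strings (e.g. a plural object in only one of the files)
    if (typeof defaultValue !== 'string' || typeof otherValue !== 'string') {
        return false
    }
    // Check every placeholder instead of returning after the first one found
    return ['`x`', '`y`'].some(p => defaultValue.includes(p) && !otherValue.includes(p))
}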
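
The single-quote problem mentioned above could be reported separately from a placeholder that is missing altogether. The following is only a sketch of that idea; `classifyPlaceholder` is a hypothetical helper that is not part of the script above, and the sample strings are made up.

```javascript
// Classify how a translated value handles the placeholder found in the default value.
function classifyPlaceholder(defaultValue, otherValue, name = 'x') {
  const expected = '`' + name + '`'
  if (!defaultValue.includes(expected)) return 'not needed'
  if (otherValue.includes(expected)) return 'ok'
  // Placeholder wrapped in straight or curly quotes instead of backticks
  const misquoted = new RegExp(`['‘’"]${name}['‘’"]`)
  return misquoted.test(otherValue) ? 'wrong quotes' : 'missing'
}

console.log(classifyPlaceholder('Delete playlist `x`?', "Delete playlist 'x'?")) // wrong quotes
console.log(classifyPlaceholder('Delete playlist `x`?', 'Delete playlist?')) // missing
```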

Let me know if you'd like me to open a PR to add the code that checks for locale errors, or to update the locales that have errors (I know Weblate isn't synced automatically to the Invidious repo, so updating the values might not be the best idea right now).

Feb 16 '24 02:02 ChunkyProgrammer