Local Dictionary: be clear about language tag processing

Open domenic opened this issue 4 months ago • 5 comments

Language tags are complicated. If my page is set to lang="en-US", and I add dictionary entries for the following languages, what happens?

  • en
  • en-us (lowercase)
  • en-GB
  • en-asdf
  • en-US-x-lolcat
  • en-US-oed
  • en-Latn
  • en-Brai

Exact string matching is generally not the correct approach for this sort of thing.

We struggled with this in various AI-related APIs (see https://github.com/webmachinelearning/translation-api/issues/11) and ended up with some of the specification infrastructure in https://webmachinelearning.github.io/writing-assistance-apis/#supporting-language-tags and https://webmachinelearning.github.io/writing-assistance-apis/#supporting-language-availability , which might be helpful.

domenic avatar Jul 22 '25 02:07 domenic

@domenic Thank you for your significant input and for sharing the related issue. I'm sorry for the late reply. I need to think more about how to handle the language tag.

To make sure I understand the question:

Language tags are complicated. If my page is set to lang="en-US", and I add dictionary entries for the following languages, what happens?

Did "dictionary entries" here refer to the word that will be added to the local dictionary?

I think that processing the language tag for the Local dictionary, whose syntax is as below, would be

addWord(DOMString word, DOMString language)

Let languageTags be an empty ordered set.

  1. If language is specified, then languageTags is language.
     1-1. The result of validating and canonicalizing a single language tag is added to languageTags.
  2. Otherwise:
     2-1. If the language type of word matches the language option set in the page (lang="en-US"), languageTags is the language option set in the page.
     2-2. Otherwise, languageTags is the language tag without the subtag.

The purpose of the Local Dictionary is to handle proper nouns and expressions in general use, so the words in the Local Dictionary can apply to a wider scope of languages.

For example, if the page is set to lang="en-US", the language tags would be:

  • document.dictionary.add("colour", "en-UK") : en-UK
  • document.dictionary.add("colour", "en-US") : en-US
  • document.dictionary.add("colour", "en") : en-US, en-UK
  • document.dictionary.add("colour") : en
  • document.dictionary.add("color") : en-US

jihyerish avatar Jul 28 '25 12:07 jihyerish

Did "dictionary entries" here refer to the word that will be added to the local dictionary?

Yes. For example, what happens if I do document.dictionary.add("asdf", "en-asdf") on a page with lang="en-US".

I think that processing the language tag for the Local dictionary, whose syntax is as below, would be

For the 1 branch, for my example of document.dictionary.add("asdf", "en-Latn"), this sets languageTags to the set containing "en-Latn", i.e., English written in the Latin script. So, will it be used for spellchecking on the en-US page, or not?

If the language type of word

What algorithm do you use to determine the language type of word?

domenic avatar Jul 29 '25 05:07 domenic

I'm sorry for the confusion, but for now, we've decided that the language detection feature is not considered in the local dictionary API. So, I would like to change what I mentioned in the previous comment.

For document.dictionary.add(DOMString word, DOMString language), the process of deciding the language tag would be:

Let languageTags be an empty ordered set.

  1. If language is specified and is a valid language tag according to BCP 47, then languageTags is language.
  2. Otherwise, languageTags is the language option set in the page.

Therefore, if the page is set to lang="en-US", the language tags in the examples below will be:

  • document.dictionary.add("colour", "en-UK") : en-UK
  • document.dictionary.add("colour", "en-US") : en-US
  • document.dictionary.add("colour", "en") : en
  • document.dictionary.add("colour") : en-US
  • document.dictionary.add("color") : en-US
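
The two-step rule above can be sketched in JavaScript roughly as follows. This is only an illustration under stated assumptions: resolveDictionaryLanguage is a hypothetical helper, not part of the proposed API, and it relies on Intl.getCanonicalLocales, which checks only that a tag is structurally well-formed. It does not consult the IANA registry, so a tag such as "en-asdf" (which has the shape of a script subtag) would pass this check; rejecting it would need a separate registry lookup.

```javascript
// Hypothetical helper: decides which dictionary a word would be added to.
function resolveDictionaryLanguage(language, pageLanguage) {
  if (typeof language === "string") {
    try {
      // Throws when the tag is not structurally well-formed;
      // otherwise returns the canonicalized form (e.g. fixed casing).
      return Intl.getCanonicalLocales(language)[0];
    } catch (e) {
      // Not well-formed: fall through to the page language.
    }
  }
  // No usable language argument: use the page's lang value.
  return pageLanguage;
}

console.log(resolveDictionaryLanguage("en", "en-US"));
console.log(resolveDictionaryLanguage("en-us", "en-US"));
console.log(resolveDictionaryLanguage(undefined, "en-US"));
console.log(resolveDictionaryLanguage("not a tag!", "en-US"));
```

Canonicalizing before storing means that "en-us" and "en-US" end up in the same dictionary instead of two separate ones.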

what happens if I do document.dictionary.add("asdf", "en-asdf") on a page with lang="en-US".

The language tag will be "en-US", because "en-asdf" isn't valid.

For the 1 branch, for my example of document.dictionary.add("asdf", "en-Latn"), this sets languageTags to the set containing "en-Latn", i.e., English written in the Latin script. So, will it be used for spellchecking on the en-US page, or not?

No, because it was only added to the "en-Latn" dictionary.

I'm aware that the Language Detection API has shipped as of Chrome 138. I'd like to see how things go with https://webmachinelearning.github.io/translation-api/, and it would be nice to adopt language detection in the future.

jihyerish avatar Aug 21 '25 15:08 jihyerish

Thanks. Your answers all make sense to me, but I am not 100% sure they follow best practices.

In particular, consider the following:

<!DOCTYPE html>
<html lang="en-US">

<script>
document.dictionary.add("term1", "en-US");
document.dictionary.add("term2", "en");
</script>

<textarea>term1 term2</textarea>

<textarea lang="en-UK">term1 term2</textarea>

From what you've said above, the code in the <script> will add term1 to the en-US dictionary, and term2 to the en dictionary.

This means that:

  • Inside the first textarea, term2 will get red squigglies (even though en-US is a subtype of en).
  • Inside the second textarea, both term1 and term2 will get red squigglies. (Even though en-UK is a subtype of en, so maybe term2 should not get red squigglies.)

@aphillips may be able to tell us if my intuitions are correct here.

domenic avatar Aug 22 '25 07:08 domenic

Note: the correct tag for the English used in the UK is en-GB

I'm not sure your intuition is correct, @domenic? You are very correct that language tag matching isn't just string comparison and it isn't just prefix comparison. Usually the "tag" (the item being matched) and the "range" (in this case, the dictionary language) are (at least partially) canonicalized to improve matching (for example, the region subtag UK might be converted to GB or Chinese language tags might have the script subtag added, e.g. zh-CN => zh-Hans-CN). Unicode has an algorithm for this and JS Intl uses the same algorithm. Note that canonicalization can also remove subtags, such as when a language has Suppress-Script in the IANA Language Subtag Registry (en suppresses Latn, so en-Latn-US becomes just en-US)
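
As a rough illustration of why canonicalization matters before comparing tags, the sketch below uses the standard Intl.Locale maximize() method to add likely subtags to both sides before comparing them. This is a swapped-in technique (likely-subtags expansion), not the registry's Suppress-Script mechanism itself, and sameAfterMaximize is a hypothetical helper name; results depend on the locale data shipped with the JavaScript engine.

```javascript
// Hypothetical helper: treats two tags as equal if they expand to the
// same language-script-region combination once likely subtags are added.
function sameAfterMaximize(a, b) {
  const maximize = (tag) => new Intl.Locale(tag).maximize().toString();
  return maximize(a) === maximize(b);
}

// Tags that differ only by an implied script subtag compare as equal.
console.log(sameAfterMaximize("zh-CN", "zh-Hans-CN"));
console.log(sameAfterMaximize("en-Latn-US", "en-US"));
// Different regions remain different.
console.log(sameAfterMaximize("en-US", "en-GB"));
```

A plain string comparison would treat "en-Latn-US" and "en-US" as unrelated, which is the failure mode that exact matching invites.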

The type of matching also matters. Resource lookup is one type of matching, used when you need to select a single resource (locale, image, string, etc.) for the job. Filtering is used when more than one resource might be applied. Dictionary spell checking looks like it might better apply filtering, so in your example, no terms in the first textarea are red squiggled, because both dictionaries were used for checking--the less specific en one first and then the more specific en-US one. Meanwhile, term1 gets squiggles in the second textarea because the en-US dictionary does not match that language tag (but en does, so term2 is fine).
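
The filtering behavior described here can be sketched as follows, assuming the simplest form of filtering (RFC 4647 basic filtering, where a range matches a tag that is equal to it or that starts with it followed by a hyphen). The dictionaries map, rangeMatchesTag, and unknownWords are hypothetical names for illustration only; a real implementation would also canonicalize tags first and might use a more careful matching scheme.

```javascript
// Basic filtering: a range matches a tag when they are equal or the tag
// starts with the range followed by "-" (compared case-insensitively).
function rangeMatchesTag(range, tag) {
  const r = range.toLowerCase();
  const t = tag.toLowerCase();
  return t === r || t.startsWith(r + "-");
}

// Hypothetical in-memory store: dictionary language -> added words.
const dictionaries = new Map([
  ["en-US", new Set(["term1"])],
  ["en", new Set(["term2"])],
]);

// Returns the words that no applicable dictionary contains,
// i.e. the ones that would get red squigglies.
function unknownWords(words, elementLanguage) {
  return words.filter((word) => {
    for (const [range, entries] of dictionaries) {
      if (rangeMatchesTag(range, elementLanguage) && entries.has(word)) {
        return false;
      }
    }
    return true;
  });
}

// First textarea (en-US): both dictionaries apply, nothing is flagged.
console.log(unknownWords(["term1", "term2"], "en-US"));
// Second textarea (en-GB): only the en dictionary applies, term1 is flagged.
console.log(unknownWords(["term1", "term2"], "en-GB"));
```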

The dictionary chain for English looks sort of like this (en-001 is International English, which is used by CLDR and others to mean approximately "UK spelling traditions with local adaptation"):

             [root]
                ▲
                │
               en
            ▲     ▲
            │     │
        en-US  en-001
                  ▲
                  │
               en-GB

The kind of relationship shown above is complicated. There are points at which you don't want to mix independent language varieties together. Hope this helps?
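
A minimal sketch of walking such a chain, assuming the parent links are exactly those drawn in the diagram above. The PARENT table and dictionaryChain are hypothetical; a real implementation would presumably take the parent relationships from CLDR data rather than a hard-coded map.

```javascript
// Parent links copied from the diagram (illustrative, not complete).
const PARENT = new Map([
  ["en-GB", "en-001"],
  ["en-001", "en"],
  ["en-US", "en"],
  ["en", "root"],
]);

// Hypothetical helper: lists a tag followed by each of its ancestors.
function dictionaryChain(tag) {
  const chain = [tag];
  let current = tag;
  while (PARENT.has(current)) {
    current = PARENT.get(current);
    chain.push(current);
  }
  return chain;
}

console.log(dictionaryChain("en-GB").join(" > "));
console.log(dictionaryChain("en-US").join(" > "));
```

Note that the chain for en-GB passes through en-001, which cannot be derived by trimming subtags from the tag, while the chain for en-US never touches en-001, so the two varieties only share entries added at the en level.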

aphillips avatar Aug 24 '25 00:08 aphillips