
Detect multiple languages in mixed-language text

Open pemistahl opened this issue 4 years ago • 5 comments

Currently, for a given input string, only the most likely language is returned. However, if the input contains contiguous sections in multiple languages, it would be desirable to detect all of them and return an ordered sequence of items, where each item consists of a start index, an end index, and the detected language.

Input: He turned around and asked: "Entschuldigen Sie, sprechen Sie Deutsch?"

Output:

```
[
  {"start": 0, "end": 27, "language": ENGLISH},
  {"start": 28, "end": 69, "language": GERMAN}
]
```
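That result shape could be modeled as a small value type. The sketch below is purely illustrative: `DetectionResult` and its fields are hypothetical names, not an existing Lingua type, and the language is kept as a plain string so the example stays self-contained.

```kotlin
// Hypothetical result type mirroring the JSON above; not part of
// Lingua's current public API.
data class DetectionResult(val start: Int, val end: Int, val language: String)

fun main() {
    val results = listOf(
        DetectionResult(0, 27, "ENGLISH"),
        DetectionResult(28, 69, "GERMAN")
    )
    // Each entry covers one contiguous single-language span of the input.
    results.forEach { println(it) }
}
```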

pemistahl avatar May 25 '20 14:05 pemistahl

I think specifically for Lingua the following approach could work. Some of the following points have footnotes describing further considerations. Note that this is not a scientific approach; there may be better and more performant solutions.

  1. Split the text into sections where language switches might occur[^1]; this includes:
    • Unicode script changes [^2][^3]
    • Quotation marks[^4][^5]
    • Colon (:)
    • Line and page breaks[^5]
    • ...
  2. For each section determine the set of languages by rules
    1. Try to detect the language with LanguageDetector.detectLanguageWithRules
    2. Otherwise, try to detect the possible languages with LanguageDetector.filterLanguagesByRules
  3. Merge adjacent sections whose language set has size 1 and which have the same language
  4. For each section determine the confidence values
    • For sections that, after the previous steps, have only a single rule-detected language, the confidence value can be set to 1.0
    • Because accuracy is poor for short texts, merge short sections with subsequent ones if the languages detected by rules permit this (i.e. there must be overlap)[^6]
  5. Merge adjacent sections whose most common languages are also quite common in the respective other section[^7]
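The splitting idea from step 1 can be sketched in self-contained Kotlin. This is only a sketch of the script-change split point (one of several proposed triggers) using the JDK's `Character.UnicodeScript`; the names `Section` and `splitByScript` are illustrative, not Lingua API. Note that, per footnote 2, this would over-split Japanese text, which mixes several scripts.

```kotlin
import java.lang.Character.UnicodeScript

// One contiguous run of a single Unicode script within the input text.
data class Section(val start: Int, val end: Int, val script: UnicodeScript)

// Split text into sections wherever the Unicode script changes.
// COMMON (punctuation, digits, spaces) and INHERITED characters never
// force a split; they are attached to the surrounding section.
fun splitByScript(text: String): List<Section> {
    val sections = mutableListOf<Section>()
    var start = 0
    var current: UnicodeScript? = null
    for ((i, ch) in text.withIndex()) {
        val script = UnicodeScript.of(ch.code)
        if (script == UnicodeScript.COMMON || script == UnicodeScript.INHERITED) continue
        if (current != null && script != current) {
            // Script changed: close the previous section at this index.
            sections.add(Section(start, i, current))
            start = i
        }
        current = script
    }
    if (current != null) sections.add(Section(start, text.length, current))
    return sections
}

fun main() {
    val text = "He asked: Говорите ли вы по-русски?"
    for (s in splitByScript(text)) {
        println("${s.script}: \"${text.substring(s.start, s.end).trim()}\"")
    }
}
```

A full implementation would additionally split on quotation marks, colons, and line breaks, and then run the rule-based filtering and merging from steps 2–5 on the resulting sections.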

I have implemented this in my fork (file MultiLanguageDetection.kt) and it seems to provide fairly reasonable results, though proper nouns sometimes throw it off. When you build with gradlew jarWithDependencies and then start the JAR from the console, you can also try this out with a GUI (which might not follow Swing best practices, though). Please let me know what you think, and which areas of it, or of the general approach outlined above, could be improved. I would also be interested in how you would have approached this problem. I think it would also be possible to port this to Lingua without many changes (in case you are interested).

[^1]: Might need to impose a minimum length (e.g. 3 letters) to avoid splitting off sections that are too small, which can cause issues later on.
[^2]: Requires special casing for languages which use more than one script, e.g. Japanese.
[^3]: Not sure if detecting script changes is always desired; for example, should a proper noun in Latin script (e.g. "GitHub") within a Chinese text really be considered a separate text section?
[^4]: Might need special casing for quotation marks which are also used as apostrophes (' and U+2019), since these can otherwise cause issues when detecting section starts and ends. For example, only consider those characters as quotation marks when not enclosed by letters, or ignore them completely.
[^5]: Checking only the char categories (e.g. CharCategory.INITIAL_QUOTE_PUNCTUATION or CharCategory.LINE_SEPARATOR) does not seem to suffice because they do not contain all relevant characters. Therefore the characters have to be hardcoded, e.g. based on https://en.wikipedia.org/wiki/Newline#Unicode and https://en.wikipedia.org/wiki/Quotation_mark#Unicode_code_point_table
[^6]: Might also have to look ahead to the next section. If that section is long enough for reliable language detection, check whether the current section belongs to the previous one or the next one, to avoid merging it erroneously with the previous one.
[^7]: The confidence value threshold can be determined based on the number of letters in the section. For short texts, languages with lower confidence values (e.g. starting at 0.6) should be considered, whereas for longer texts only languages with confidence values close to 1.0 should be considered.
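The hardcoding mentioned in footnote 5 and the apostrophe heuristic from footnote 4 can be sketched as follows. The character sets below are a partial, hand-picked selection (an assumption based on the linked Wikipedia tables), not a complete list, and `isSectionBreak` is an illustrative helper, not Lingua code.

```kotlin
// Partial, hand-picked sets of section-break characters (footnote 5);
// CharCategory checks alone miss some of these, hence the hardcoding.
val quotationMarks = setOf(
    '"', '\'', '\u2018', '\u2019', '\u201C', '\u201D', '\u201E',
    '\u00AB', '\u00BB'
)
val lineBreaks = setOf('\n', '\r', '\u0085', '\u2028', '\u2029')

// A quotation mark only counts as a section break when it is not
// enclosed by letters on both sides; otherwise it is likely being
// used as an apostrophe (footnote 4), as in "It's".
fun isSectionBreak(text: String, index: Int): Boolean {
    val ch = text[index]
    if (ch in lineBreaks || ch == ':') return true
    if (ch !in quotationMarks) return false
    val letterBefore = text.getOrNull(index - 1)?.isLetter() ?: false
    val letterAfter = text.getOrNull(index + 1)?.isLetter() ?: false
    return !(letterBefore && letterAfter)
}

fun main() {
    val text = "He asked: \"Sprechen Sie Deutsch?\" It's fine."
    val breaks = text.indices.filter { isSectionBreak(text, it) }
    println(breaks.map { text[it] }) // the apostrophe in "It's" is excluded
}
```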

Marcono1234 avatar Aug 29 '22 19:08 Marcono1234

Has this been solved? If so, I would like to see the approach. If not, I have a simple method I can propose.

kargaranamir avatar Feb 11 '24 15:02 kargaranamir

@kargaranamir I've implemented an algorithm for my other implementations of Lingua (Go, Rust, Python) already. I haven't found the time yet to implement it here. So yes, it's generally solved but not yet implemented.

pemistahl avatar Feb 13 '24 08:02 pemistahl

Thanks for the reply @pemistahl.

I just checked the Python version. In the example, the LanguageDetectorBuilder is restricted to three languages, and detection then chooses among them. I wonder: does it still work if I run it on all languages supported by Lingua, and even if I pass monolingual sentences?

kargaranamir avatar Feb 13 '24 12:02 kargaranamir

@kargaranamir This feature is still experimental. The more languages you add to the mix, the more inaccurate the result will be. If you can restrict the number of possible languages beforehand, then do so; this will produce better results in most cases.

pemistahl avatar Feb 14 '24 10:02 pemistahl