lingua-rs

Detect multiple languages in mixed-language text

Open pemistahl opened this issue 3 years ago • 1 comments

Currently, for a given input string, only the most likely language is returned. However, if the input contains contiguous sections in multiple languages, it would be desirable to detect all of them and return an ordered sequence of items, where each item consists of a start index, an end index and the detected language.

Input: He turned around and asked: "Entschuldigen Sie, sprechen Sie Deutsch?"

Output:

[
  {"start": 0, "end": 27, "language": ENGLISH}, 
  {"start": 28, "end": 69, "language": GERMAN}
]
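For illustration only, the output items above could be modelled as a small Rust type. This is a sketch, not lingua-rs's API: the names `Language`, `DetectionSpan` and `slice` are hypothetical, the indices are treated as byte offsets, and `end` is taken to be inclusive because that is what the numbers in the example imply.

```rust
// Hypothetical types for illustration; these are not taken from lingua-rs.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Language {
    English,
    German,
}

/// One contiguous single-language section of the input.
/// `start` and `end` are byte offsets, and `end` is inclusive,
/// matching the numbers in the example output above.
#[derive(Debug, Clone, PartialEq, Eq)]
struct DetectionSpan {
    start: usize,
    end: usize,
    language: Language,
}

impl DetectionSpan {
    /// Returns the part of `text` covered by this span.
    /// Panics if the offsets are out of range or not on char boundaries.
    fn slice<'a>(&self, text: &'a str) -> &'a str {
        &text[self.start..=self.end]
    }
}

fn main() {
    let text = "He turned around and asked: \"Entschuldigen Sie, sprechen Sie Deutsch?\"";
    let spans = [
        DetectionSpan { start: 0, end: 27, language: Language::English },
        DetectionSpan { start: 28, end: 69, language: Language::German },
    ];
    for span in &spans {
        println!("{:?}: {:?}", span.language, span.slice(text));
    }
}
```

With inclusive ends, the first span includes the space after the colon and the second span covers the quoted German sentence, quotation marks included.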

pemistahl, Nov 15 '20 21:11

This would be beneficial for text written by minority indigenous speakers who comfortably switch back and forth between their native language and the majority national language, e.g. Maori <-> English (New Zealand) or Navajo <-> English (American Southwest).

eekkaiia, Nov 24 '20 13:11

Hi! Have you thought about how to solve this? Here is what I had in mind; let me know whether it is going in the right direction.

We can use the fact that the detected language will be the one with the fewest out-of-place n-grams.

  1. We can use a window size of P words (discussed later) and first divide the string into overlapping windows of P words each.

For example, with P=3 the string "My name is Alex and I like to play football" yields the windows [My name is, name is Alex, is Alex and, Alex and I, and I like, ...].

We can then apply n-gram categorization to each window to get its most likely language.

  2. Next, we can reduce P to P-1, P-2, P-3, or by whatever interval, and repeat the same process in order to "squeeze" out the location where the language changes.
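The two steps above can be sketched roughly as follows. This is only an illustration under loud assumptions: `classify` is a toy stand-in that counts hits in two tiny hand-written word lists instead of doing real n-gram categorization, and the names `Lang`, `label_windows` and `squeeze` are hypothetical, not part of lingua-rs.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Lang {
    English,
    German,
}

// Toy stand-in for n-gram categorization (an assumption for illustration):
// counts how many words of the window appear in two tiny word lists.
fn classify(window: &[&str]) -> Option<Lang> {
    const EN: [&str; 6] = ["he", "turned", "around", "and", "asked", "the"];
    const DE: [&str; 5] = ["entschuldigen", "sie", "sprechen", "deutsch", "und"];
    let (mut en, mut de) = (0usize, 0usize);
    for word in window {
        // Strip surrounding punctuation and ignore case.
        let w = word
            .trim_matches(|c: char| !c.is_alphabetic())
            .to_lowercase();
        if EN.iter().any(|&e| e == w.as_str()) {
            en += 1;
        }
        if DE.iter().any(|&d| d == w.as_str()) {
            de += 1;
        }
    }
    match en.cmp(&de) {
        std::cmp::Ordering::Greater => Some(Lang::English),
        std::cmp::Ordering::Less => Some(Lang::German),
        std::cmp::Ordering::Equal => None, // tie or no evidence
    }
}

/// Step 1: label every window of `p` consecutive words (stride 1).
/// `p` must be at least 1.
fn label_windows(words: &[&str], p: usize) -> Vec<Option<Lang>> {
    words.windows(p).map(|w| classify(w)).collect()
}

/// Step 2: shrink the window size from `p` down to 1. For each size, report
/// the index of the first window whose label differs from the previous
/// window's label (an undecided window counts as a different label).
fn squeeze(words: &[&str], p: usize) -> Vec<(usize, Option<usize>)> {
    (1..=p.min(words.len()))
        .rev()
        .map(|size| {
            let labels = label_windows(words, size);
            let change = labels
                .windows(2)
                .position(|pair| pair[0] != pair[1])
                .map(|i| i + 1);
            (size, change)
        })
        .collect()
}

fn main() {
    let text = "He turned around and asked: \"Entschuldigen Sie, sprechen Sie Deutsch?\"";
    let words: Vec<&str> = text.split_whitespace().collect();
    for (size, change) in squeeze(&words, 3) {
        println!("P={size}: first label change at window {change:?}");
    }
}
```

At window size 1 the window index equals the word index, so the last entry points at the first word of the second language. With the toy word lists this narrows the change from window 4 (P=3 and P=2) to word 5 (P=1), which is the first German word.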

This is just a rough idea; let me know whether it makes sense.

HridayM25, Mar 07 '23 01:03

Hi @HridayM25, I've just implemented something similar to your approach. Feel free to try it out and let me know about any improvements you would add.

pemistahl, Apr 02 '23 12:04