Detect multiple languages in mixed-language text

Status: Open. pemistahl opened this issue 2 years ago • 2 comments

Currently, for a given input string, only the most likely language is returned. However, if the input contains contiguous sections in multiple languages, it would be desirable to detect all of them and return an ordered sequence of items, where each item consists of a start index, an end index and the detected language.

Input: He turned around and asked: "Entschuldigen Sie, sprechen Sie Deutsch?"

Output:

[
  {"start": 0, "end": 27, "language": ENGLISH}, 
  {"start": 28, "end": 69, "language": GERMAN}
]

pemistahl, Jan 22 '22 18:01
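
The output format proposed in the issue can be sketched in a few lines of Python. This is only an illustration of the idea, not Lingua's API: `LanguageSpan` and `covers_whole_input` are hypothetical names, and `end` is read here as an inclusive index, which is an assumption (the issue does not say so, but it is the reading under which the two example items cover the 70-character input without a gap).

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class LanguageSpan:
    """Hypothetical container for one detected section (not a Lingua type)."""
    start: int     # index of the first character of the section
    end: int       # index of the last character (assumed inclusive)
    language: str  # language label, e.g. "ENGLISH"

    def text_of(self, source: str) -> str:
        # The end index is treated as inclusive, hence the + 1 when slicing.
        return source[self.start:self.end + 1]


def covers_whole_input(spans: List[LanguageSpan], source: str) -> bool:
    """Check that the spans are ordered, contiguous and cover the full string."""
    expected_start = 0
    for span in spans:
        if span.start != expected_start or span.end < span.start:
            return False
        expected_start = span.end + 1
    return expected_start == len(source)


text = 'He turned around and asked: "Entschuldigen Sie, sprechen Sie Deutsch?"'
spans = [
    LanguageSpan(0, 27, "ENGLISH"),
    LanguageSpan(28, 69, "GERMAN"),
]

for span in spans:
    print(span.language, repr(span.text_of(text)))
```

Under this reading the English item includes the trailing space after the colon, and the German item includes both quotation marks.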

I have tried every single package attempting to find a solution like this and none work well. I will write the most flattering Medium article ever written if you get it working with Lingua :P

jturner116, Sep 16 '22 12:09

Haha, thanks @jturner116. What higher motivation could I wish for? (-; I'm still in the concept phase for this feature but will try to implement some of it as soon as possible.

pemistahl, Sep 16 '22 21:09

@jturner116 I've just released Lingua 1.2.0, which has experimental support for detecting multiple languages in mixed-language text. Perhaps you want to try it. If you do, please let me know what you think about it. Thanks.

pemistahl, Dec 19 '22 23:12
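
For readers who want to try the experimental feature mentioned above, here is a minimal sketch. It assumes the `detect_multiple_languages_of` method and the `start_index`, `end_index` and `language` result fields as documented in the lingua-py README for this release; check the README of your installed version, since the feature is marked experimental. `build_detector` and `language_sections` are hypothetical helper names introduced for this example.

```python
from typing import Any, List, Tuple


def build_detector() -> Any:
    """Build a detector restricted to English and German.

    Requires the lingua-language-detector package. The import is done lazily
    so that language_sections() below can be used without Lingua installed.
    """
    from lingua import Language, LanguageDetectorBuilder

    return LanguageDetectorBuilder.from_languages(
        Language.ENGLISH, Language.GERMAN
    ).build()


def language_sections(detector: Any, text: str) -> List[Tuple[str, str]]:
    """Return (language name, substring) pairs for each detected section."""
    sections = []
    for result in detector.detect_multiple_languages_of(text):
        # Assumption: end_index is exclusive, so it can be used directly
        # as the upper bound of a slice.
        section = text[result.start_index:result.end_index]
        sections.append((result.language.name, section))
    return sections
```

With Lingua installed, `language_sections(build_detector(), text)` can be called on the example sentence from the issue. No expected output is shown here because the section boundaries the experimental detector returns may differ from the ones proposed in the issue.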