
Improve single language detection when words in other languages are quoted

Open schrmh opened this issue 1 year ago • 2 comments

When I enter German sentences that contain quoted Japanese words, lingua sometimes claims the text is 100% Japanese. For example: Wir stoßen an: "かんぱい". Er lächelte. (in English, if you are interested: »We toasted: "kanpai". He smiled«) yields a ConfidenceValue of 1.0 for Japanese, while Wir stoßen an. Er lächelte. has a ConfidenceValue of 0.6014287047855706 for German and 0.0 for Japanese (I included all languages for detection).

The expected result in both cases should be German, maybe with a slight Japanese confidence in the first case since a Japanese word is quoted, but it should not be 100% Japanese.
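To make the expectation concrete, here is a minimal sketch that measures how much of the input is written in each script. It uses only the Python standard library. `script_shares` is a hypothetical helper made up for illustration; it is not part of lingua and is not how lingua computes its confidence values. It also relies on the assumption that the first word of a character's Unicode name is a good enough stand-in for its script.

```python
import unicodedata
from collections import Counter


def script_shares(text):
    """Return the share of alphabetic characters per script.

    Hypothetical helper for illustration only, not lingua's API.
    The script is approximated by the first word of the Unicode
    character name, e.g. "LATIN" or "HIRAGANA".
    """
    counts = Counter()
    for ch in text:
        if ch.isalpha():  # skip digits, punctuation, and whitespace
            script = unicodedata.name(ch, "UNKNOWN").split()[0]
            counts[script] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {script: n / total for script, n in counts.items()}


if __name__ == "__main__":
    print(script_shares('Wir stoßen an: "かんぱい". Er lächelte.'))
```

For the example sentence, 21 of the 25 letters are Latin and only 4 are Hiragana, so a result that is at most slightly Japanese seems more plausible than a confidence of 1.0.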

schrmh, Jan 17 '23 09:01

Thanks for reaching out to me. I will try to improve language detection for inputs like yours, even though it's not a trivial problem to solve.

pemistahl, Jan 19 '23 10:01

@pemistahl If you could point me to the general area of the code, I could look at a few options and test adding this feature.

datatalking, May 01 '23 19:05