go-lang-detector
English detector fails when checking Czech text
English detector fails when checking Czech text:
package main

import (
	"fmt"

	"github.com/chrisport/go-lang-detector/langdet"
	"github.com/chrisport/go-lang-detector/langdet/langdetdef"
)

var isEnglishDetector langdet.Detector

func isEnglish(text string) bool {
	// Lazily initialise the detector on first use.
	if len(isEnglishDetector.Languages) == 0 {
		fmt.Println("* Init English detector ...")
		isEnglishDetector = langdetdef.NewWithDefaultLanguages()
	}
	if isEnglishDetector.GetClosestLanguage(text) == "english" {
		return true
	}
	return false
}

func main() {
	fmt.Println(isEnglish("do not care about quantity"))
	fmt.Println(isEnglish("V jeho jednomyslném schválení však brání dlouhodobý nesouhlas dvojice zmíněných států. „Slyším tak často z Polska a Maďarska, že nemají problém s právním státem, až bych skoro čekala, že to dokážou tím, že pro to zvednou ruku,“ prohlásila. (ČTK)*"))
	fmt.Println(isEnglish("Jesteśmy przekonani, że właśnie taki rodzaj dziennikarstwa najlepiej pomaga rozumieć to, co dzieje się dookoła nas i stanowi najbardziej wartościowy wkład w rozwój demokracji oraz wartości obywatelskich"))
}
OUTPUT:
* Init English detector ...
true
true
true
That's interesting. The confidence for English, German and French on your snippets is quite high, which means the detector sees these languages as similar for this input. To distinguish them, you would need to set a higher minimum confidence.
If you print the confidence values by using GetLanguages rather than GetClosestLanguage, you will see this:
[{english 90} {french 75} {german 50} {turkish 44} {hebrew 19} {arabic 1} {russian 0} {CJK 0}]
[{english 80} {german 80} {french 79} {turkish 79} {hebrew 73} {arabic 69} {russian 68} {CJK 0}]
[{english 76} {german 76} {french 73} {turkish 72} {hebrew 67} {arabic 62} {russian 61} {CJK 0}]
So you have two options here:
Option 1: Increase the minimum confidence to, say, 85. The detector will then correctly return:
english
undefined
undefined
This works well if your detector only needs to recognise English and you don't care much about Czech.
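To see why a threshold of 85 separates the three snippets, here is a minimal, self-contained sketch that applies a cutoff to the top entries of the confidence lists quoted above. The `result` type and the `closestAbove` function are hypothetical stand-ins written for illustration, not part of the library, and the sketch assumes the detector reports "undefined" whenever the best score falls below the minimum. It uses the 0-100 scale from the printed output; how the minimum confidence is expressed in the library's own API is not shown in this thread, so check the README for that.

```go
package main

import "fmt"

// result mirrors the {name confidence} pairs printed in this thread.
// It is a hypothetical stand-in, not the library's own type.
type result struct {
	Name       string
	Confidence int
}

// closestAbove returns the name of the best-scoring result, or "undefined"
// when even the best score is below minConfidence.
// It assumes results are sorted by descending confidence, as in the lists above.
func closestAbove(results []result, minConfidence int) string {
	if len(results) == 0 || results[0].Confidence < minConfidence {
		return "undefined"
	}
	return results[0].Name
}

func main() {
	// Top three entries of each confidence list quoted in the thread.
	snippets := [][]result{
		{{"english", 90}, {"french", 75}, {"german", 50}},
		{{"english", 80}, {"german", 80}, {"french", 79}},
		{{"english", 76}, {"german", 76}, {"french", 73}},
	}
	for _, r := range snippets {
		fmt.Println(closestAbove(r, 85))
	}
}
```

With a cutoff of 85 only the first snippet (english 90) passes, which matches the english / undefined / undefined result above. Any cutoff at or below 76 would let all three snippets through as English again.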
Option 2: Add the Czech language to the detector. I did so by copying a random Wikipedia article into a text file and analysing it with the library; see the README for how to do that. The result will be:
english
czech
czech
Confidence levels:
[{english 90} {french 75} {Czech 55} {german 50} {turkish 44} {hebrew 19} {arabic 1} {russian 0} {CJK 0}]
[{Czech 94} {english 80} {german 80} {french 79} {turkish 79} {hebrew 73} {arabic 69} {russian 68} {CJK 0}]
[{Czech 76} {english 76} {german 76} {french 73} {turkish 72} {hebrew 67} {arabic 62} {russian 61} {CJK 0}]
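For intuition about what "analysing" a sample text can involve, here is a minimal sketch of building a character-trigram frequency profile from a text sample. This assumes the library works with n-gram profiles, as I understand its README to describe; it is not the library's actual implementation. The function names `trigramCounts` and `topTrigrams` are hypothetical, and the sample string is a fragment of the Czech text from this issue.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// trigramCounts counts overlapping three-character sequences in text.
// It works on runes so that characters such as "ě" or "ś" stay intact.
func trigramCounts(text string) map[string]int {
	runes := []rune(strings.ToLower(text))
	counts := make(map[string]int)
	for i := 0; i+3 <= len(runes); i++ {
		counts[string(runes[i:i+3])]++
	}
	return counts
}

// topTrigrams returns up to n trigrams ordered by descending count,
// with ties broken alphabetically so the order is deterministic.
func topTrigrams(counts map[string]int, n int) []string {
	keys := make([]string, 0, len(counts))
	for k := range counts {
		keys = append(keys, k)
	}
	sort.Slice(keys, func(i, j int) bool {
		if counts[keys[i]] != counts[keys[j]] {
			return counts[keys[i]] > counts[keys[j]]
		}
		return keys[i] < keys[j]
	})
	if len(keys) > n {
		keys = keys[:n]
	}
	return keys
}

func main() {
	// Fragment of the Czech sample from the issue; a real profile
	// would be built from a much longer text such as a full article.
	sample := "že nemají problém s právním státem, že pro to zvednou ruku"
	fmt.Println(topTrigrams(trigramCounts(sample), 5))
}
```

A profile built from a short sample is noisy, which may be one reason scores for unrelated languages can end up close together, so a longer training text is likely to give more separable confidence values.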
I hope this was helpful. Please let me know if I can support you with your specific use case.
Maybe this has something to do with #24