go-lang-detector
Take into account most common n-grams.
The code comment says "Take into account only items of mapA that have a value BIGGER than 300", but this if statement does the opposite.
hey thanks for the contribution! :)
I believe the condition is correct: it skips items that have a rank bigger than 300, so only the top 300 n-grams are considered. The comment, on the other hand, is wrong. It should say "take into account only items of mapA that have a rank smaller than 300", or even better, "skip items of mapA that have a rank bigger than 300".
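For reference, here is a minimal sketch of that rank filter, assuming mapA maps n-grams to their rank; the function and variable names are illustrative, not the repository's actual code:

```go
package main

import "fmt"

// filterTopN keeps only the entries of ranks whose rank is within the
// top n. Entries with a bigger rank are skipped, which matches the
// condition described above.
func filterTopN(ranks map[string]int, n int) map[string]int {
	filtered := make(map[string]int, n)
	for ngram, rank := range ranks {
		if rank > n {
			continue // skip n-grams ranked beyond the cutoff
		}
		filtered[ngram] = rank
	}
	return filtered
}

func main() {
	mapA := map[string]int{"de ": 1, "la ": 2, "xyz": 512}
	fmt.Println(filterTopN(mapA, 300)) // map[de :1 la :2]
}
```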
Oh! You are right!
However, using only the first 300 n-grams does not work well for me. I've reverted my changes and these are some of the results I get when comparing a file in Spanish against French, English and Euskara: [{es 0.99} {fr 0.97} {en 0.96} {eu 0.94}]
Notice that almost every language looks like a possible candidate. With the threshold set to 3000, however, the winner stands out much more clearly:
[{es 0.73} {fr 0.41} {en 0.38} {eu 0.32}]
I've been testing this on files of different lengths, and all of them give the same result: all the languages score almost the same.
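To make the effect of the cutoff concrete, here is a toy overlap-style comparison between two rank maps. It only illustrates how the threshold n changes the score; it is not the library's actual scoring code, and the sample profiles are made up:

```go
package main

import "fmt"

// closeness is a toy score: the fraction of a's top-n n-grams that also
// appear within b's top-n. A larger n lets more n-grams take part in the
// comparison, which changes how close two profiles appear.
func closeness(a, b map[string]int, n int) float64 {
	total, hits := 0, 0
	for ngram, rank := range a {
		if rank > n {
			continue // ignore n-grams beyond the cutoff
		}
		total++
		if otherRank, ok := b[ngram]; ok && otherRank <= n {
			hits++
		}
	}
	if total == 0 {
		return 0
	}
	return float64(hits) / float64(total)
}

func main() {
	es := map[string]int{" de": 1, "la ": 2, "ón ": 3}
	fr := map[string]int{" de": 1, "es ": 2, "la ": 3}
	fmt.Println(closeness(es, fr, 2), closeness(es, fr, 3)) // prints: 0.5 0.6666666666666666
}
```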
That's interesting. Do you have some text samples for me to test with?
How about we make the threshold (300) configurable, so that you can adapt it to your needs?
I think that's a good solution.
Here is an example with Spanish, Euskara (Basque), and English: https://www.atxaga.eus/es/testuak-textos/adan
You can find some corpora for training here: http://opus.nlpl.eu/
I've adapted the develop branch: https://github.com/chrisport/go-lang-detector/commit/a4270979d85f9933c4e80e11c9deb26082bd0bc2
Does that work for you? I feel this repo needs some serious refactoring; most of the tools it uses have been archived in the meantime :)
Yes, this looks like a good solution. A default value would be a nice feature to have as well.
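For illustration, here is a minimal sketch of what a configurable threshold with a default value could look like; the type, field, and function names below are assumptions made for the example, not the actual API of the linked commit:

```go
package main

import "fmt"

// defaultNgramThreshold is an assumed default for the top-N cutoff.
const defaultNgramThreshold = 300

// Detector carries the configurable n-gram rank threshold.
type Detector struct {
	Threshold int
}

// NewDetector returns a Detector using the default threshold.
func NewDetector() *Detector {
	return &Detector{Threshold: defaultNgramThreshold}
}

// WithThreshold overrides the default cutoff, e.g. 3000 as in the
// experiment above.
func (d *Detector) WithThreshold(n int) *Detector {
	d.Threshold = n
	return d
}

func main() {
	d := NewDetector().WithThreshold(3000)
	fmt.Println("comparing the top", d.Threshold, "n-grams")
}
```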
Cheers!