go-lang-detector

Take into account most common n-grams.

Open zolastro opened this issue 4 years ago • 6 comments

Take into account only items of mapA that have a value BIGGER than 300. This if statement does the opposite.

zolastro, Dec 01 '20 08:12

Hey, thanks for the contribution! :)

I believe the condition is correct: it skips items that have a rank bigger than 300, so only the top 300 n-grams are considered. The comment, on the other hand, is wrong. It should say "taking into account only items of mapA that have a rank smaller than 300" or, even better, "skipping items of mapA that have a rank bigger than 300".

chrisport, Dec 01 '20 11:12
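
To make the condition under discussion concrete, here is a minimal sketch in Go of a rank-based comparison that skips every n-gram of mapA ranked beyond a cutoff. It is not the repository's actual code: the function name `distance`, the parameter `maxRank`, the map layout (n-gram to rank, with 1 as the most frequent) and the penalty for missing n-grams are all assumptions made for illustration.

```go
package main

import "fmt"

// distance is a hypothetical out-of-place measure between two rank maps.
// Each map goes from n-gram to rank, where rank 1 is the most frequent n-gram.
// Only n-grams of mapA with a rank of at most maxRank contribute.
func distance(mapA, mapB map[string]int, maxRank int) int {
	total := 0
	for ngram, rankA := range mapA {
		if rankA > maxRank {
			// Skip items of mapA that have a rank bigger than maxRank.
			continue
		}
		rankB, found := mapB[ngram]
		if !found {
			// Assumption: an n-gram missing from mapB costs the full cutoff.
			total += maxRank
			continue
		}
		diff := rankA - rankB
		if diff < 0 {
			diff = -diff
		}
		total += diff
	}
	return total
}

func main() {
	mapA := map[string]int{"de": 1, "la": 2, "que": 3, "zz": 500}
	mapB := map[string]int{"el": 1, "de": 2, "la": 3}

	// "zz" has rank 500 and is ignored with a cutoff of 300.
	fmt.Println(distance(mapA, mapB, 300))
	// With a cutoff of 1000, "zz" is considered as well.
	fmt.Println(distance(mapA, mapB, 1000))
}
```

With a cutoff of 300 the rare n-gram "zz" does not influence the result at all; raising the cutoff lets lower-ranked n-grams contribute, which is the effect the rest of this thread is about.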

Oh! You are right!

However, using only the first 300 n-grams does not work well for me. I've reverted my changes, and these are some of the results I get when comparing a file in Spanish against French, English and Euskara:

[{es 0.99} {fr 0.97} {en 0.96} {eu 0.94}]

Notice that almost every language looks like a plausible candidate. With the threshold set to 3000, however, the winner stands out much more clearly:

[{es 0.73} {fr 0.41} {en 0.38} {eu 0.32}]

I've been testing this on files of different lengths, and they all give the same result: all the languages score almost the same.

zolastro, Dec 01 '20 12:12
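
One way to quantify "the winner is more noticeable" is the gap between the best and the second-best score. The sketch below computes that gap for the two result lists quoted above. The `Score` type and the `margin` function are hypothetical helpers written for this illustration, not part of go-lang-detector.

```go
package main

import (
	"fmt"
	"sort"
)

// Score pairs a language code with its confidence (hypothetical type).
type Score struct {
	Lang  string
	Value float64
}

// margin returns the difference between the highest and the second-highest
// score. It returns 0 when there are fewer than two candidates.
func margin(scores []Score) float64 {
	if len(scores) < 2 {
		return 0
	}
	// Sort a copy so the caller's slice is left untouched.
	sorted := append([]Score(nil), scores...)
	sort.Slice(sorted, func(i, j int) bool {
		return sorted[i].Value > sorted[j].Value
	})
	return sorted[0].Value - sorted[1].Value
}

func main() {
	threshold300 := []Score{{"es", 0.99}, {"fr", 0.97}, {"en", 0.96}, {"eu", 0.94}}
	threshold3000 := []Score{{"es", 0.73}, {"fr", 0.41}, {"en", 0.38}, {"eu", 0.32}}

	fmt.Printf("%.2f\n", margin(threshold300))  // prints 0.02
	fmt.Printf("%.2f\n", margin(threshold3000)) // prints 0.32
}
```

For these figures the gap grows from 0.02 to 0.32 when the threshold goes from 300 to 3000, even though the absolute score of the winner drops.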

That's interesting. Do you have some text samples I could test it with?

How about we make the threshold (300) configurable, so that you can adapt it to your needs?

chrisport, Dec 09 '20 16:12

I think that's a good solution.

Here is an example in Spanish, Euskara (Basque) and English: https://www.atxaga.eus/es/testuak-textos/adan

You can find corpora for training here: http://opus.nlpl.eu/

zolastro, Dec 11 '20 08:12

I've adapted the develop branch: https://github.com/chrisport/go-lang-detector/commit/a4270979d85f9933c4e80e11c9deb26082bd0bc2

Does that work for you? I feel this repo needs some serious refactoring; most of the tools it uses have been archived in the meantime :)

chrisport, Dec 27 '20 12:12

Yes, this looks like a good solution. A default value would be a nice feature to have as well.

Cheers!

zolastro, Feb 16 '21 11:02
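
A configurable threshold with a default could look roughly like the sketch below. This does not describe what the linked commit does: `Config`, `MaxRank`, `DefaultMaxRank` and `effectiveMaxRank` are hypothetical names, and treating a zero or negative value as "unset" is an assumption.

```go
package main

import "fmt"

// DefaultMaxRank is the fallback cutoff, matching the 300 used so far.
const DefaultMaxRank = 300

// Config holds user-tunable settings (hypothetical type).
type Config struct {
	// MaxRank is the highest n-gram rank taken into account.
	// Zero or a negative value means "use DefaultMaxRank".
	MaxRank int
}

// effectiveMaxRank returns the configured cutoff or the default when unset.
func (c Config) effectiveMaxRank() int {
	if c.MaxRank <= 0 {
		return DefaultMaxRank
	}
	return c.MaxRank
}

func main() {
	fmt.Println(Config{}.effectiveMaxRank())              // prints 300
	fmt.Println(Config{MaxRank: 3000}.effectiveMaxRank()) // prints 3000
}
```

Relying on the zero value keeps existing callers working unchanged, while anyone who needs a wider cutoff, such as the 3000 used earlier in this thread, can set it explicitly.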