How to use your library?

Open it19862 opened this issue 6 years ago • 6 comments

Could you give a small example of using your library?

Environment: Windows 7 x64, Visual Studio 2017.

I installed "ntextcat" through NuGet. I need to determine the language of the text entered in "textBox2.Text" and output the result in "textBox1.Text". The input is expected to include European languages, languages written with ideographic characters (Chinese, Japanese), and others.

I found some sample code, but I get an error on this line: var identifier = factory.Load("NTextCat 0.2.1.1\\LanguageModels\\Core14.profile.xml");

Code:

using System;
using System.Linq;
using System.Windows.Forms;
using NTextCat;

namespace rsh
{
    public partial class Form2 : Form
    {
        public Form2()
        {
            InitializeComponent();
        }

        private void button1_Click(object sender, EventArgs e)
        {
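            var factory = new RankedLanguageIdentifierFactory();
            // Relative path: resolved against the process's current working directory.
            var identifier = factory.Load("NTextCat 0.2.1.1\\LanguageModels\\Core14.profile.xml");
            var languages = identifier.Identify(textBox2.Text);
            var mostCertainLanguage = languages.FirstOrDefault();

            // FirstOrDefault returns null when no candidate is produced.
            textBox1.Text = mostCertainLanguage != null
                ? mostCertainLanguage.Item1.Iso639_3
                : "unknown";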
        }
    }
}

How can I solve this problem?

[Screenshot: 2018-10-14_18-48-10]

it19862, Oct 14 '18 15:10
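
The post does not say what the error is, so the following is only something worth ruling out: a relative path passed to factory.Load is resolved against the current working directory, which is not always the folder containing the executable. A minimal sketch, assuming the profile file has been copied next to the executable; ProfileLocator and Resolve are hypothetical helper names, not part of NTextCat:

```csharp
using System;
using System.IO;

public static class ProfileLocator
{
    // Hypothetical helper: turns a path relative to the application's base
    // directory into an absolute path and fails early if the file is missing.
    public static string Resolve(string relativePath)
    {
        string fullPath = Path.GetFullPath(
            Path.Combine(AppDomain.CurrentDomain.BaseDirectory, relativePath));

        if (!File.Exists(fullPath))
        {
            // The exception message carries the absolute path that was tried.
            throw new FileNotFoundException(
                "Language profile not found: " + fullPath, fullPath);
        }

        return fullPath;
    }
}
```

It could then be used as factory.Load(ProfileLocator.Resolve("Core14.profile.xml")), so a missing file is reported together with the absolute path that was actually tried.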

How can text in an unsupported language be detected as unknown rather than as another language? For example, "Aţi văzut ce moacă a făcut?" is Romanian, but NTextCat detects it as English.

mohammad-khoddami, Apr 06 '20 17:04

I don't understand the problem from the description. If your code works correctly, then textBox1.Text would contain the language code (for example, eng for English). Perhaps you get an error and could post a screenshot of it?

ivanakcheurov, May 06 '20 16:05

@mohammad-khoddami, you can check how confident NTextCat is in the detected language by reading the score returned alongside it:

var factory = new RankedLanguageIdentifierFactory();
var identifier = factory.Load("Core14.profile.xml");
var languages = identifier.Identify("some text");
var mostCertainLanguage = languages.FirstOrDefault();

var languageCode = mostCertainLanguage.Item1.Iso639_3;
var confidenceLevel = mostCertainLanguage.Item2;

ivanakcheurov, May 06 '20 16:05
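
One way to turn that score into an "unknown" result is to reject the top candidate when its score is poor or when it barely beats the runner-up. The sketch below is not part of NTextCat: LanguageGuard, PickOrUnknown, maxScore and minMargin are hypothetical names, and it assumes that lower scores mean a closer match, which is how a later comment in this thread reads the numbers. Suitable thresholds would have to be found by experiment.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LanguageGuard
{
    // Returns the best candidate's code, or "unknown" when the best score is
    // above maxScore or within minMargin of the second-best score.
    // Assumption: LOWER scores mean a closer match.
    public static string PickOrUnknown(
        IEnumerable<Tuple<string, double>> candidates,
        double maxScore,
        double minMargin)
    {
        // Sort explicitly so the result does not depend on the input order.
        var ranked = candidates.OrderBy(c => c.Item2).Take(2).ToList();
        if (ranked.Count == 0) return "unknown";

        var best = ranked[0];
        if (best.Item2 > maxScore) return "unknown";

        // An ambiguous result: the runner-up is almost as close as the winner.
        if (ranked.Count > 1 && ranked[1].Item2 - best.Item2 < minMargin)
            return "unknown";

        return best.Item1;
    }
}
```

The candidates could be built from the call shown above, for example identifier.Identify(text).Select(t => Tuple.Create(t.Item1.Iso639_3, t.Item2)).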

How is the confidence level measured? I get values like 3495.569 for a long Spanish text that is detected properly

But I get values like 3924.144 for text in Czech which is incorrectly detected as English

Nechť již hříšné saxofony ďáblů rozezvučí síň úděsnými tóny waltzu, tanga a quickstepu.

or 3928.28 for text in Bulgarian which is incorrectly detected as Russian

Ах чудна българска земьо, полюшвай цъфтящи жита.

I suppose the models are not too accurate?


I've tried with Wiki82.profile.xml and Wiki280.profile.xml and I get better results with Wiki82.profile.xml because with Wiki280.profile.xml the texts are often detected as aa.

One thing I've noticed is that the detected language ISO code is not always correct. With Core14.profile.xml I correctly get a 3-letter code in mostCertainLanguage.Item1.Iso639_3, but with Wiki82.profile.xml or Wiki280.profile.xml I get a 2-letter code there (which is incorrect).

diegosasw, Apr 12 '22 10:04

@ivanakcheurov

Hello, thank you very much for your work.

May I ask about the profiles as well?

  1. As was asked above, what do the weight numbers mean? As I understand it, the closer they are to 4000, the less accurate the result is, but where is the threshold at which we can consider a result accurate: 3700, 3500?

  2. I'm using wiki82.profile.xml, and sometimes I get "simple" or "new" as the detected language for pure English text. What do they mean?

andreyka26-git, Oct 23 '23 11:10
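
Until the profiles themselves are clarified, the mixed codes reported above can be normalized on the caller's side. This is only a sketch: LanguageCodes and Normalize are hypothetical names, the table is deliberately tiny, and mapping "simple" to eng rests on the assumption that the Wiki profiles are named after Wikipedia editions (Simple English). Note that "new" is itself a valid ISO 639-3 code (Newari), so the sketch passes it through unchanged.

```csharp
using System;
using System.Collections.Generic;

public static class LanguageCodes
{
    // Small illustrative table from 2-letter (ISO 639-1) codes to ISO 639-3.
    private static readonly Dictionary<string, string> ToIso639_3 =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            { "en", "eng" }, { "es", "spa" }, { "cs", "ces" },
            { "bg", "bul" }, { "ru", "rus" }, { "ro", "ron" },
            // Assumption: "simple" stands for the Simple English Wikipedia.
            { "simple", "eng" }
        };

    // Returns a 3-letter code when the input is in the table; otherwise
    // returns the input trimmed and lower-cased, or null for empty input.
    public static string Normalize(string code)
    {
        if (string.IsNullOrWhiteSpace(code)) return null;

        string trimmed = code.Trim();
        string mapped;
        return ToIso639_3.TryGetValue(trimmed, out mapped)
            ? mapped
            : trimmed.ToLowerInvariant();
    }
}
```

For example, LanguageCodes.Normalize(mostCertainLanguage.Item1.Iso639_3) would give the same 3-letter form whether the code came from a Core14 or a Wiki profile, for the languages listed in the table.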

I suppose this library is abandoned. Any luck, @andreyka26-git?

diegosasw, Nov 29 '23 15:11