ntextcat
How to use your library?
Could you give a small example of using your library?
Environment: Windows 7 x64, Visual Studio 2017.
I installed "ntextcat" through NuGet. I need to determine the language of the text entered in "textBox2.Text" and output the result in "textBox1.Text". The input is expected to include European languages, languages written with Chinese characters (Chinese, Japanese), and others.
I found some sample code, but I get an error on this line:
var identifier = factory.Load("NTextCat 0.2.1.1\\LanguageModels\\Core14.profile.xml");
Code:
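using System;
using System.Linq;
using System.Windows.Forms;
using NTextCat;

namespace rsh
{
    public partial class Form2 : Form
    {
        public Form2()
        {
            InitializeComponent();
        }

        private void button1_Click(object sender, EventArgs e)
        {
            var factory = new RankedLanguageIdentifierFactory();
            // A relative path is resolved against the process working directory.
            var identifier = factory.Load("NTextCat 0.2.1.1\\LanguageModels\\Core14.profile.xml");
            var languages = identifier.Identify(textBox2.Text);
            // FirstOrDefault() returns null when no language was identified.
            var mostCertainLanguage = languages.FirstOrDefault();
            textBox1.Text = mostCertainLanguage != null
                ? mostCertainLanguage.Item1.Iso639_3
                : "unknown";
        }
    }
}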
How to solve the problem?
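The post does not show the actual error message, so the following is only a guess at one common cause: the relative profile path not resolving to an existing file at run time. A minimal sketch, assuming the profile XML is copied next to the executable; `ProfileLocator` and `Resolve` are hypothetical names, not part of NTextCat.

```csharp
using System;
using System.IO;

public static class ProfileLocator
{
    // Hypothetical helper: builds an absolute path to a profile file under
    // baseDirectory and fails early with a clear message if it is missing.
    public static string Resolve(string baseDirectory, string relativePath)
    {
        var fullPath = Path.GetFullPath(Path.Combine(baseDirectory, relativePath));
        if (!File.Exists(fullPath))
        {
            throw new FileNotFoundException(
                "Language profile not found: " + fullPath, fullPath);
        }
        return fullPath;
    }
}
```

It could then be used as factory.Load(ProfileLocator.Resolve(AppDomain.CurrentDomain.BaseDirectory, "Core14.profile.xml")), which makes a missing file show up as an explicit FileNotFoundException with the full path in the message.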
How can I get text in an unsupported language reported as unknown rather than as some other language? For example, "Aţi văzut ce moacă a făcut?" is Romanian, but NTextCat detects it as English.
I don't understand the problem from the description. If your code works correctly, then identifier would contain the language code (for example, eng for English).
Perhaps you are getting an error and could post a screenshot of it?
@mohammad-khoddami, you can check how confident NTextCat is in the language it reports:
var factory = new RankedLanguageIdentifierFactory();
var identifier = factory.Load("Core14.profile.xml");
var languages = identifier.Identify("some text");
var mostCertainLanguage = languages.FirstOrDefault();
var languageCode = mostCertainLanguage.Item1.Iso639_3;
var confidenceLevel = mostCertainLanguage.Item2;
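Building on the snippet above, here is a minimal sketch of reporting "unknown" when the best candidate does not look convincing. It rests on two assumptions: that a lower score means a closer match (which the numbers quoted later in this thread suggest, but which is not confirmed here), and that the cutoff is something you tune on your own texts. `LanguageGuess`, `CodeOrUnknown`, and `maxScore` are hypothetical names, not NTextCat API.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LanguageGuess
{
    // "ranked" holds (language code, score) pairs, best candidate first.
    // Assumption: a LOWER score is a closer match.
    // "maxScore" is an application-chosen cutoff, not a library constant.
    public static string CodeOrUnknown(
        IEnumerable<Tuple<string, double>> ranked, double maxScore)
    {
        var best = ranked.FirstOrDefault();
        if (best == null || best.Item2 > maxScore)
        {
            return "unknown";
        }
        return best.Item1;
    }
}
```

With NTextCat the input could be produced by identifier.Identify(text).Select(t => Tuple.Create(t.Item1.Iso639_3, t.Item2)). A fixed cutoff is the simplest option; comparing the gap between the first and second candidates is another heuristic worth trying.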
How is the confidence level measured? I get values like 3495.569 for a long Spanish text that is detected correctly, but 3924.144 for this Czech text, which is incorrectly detected as English:
Nechť již hříšné saxofony ďáblů rozezvučí síň úděsnými tóny waltzu, tanga a quickstepu.
and 3928.28 for this Bulgarian text, which is incorrectly detected as Russian:
Ах чудна българска земьо, полюшвай цъфтящи жита.
I suppose the models are not very accurate?
I've tried Wiki82.profile.xml and Wiki280.profile.xml, and I get better results with Wiki82.profile.xml, because with Wiki280.profile.xml the texts are often detected as aa.
One thing I've noticed is that the detected language ISO code is not always correct. With Core14.profile.xml I correctly get a three-letter code in mostCertainLanguage.Item1.Iso639_3, but with Wiki82.profile.xml or Wiki280.profile.xml I get a two-letter code there, which is incorrect for an ISO 639-3 field.
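Until the profiles are fixed, one possible workaround is to normalize the returned code on the caller's side. This is only a sketch: `LanguageCodes` and `ToIso639_3` are hypothetical names, and the lookup table is deliberately incomplete (it covers only a few languages mentioned in this thread), so you would extend it for the languages you care about.

```csharp
using System;
using System.Collections.Generic;

public static class LanguageCodes
{
    // Illustrative, incomplete table of ISO 639-1 to ISO 639-3 codes.
    private static readonly Dictionary<string, string> TwoToThree =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            { "en", "eng" }, { "es", "spa" }, { "cs", "ces" },
            { "bg", "bul" }, { "ru", "rus" }, { "ro", "ron" },
        };

    // Returns a three-letter code when one is known; otherwise returns the
    // input unchanged so that unexpected values stay visible to the caller.
    public static string ToIso639_3(string code)
    {
        if (string.IsNullOrEmpty(code)) return code;
        if (code.Length == 3) return code.ToLowerInvariant();
        string mapped;
        return TwoToThree.TryGetValue(code, out mapped) ? mapped : code;
    }
}
```

Calling LanguageCodes.ToIso639_3(mostCertainLanguage.Item1.Iso639_3) would then give a three-letter code for the mapped languages regardless of which profile was loaded.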
@ivanakcheurov
Hello, thank you very much for your work.
May I ask about the profiles as well?
- As was asked above, what do the weight numbers mean? As I understand it, the closer they are to 4000, the less accurate the result, but what is the point at which we can consider a result accurate? 3700? 3500?
- I'm using wiki82.profile.xml, and sometimes I get "simple" or "new" as the detected language for pure English text. What do they mean?
I suppose this library is abandoned. Any luck, @andreyka26-git?