run tesseract with multiple languages at one time C#

Open Sam-krd opened this issue 4 years ago • 3 comments

How does tesseract work with multiple languages text?

I installed Tesseract 4.1.1 by Charles Weld from the NuGet package manager, but I can only run the engine with one language file.

Here is my code:

var img = new Bitmap(Open_Image_File.FileName);
var ocr = new TesseractEngine("./tessdata", "eng", EngineMode.LstmOnly);
var page = ocr.Process(img);
txtres.Text = page.GetText();

Could someone help me use two or three languages at the same time, for example English and Arabic together?

Sam-krd avatar Feb 16 '21 12:02 Sam-krd

I’ve never used multiple languages, but the source code for the C++ tesseract project states the following syntax. Contact tesseract-ocr on Google Groups for more information.

// Parse a string of the form [~]<lang>[+[~]<lang>]*.
// Langs with no prefix get appended to to_load, provided they
// are not in there already.
// Langs with ~ prefix get appended to not_to_load, provided they are not in
// there already.

tdhintz avatar Feb 16 '21 13:02 tdhintz
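
To make the rule in that comment concrete: it describes a '+'-separated list of language codes, each optionally prefixed with '~', where unprefixed codes go into a "load" list and '~'-prefixed codes into a "do not load" list, skipping duplicates. Below is a minimal C# sketch of that rule. It is only an illustration of the comment and not Tesseract's own parser; the class and method names (LanguageSpec, Parse) are made up for this example.

```csharp
using System;
using System.Collections.Generic;

public static class LanguageSpec
{
    // Illustrative re-implementation of the rule described in the quoted
    // C++ comment. Hypothetical helper, NOT part of Tesseract or its wrapper.
    public static (List<string> ToLoad, List<string> NotToLoad) Parse(string spec)
    {
        var toLoad = new List<string>();
        var notToLoad = new List<string>();

        // Language codes are separated by '+'; empty entries are skipped.
        foreach (var token in spec.Split(new[] { '+' }, StringSplitOptions.RemoveEmptyEntries))
        {
            // A leading '~' marks a language that should not be loaded.
            bool excluded = token[0] == '~';
            string lang = excluded ? token.Substring(1) : token;
            if (lang.Length == 0) continue;

            // Append only if the code is not already in the target list.
            var target = excluded ? notToLoad : toLoad;
            if (!target.Contains(lang)) target.Add(lang);
        }

        return (toLoad, notToLoad);
    }
}
```

With this sketch, "eng+ara" yields a load list of eng and ara, which is the form to pass as the language argument of TesseractEngine when more than one language is wanted.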

I'm also trying to do this, and it doesn't seem to work for me. I'm creating an engine like this:

new TesseractEngine("someFolder", "rus+eng", EngineMode.LstmOnly);

This results in only Russian characters being read. Using "eng+rus" results in only English characters being read.

Blightbuster avatar Jan 15 '22 16:01 Blightbuster

For me, the issue was that I was using models from tessdata_fast. Using models from tessdata_best solved it.

Blightbuster avatar Jan 18 '22 20:01 Blightbuster
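
Pulling the thread together, here is a rough sketch of a multi-language call for the English and Arabic case from the original question. It assumes the Tesseract NuGet package is referenced and that eng.traineddata and ara.traineddata have been copied into the tessdata folder (from tessdata_best, going by the comment above). The class and method names (MultiLanguageOcr, BuildLanguageString, FindMissingModels, Recognize) are hypothetical, and whether both scripts are actually recognized may still depend on the models and the image.

```csharp
using System;
using System.IO;
using System.Linq;
using Tesseract; // NuGet package "Tesseract" (the wrapper discussed in this thread)

public static class MultiLanguageOcr
{
    // Joins language codes into the "lang1+lang2" form used in this thread.
    public static string BuildLanguageString(params string[] languages)
        => string.Join("+", languages);

    // Returns the codes whose <code>.traineddata file is absent from tessdataDir.
    public static string[] FindMissingModels(string tessdataDir, params string[] languages)
        => languages
            .Where(l => !File.Exists(Path.Combine(tessdataDir, l + ".traineddata")))
            .ToArray();

    // Runs OCR over one image with all requested languages loaded together.
    public static string Recognize(string imagePath, string tessdataDir, params string[] languages)
    {
        // Fail early with a readable message if a model file is missing.
        var missing = FindMissingModels(tessdataDir, languages);
        if (missing.Length > 0)
            throw new FileNotFoundException(
                "Missing traineddata for: " + string.Join(", ", missing));

        using (var engine = new TesseractEngine(
            tessdataDir, BuildLanguageString(languages), EngineMode.LstmOnly))
        using (var img = Pix.LoadFromFile(imagePath))
        using (var page = engine.Process(img))
        {
            return page.GetText();
        }
    }
}
```

A call would look like MultiLanguageOcr.Recognize("scan.png", "./tessdata", "eng", "ara"), where scan.png is a placeholder file name. Checking for the model files first is a design choice for this sketch, so that a missing language shows up as a clear error naming the file.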