In-memory version traineddata doesn't work for multiple traineddata

Open arvintnn opened this issue 2 years ago • 8 comments

Basic Information

Windows 10, Tesseract v4. If I pass a single traineddata buffer (const char*) to Init, it works, but it fails for multiple traineddata. By the way, multiple traineddata works if I pass the path to the traineddata instead.

I append the two traineddata buffers together and pass the result to Init. The LoadMemBuffer call (bool readsuccesfully_ = mgr.LoadMemBuffer(language, data, data_size);) reads the first one and parses it, and entries_[TESSDATA_VERSION] is populated correctly. When it proceeds to the second traineddata, however, it treats it as a path to traineddata, tries to locate the file on disk, and fails to initialize correctly.

// In-memory version reads the traineddata file directly from the given
// data[data_size] array. Also implements the version with a datapath in data,
// flagged by data_size = 0.
int TessBaseAPI::Init(const char* data, int data_size, const char* language,
                      OcrEngineMode oem, char** configs, int configs_size,
                      const GenericVector<STRING>* vars_vec,
                      const GenericVector<STRING>* vars_values,
                      bool set_only_non_debug_params, FileReader reader) {
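The behavior described in this report can be modeled with a short sketch. This is a simplified illustration and not Tesseract source code: `Source`, `LoadPlan`, and `PlanLoads` are hypothetical names, and the language names in the usage note are made up. It assumes only what the report states, namely that the language string is split on `+`, the first entry is served from the caller's buffer, and every later entry is looked up on disk.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical model, not Tesseract code: where each language in a
// '+'-joined language string would be loaded from.
enum class Source { kMemoryBuffer, kDiskPath };

struct LoadPlan {
  std::string lang;
  Source source;
};

// Splits "a+b+c" on '+' and assigns the caller's buffer to the first
// non-empty entry only; all later entries fall back to a disk lookup.
std::vector<LoadPlan> PlanLoads(const std::string& languages) {
  std::vector<LoadPlan> plan;
  std::size_t start = 0;
  while (start <= languages.size()) {
    std::size_t end = languages.find('+', start);
    if (end == std::string::npos) end = languages.size();
    if (end > start) {  // skip empty tokens such as in "a++b"
      Source src = plan.empty() ? Source::kMemoryBuffer : Source::kDiskPath;
      plan.push_back({languages.substr(start, end - start), src});
    }
    start = end + 1;
  }
  return plan;
}
```

Under this model, calling `PlanLoads("ara_font1+ara_font2")` marks only `ara_font1` as coming from the memory buffer, which matches the symptom of the second traineddata being searched for on disk even though its bytes were appended to the buffer.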

Operating System

Windows 10

Other Operating System

Windows 10, tesseract Windows SDK Version 10.0.17763.0

uname -a

No response

Compiler

MSVC 2017, visual studio version 15.9.25

Virtualization / Containers

No response

CPU

Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz 3.70

Current Behavior

No response

Expected Behavior

No response

Suggested Fix

No response

Other Information

No response

arvintnn avatar Feb 16 '23 05:02 arvintnn

Well, the in-memory API was not designed to work with multiple traineddata.

amitdo avatar Feb 16 '23 13:02 amitdo

@amitdo, thanks for clearing this up. I was struggling to get it to work for a couple of days. Is there any way that we can load additional traineddata after Init(...)? Thanks

arvintnn avatar Feb 16 '23 16:02 arvintnn

is there any way that we can load additional traineddata after Init(...)?

No.

amitdo avatar Feb 16 '23 18:02 amitdo

However, depending on which scripts you are trying to combine, a traineddata file from the script directory may be used as a replacement.

https://github.com/tesseract-ocr/tessdata_fast/tree/main/script

For example: eng+fra => Latin

amitdo avatar Feb 16 '23 18:02 amitdo

@amitdo, thank you so much for the response. I trained traineddata for multiple Arabic fonts. Depending on the fonts in the underlying image, I combine two or more fonts (traineddata) to reach acceptable accuracy. I am going to spend a couple of days seeing whether there is a workaround. As you said, and as I noticed, the API doesn't support this. I was planning to look at exactly how the path-based version works and see whether I can do the same for a memory buffer. All the constructors eventually call the memory-buffer constructor, and I have figured out the LoadMemBuffer function and how it processes the traineddata, but whatever happens after that is challenging.

arvintnn avatar Feb 16 '23 19:02 arvintnn

Hi @amitdo, I know this is kind of late, but I managed to pass multiple traineddata buffers, and it works. If you are interested, I can explain my approach or provide code snippets.

arvintnn avatar Aug 19 '23 05:08 arvintnn

Hi @arvintnn,

Did you change Tesseract's code? If the answer is yes, the preferred way to send code changes is via pull request. Be aware that it's not guaranteed that your PR will be accepted.

If you somehow found a way to solve the issue without changing Tesseract's code, please share the method here.

amitdo avatar Aug 20 '23 07:08 amitdo

Hi Amit, as far as I remember, I created a constructor that takes a vector of const char* or string, and there were one or two functions where I made a small change to the code. I'll go over the code, clean it up, and issue a pull request. Thank you.

arvintnn avatar Aug 20 '23 21:08 arvintnn
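The code from the last comment was not posted in this thread, so the following is only a guess at what an interface taking several buffers could look like. It is not the author's patch and not part of Tesseract: `TrainedDataBlob`, `FindBlob`, and `JoinLanguages` are hypothetical names, and the sketch assumes each buffer is paired with the language name it should be loaded under.

```cpp
#include <string>
#include <vector>

// Hypothetical type, not part of Tesseract: one in-memory traineddata blob.
struct TrainedDataBlob {
  std::string lang;         // language name the blob is loaded under
  std::vector<char> bytes;  // raw contents of the .traineddata file
};

// Returns the blob supplied for `lang`, or nullptr if there is none, in
// which case a loader could fall back to the usual disk lookup.
const TrainedDataBlob* FindBlob(const std::vector<TrainedDataBlob>& blobs,
                                const std::string& lang) {
  for (const TrainedDataBlob& blob : blobs) {
    if (blob.lang == lang) return &blob;
  }
  return nullptr;
}

// Builds the '+'-separated language string from the blob names.
std::string JoinLanguages(const std::vector<TrainedDataBlob>& blobs) {
  std::string joined;
  for (const TrainedDataBlob& blob : blobs) {
    if (!joined.empty()) joined += '+';
    joined += blob.lang;
  }
  return joined;
}
```

Keeping each buffer separate, rather than appending them into one array as in the original report, would let a loader hand the matching bytes and size to the in-memory path for every language in turn. Whether the actual change works this way is unknown.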