tesseract
In-memory version traineddata doesn't work for multiple traineddata
Basic Information
Windows 10, Tesseract v4. If I pass a single traineddata buffer (const char*), it works, but it fails for multiple traineddata. By the way, multiple traineddata works if I pass the path to the traineddata.
I append the two traineddata files together and pass the result to Init. The LoadMemBuffer call (bool readsuccesfully_ = mgr.LoadMemBuffer(language, data, data_size);) reads the first one and parses it, and entries_[TESSDATA_VERSION] is populated correctly. When it proceeds to the second traineddata, it treats it as a path to the traineddata, tries to locate the traineddata on disk, and fails to initialize correctly.
// In-memory version reads the traineddata file directly from the given
// data[data_size] array. Also implements the version with a datapath in data,
// flagged by data_size = 0.
int TessBaseAPI::Init(const char* data, int data_size, const char* language,
                      OcrEngineMode oem, char** configs, int configs_size,
                      const GenericVector<STRING>* vars_vec,
                      const GenericVector<STRING>* vars_values,
                      bool set_only_non_debug_params, FileReader reader) {
Operating System
Windows 10
Other Operating System
Windows 10, tesseract Windows SDK Version 10.0.17763.0
uname -a
No response
Compiler
MSVC 2017, visual studio version 15.9.25
Virtualization / Containers
No response
CPU
Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz 3.70
Current Behavior
No response
Expected Behavior
No response
Suggested Fix
No response
Other Information
No response
Well, the in-memory API was not designed to work with multiple traineddata.
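The behavior reported in the issue can be modeled roughly as follows. This is a minimal sketch, not Tesseract code: `LoadStep` and `PlanLoads` are hypothetical names, and the rule "only the first language is read from the buffer" is taken from the reporter's description rather than from the library source.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical model (not a Tesseract type): where one language's data
// would come from when a '+'-joined language string is used.
struct LoadStep {
  std::string language;
  bool from_memory;  // true: caller's buffer; false: looked up on disk
};

// Split a '+'-joined language string and mark only the first entry as
// served from the in-memory buffer, mirroring the reported behavior.
std::vector<LoadStep> PlanLoads(const std::string& languages) {
  std::vector<LoadStep> steps;
  std::istringstream stream(languages);
  std::string name;
  while (std::getline(stream, name, '+')) {
    if (name.empty()) continue;  // tolerate "a++b" or a trailing '+'
    steps.push_back({name, steps.empty()});
  }
  return steps;
}
```

For a string such as "ara_font1+ara_font2" (illustrative names), the second entry is marked as a disk lookup, which is why a concatenated buffer does not help.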
@amitdo, thanks for clearing this up. I was struggling to get it to work for a couple of days. Is there any way that we can load additional traineddata after Init(...)? Thanks.
is there any way that we can load additional traineddata after Init(...)?
No.
However, depending on which scripts you are trying to combine, a traineddata file from the script directory may be used as a replacement.
https://github.com/tesseract-ocr/tessdata_fast/tree/main/script
For example: eng+fra => Latin
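A small sketch of that substitution, assuming a hand-maintained lookup table. `ScriptModelFor` is a hypothetical helper, and the table holds only the eng/fra => Latin pair from the example above; any other entries would have to be checked against the script directory.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical helper: return the name of a single script model that could
// replace a language combination, or an empty string if the list is empty,
// contains an unknown language, or mixes scripts.
std::string ScriptModelFor(const std::vector<std::string>& languages) {
  // Illustrative table only; extend it after checking the script directory.
  static const std::map<std::string, std::string> kScriptOf = {
      {"eng", "Latin"}, {"fra", "Latin"}};
  std::string script;
  for (const std::string& lang : languages) {
    const auto it = kScriptOf.find(lang);
    if (it == kScriptOf.end()) return "";  // unknown language
    if (!script.empty() && script != it->second) return "";  // mixed scripts
    script = it->second;
  }
  return script;
}
```

A single script model fits in one buffer, so it could be passed to the in-memory Init without needing a second traineddata.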
@amitdo, thank you so much for the response. I trained on multiple Arabic fonts. Depending on the fonts in the underlying image, I use a combination of two or more fonts (traineddata) together to reach acceptable accuracy. I am going to spend a couple of days looking for a workaround. As you said, and as I noticed, the API doesn't support that. I want to see exactly how passing the path works, and maybe I can do the same for a memory buffer. All the constructors eventually call the memory-buffer constructor, and I figured out the LoadMemBuffer function and how it processes the traineddata, but whatever happens after that is challenging.
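Since the issue reports that multiple traineddata do work when a path is passed, one possible workaround that needs no library changes is to write the in-memory buffers to a directory and use the path-based Init. The sketch below covers only the writing step. `TrainedDataBuffer` and `SpillToDirectory` are hypothetical names, the target directory is assumed to exist already, and this is not the approach the reporter later describes.

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical holder for one in-memory model; not a Tesseract type.
struct TrainedDataBuffer {
  std::string language;     // e.g. "ara_font1" (illustrative name)
  std::vector<char> bytes;  // raw contents of the .traineddata file
};

// Write each buffer to <directory>/<language>.traineddata and return the
// '+'-joined language string. Returns an empty string on any write failure.
std::string SpillToDirectory(const std::vector<TrainedDataBuffer>& models,
                             const std::string& directory) {
  std::string languages;
  for (const TrainedDataBuffer& model : models) {
    const std::string path = directory + "/" + model.language + ".traineddata";
    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    if (!out) return "";
    out.write(model.bytes.data(),
              static_cast<std::streamsize>(model.bytes.size()));
    out.close();
    if (!out) return "";  // write or close failed
    if (!languages.empty()) languages += '+';
    languages += model.language;
  }
  return languages;
}
```

The returned string and the directory could then be passed as the language and datapath arguments of the path-based TessBaseAPI::Init. The trade-off is that the data touches the disk, which may defeat the purpose of the in-memory API in some deployments.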
Hi @amitdo, I know this is kind of late, but I managed to pass multiple traineddata buffers, and it works. If you are interested, I can explain my approach or provide code snippets.
Hi @arvintnn,
Did you change Tesseract's code? If the answer is yes, the preferred way to send code changes is via pull request. Be aware that it's not guaranteed that your PR will be accepted.
If you somehow found a way to solve the issue without changing Tesseract's code, please share the method here.
Hi Amit, as far as I remember, I created a constructor that takes a vector of const char* or string, and there were one or two functions where I changed the code. I'll go over the code, clean it up, and open a pull request. Thank you.