tesseract
tesseract copied to clipboard
PrintVariables produces config file that ReadConfigFile does not properly read
Expected Behavior:
The API function PrintVariables
prints current parameters to a file, and ReadConfigFile
reads parameters from a file. Intuitively, ReadConfigFile
should be able to read the files that PrintVariables
writes. This is explicitly assumed within the ProcessPage
function, where these functions are used together to "Save current config variables before switching modes" and then "Restore saved config variables".
https://github.com/tesseract-ocr/tesseract/blob/a873553593988407b7a9043c30c3010e0d066205/src/api/baseapi.cpp#L1293-L1306
Current Behavior:
Unfortunately, this does not currently work properly. The issue is that PrintVariables
prints parameter descriptions alongside key/value pairs (e.g. chs_trailing_punct1 ).,;:?! 1st Trailing punctuation
), and ReadConfigFile
reads the description as a value (for string parameters). An example showing this is below.
#include <tesseract/baseapi.h>
int main()
{
tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();
if (api->Init(NULL, "eng")) {
fprintf(stderr, "Could not initialize tesseract.\n");
exit(1);
}
static const char *kOldVarsFile = "failed_vars.txt";
// Print default value of chs_trailing_punct1
printf("Initial value: %s\n", api->GetStringVariable("chs_trailing_punct1"));
FILE *fp = fopen(kOldVarsFile, "wb");
api->PrintVariables(fp);
fclose(fp);
api->ReadConfigFile(kOldVarsFile);
printf("After PrintVariables/ReadConfigFile: %s\n", api->GetStringVariable("chs_trailing_punct1"));
api->End();
delete api;
return 0;
}
This returns the following:
Initial value: ).,;:?!
After PrintVariables/ReadConfigFile: ).,;:?! 1st Trailing punctuation
The impact of this is:
-
ProcessPage
does not work correctly when used withretry_config
- There is no simple interface for generating a config file with the user's current settings
- This is useful for saving/restoring configurations (as
ProcessPage
attempts to do)
- This is useful for saving/restoring configurations (as
Suggested Fix:
The simplest solution would be to remove the descriptions from the PrintVariables
output (or at least hide that behavior behind an option). I can write a PR if others agree this makes sense. Editing ReadConfigFile
to ignore the descriptions is likely also possible, but could be higher effort.
Environment
Tesseract Version: 5.2.0 Commit Number: https://github.com/tesseract-ocr/tesseract/commit/15200c6fe7733f71a6cf52fbc1e4d94150f9f168 Platform: Linux ubuntu 5.15.0-43-generic
The simplest solution would be to remove the descriptions from the PrintVariables output
We don't want to do this because that function is used by the command line tool and we need the descriptions in that case.
https://github.com/tesseract-ocr/tesseract/blob/0daf18c2028e5217cd996522816c3dd2ec1a4198/src/tesseract.cpp#L721-L724
@amitdo Good point. How about adding an option to PrintVariables
for whether info text should be included (enabled by default, but disabled for the call within ProcessPage
).
How about using:
if (fp == stdout)
or:
(fp == stdout) ? ... : ...
in:
https://github.com/tesseract-ocr/tesseract/blob/60fd2b4abaa9c5c5c42d32db57576bc95d28a78a/src/ccutil/params.cpp#L164
@amitdo Works for me. Added PR #3947.
I confirmed (1) running tesseract --print-parameters
from the command line still prints the info text and (2) the issue demonstrated in the example above is resolved (the value of chs_trailing_punct1
is not changed).
The regression was added in c51691fde in 2014, a long time ago, so all relevant releases are affected.
Obviously the argument retry_config
for the API functions ProcessPages
and ProcessPagesFileList
which triggers the regression was not used for years. Is it useful? If not, we simply might drop the support for it.
If retry_config
is useful, maybe the regression could be fixed by extending ParamUtils::ReadParamsFromFp
to remove the comment part from the input lines. Or even better, TessBaseAPI::ProcessPage
should save and restore parameters in memory without any file on disk.
I think the ability to dump the user's current parameters and easily restore them later is useful (we're implementing a feature in Tesseract.js that tries to do this using PrintVariables
/ReadConfigFile
).
Adding a new function that writes a config file without the info text (suggested in the PR #3947) seems like an easy and effective solution. I looked into whether ReadConfigFile
could be tweaked to ignore the info text, however this seems potentially tricky/error-prone. My first thought was to read everything to the first space character as the parameter value, however at least one parameter uses space character(s) as values (the default value of page_separator
is \n
).
I think the ability to dump the user's current parameters and easily restore them later is useful
Related issue: #3260
I looked into whether
ReadConfigFile
could be tweaked to ignore the info text, however this seems potentially tricky/error-prone. My first thought was to read everything to the first space character as the parameter value, however at least one parameter uses space character(s) as values (the default value ofpage_separator
is\n
).
ParamUtils::PrintParams
uses KEY TAB VALUE TAB COMMENT.
https://github.com/tesseract-ocr/tesseract/issues/3670#issuecomment-985273726
This comment and my comments below it are also related to this issue.
ParamUtils::PrintParams uses KEY TAB VALUE TAB COMMENT.
Okay, if we are confident that VALUE will never include tabs, then I can edit ReadConfigFile
to only read until the first tab. If this is not the case, then I can do the separate DumpParameters
function.
@stweil Do you believe it's safe to assume that tabs will not be used as parameter values (so it is safe for ReadConfigFile
to read everything up to the tab and ignore everything after)?
I created a new PR (#4074) based on @stweil's idea to implement by adding a new function in https://github.com/tesseract-ocr/tesseract/pull/3947#pullrequestreview-1146878163.