tesseract icon indicating copy to clipboard operation
tesseract copied to clipboard

PrintVariables produces config file that ReadConfigFile does not properly read

Open Balearica opened this issue 2 years ago • 13 comments

Expected Behavior:

The API function PrintVariables prints current parameters to a file, and ReadConfigFile reads parameters from a file. Intuitively, ReadConfigFile should be able to read the files that PrintVariables writes. This is explicitly assumed within the ProcessPage function, where these functions are used together to "Save current config variables before switching modes" and then "Restore saved config variables".

https://github.com/tesseract-ocr/tesseract/blob/a873553593988407b7a9043c30c3010e0d066205/src/api/baseapi.cpp#L1293-L1306

Current Behavior:

Unfortunately, this does not currently work properly. The issue is that PrintVariables prints parameter descriptions alongside key/value pairs (e.g. chs_trailing_punct1 ).,;:?! 1st Trailing punctuation), and ReadConfigFile reads the description as a value (for string parameters). An example showing this is below.

#include <tesseract/baseapi.h>

int main()
{
    tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();
    if (api->Init(NULL, "eng")) {
        fprintf(stderr, "Could not initialize tesseract.\n");
        exit(1);
    }

    static const char *kOldVarsFile = "failed_vars.txt";

    // Print default value of chs_trailing_punct1
    printf("Initial value: %s\n", api->GetStringVariable("chs_trailing_punct1"));
    FILE *fp = fopen(kOldVarsFile, "wb");
    api->PrintVariables(fp);
    fclose(fp);
    api->ReadConfigFile(kOldVarsFile);
    printf("After PrintVariables/ReadConfigFile: %s\n", api->GetStringVariable("chs_trailing_punct1"));

    api->End();
    delete api;
    return 0;
}

This returns the following:

Initial value: ).,;:?!
After PrintVariables/ReadConfigFile: ).,;:?!	1st Trailing punctuation

The impact of this is:

  1. ProcessPage does not work correctly when used with retry_config
  2. There is no simple interface for generating a config file with the user's current settings
    1. This is useful for saving/restoring configurations (as ProcessPage attempts to do)

Suggested Fix:

The simplest solution would be to remove the descriptions from the PrintVariables output (or at least hide that behavior behind an option). I can write a PR if others agree this makes sense. Editing ReadConfigFile to ignore the descriptions is likely also possible, but could be higher effort.


Environment

Tesseract Version: 5.2.0 Commit Number: https://github.com/tesseract-ocr/tesseract/commit/15200c6fe7733f71a6cf52fbc1e4d94150f9f168 Platform: Linux ubuntu 5.15.0-43-generic

Balearica avatar Oct 15 '22 03:10 Balearica

The simplest solution would be to remove the descriptions from the PrintVariables output

We don't want to do this because that function is used by the command line tool and we need the descriptions in that case.

https://github.com/tesseract-ocr/tesseract/blob/0daf18c2028e5217cd996522816c3dd2ec1a4198/src/tesseract.cpp#L721-L724

amitdo avatar Oct 15 '22 16:10 amitdo

@amitdo Good point. How about adding an option to PrintVariables for whether info text should be included (enabled by default, but disabled for the call within ProcessPage).

Balearica avatar Oct 15 '22 20:10 Balearica

How about using:

if (fp == stdout)

or:

(fp == stdout) ? ... : ...

in:

https://github.com/tesseract-ocr/tesseract/blob/60fd2b4abaa9c5c5c42d32db57576bc95d28a78a/src/ccutil/params.cpp#L164

amitdo avatar Oct 16 '22 06:10 amitdo

@amitdo Works for me. Added PR #3947.

I confirmed (1) running tesseract --print-parameters from the command line still prints the info text and (2) the issue demonstrated in the example above is resolved (the value of chs_trailing_punct1 is not changed).

Balearica avatar Oct 19 '22 03:10 Balearica

The regression was added in c51691fde in 2014, a long time ago, so all relevant releases are affected.

stweil avatar Oct 19 '22 06:10 stweil

Obviously the argument retry_config for the API functions ProcessPages and ProcessPagesFileList which triggers the regression was not used for years. Is it useful? If not, we simply might drop the support for it.

If retry_config is useful, maybe the regression could be fixed by extending ParamUtils::ReadParamsFromFp to remove the comment part from the input lines. Or even better, TessBaseAPI::ProcessPage should save and restore parameters in memory without any file on disk.

stweil avatar Oct 19 '22 06:10 stweil

I think the ability to dump the user's current parameters and easily restore them later is useful (we're implementing a feature in Tesseract.js that tries to do this using PrintVariables/ReadConfigFile).

Adding a new function that writes a config file without the info text (suggested in the PR #3947) seems like an easy and effective solution. I looked into whether ReadConfigFile could be tweaked to ignore the info text, however this seems potentially tricky/error-prone. My first thought was to read everything to the first space character as the parameter value, however at least one parameter uses space character(s) as values (the default value of page_separator is \n).

Balearica avatar Oct 19 '22 07:10 Balearica

I think the ability to dump the user's current parameters and easily restore them later is useful

Related issue: #3260

amitdo avatar Oct 19 '22 07:10 amitdo

I looked into whether ReadConfigFile could be tweaked to ignore the info text, however this seems potentially tricky/error-prone. My first thought was to read everything to the first space character as the parameter value, however at least one parameter uses space character(s) as values (the default value of page_separator is \n).

ParamUtils::PrintParams uses KEY TAB VALUE TAB COMMENT.

stweil avatar Oct 19 '22 08:10 stweil

https://github.com/tesseract-ocr/tesseract/issues/3670#issuecomment-985273726

This comment and my comments below it are also related to this issue.

amitdo avatar Oct 19 '22 08:10 amitdo

ParamUtils::PrintParams uses KEY TAB VALUE TAB COMMENT.

Okay, if we are confident that VALUE will never include tabs, then I can edit ReadConfigFile to only read until the first tab. If this is not the case, then I can do the separate DumpParameters function.

Balearica avatar Oct 20 '22 02:10 Balearica

@stweil Do you believe it's safe to assume that tabs will not be used as parameter values (so it is safe for ReadConfigFile to read everything up to the tab and ignore everything after)?

Balearica avatar Nov 04 '22 03:11 Balearica

I created a new PR (#4074) based on @stweil's idea to implement by adding a new function in https://github.com/tesseract-ocr/tesseract/pull/3947#pullrequestreview-1146878163.

Balearica avatar May 10 '23 02:05 Balearica