Migrate Python code to a dedicated package

Open stefan6419846 opened this issue 2 years ago • 10 comments

This is my attempt to migrate the existing code for working with artificial training data to a dedicated Python package, as proposed in #308 and #307. This includes some additional refactoring of the module structure to better encapsulate specific functionality.

I have used version number 0.1 for now, although I am open to changing this.

When migrating, I had two parameters which I am not sure about:

  • overwrite, defaulting to False, does not seem to be used at all.

  • It is not clear to me what extract_font_properties really means, so it currently lacks documentation. text2image --help did not really help me here either:

    --only_extract_font_properties Assumes that the input file contains a list of ngrams. Renders each ngram, extracts spacing properties and records them in output_base/[font_name].fontinfo file. (type:bool default:false)

    What would be an appropriate documentation of the parameter?
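Going only by the help text quoted above, one possible reading is that the flag switches text2image from rendering training images to measuring spacing for each n-gram in the input file and writing the result to a .fontinfo file. The following is a minimal sketch of how a wrapper could document the parameter and pass it through. The function name build_text2image_command and its signature are hypothetical and not part of the tesstrain code; the docstring wording is a proposal, not confirmed documentation.

```python
from typing import List


def build_text2image_command(
    text_file: str,
    output_base: str,
    font: str,
    fonts_dir: str,
    extract_font_properties: bool = False,
) -> List[str]:
    """Build (but do not run) a text2image command line.

    extract_font_properties: proposed wording, based only on the
    text2image help text. If True, text_file is assumed to contain a
    list of n-grams; each n-gram is rendered, its spacing properties
    are extracted and recorded in output_base/[font_name].fontinfo.
    """
    # Hypothetical helper: only the flag names are taken from text2image.
    command = [
        "text2image",
        f"--text={text_file}",
        f"--outputbase={output_base}",
        f"--font={font}",
        f"--fonts_dir={fonts_dir}",
    ]
    if extract_font_properties:
        # Boolean flag, passed without a value.
        command.append("--only_extract_font_properties")
    return command
```

Returning the argument list instead of running it keeps the sketch free of side effects; a caller could hand the list to subprocess.run once the semantics are confirmed.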

If there is anything unclear or you want to see anything changed about this, feel free to ask or report.

stefan6419846 avatar May 24 '22 08:05 stefan6419846

Thanks. A dedicated package for training from fonts is a good idea.

You may want to look at the original bash scripts that these Python scripts were meant to replicate. They are available in older versions, e.g. https://github.com/tesseract-ocr/tesseract/tree/4.0/src/training

If I recall correctly, 'overwrite' was used for the legacy training offered in the bash scripts.

Shreeshrii avatar May 24 '22 08:05 Shreeshrii

  • linedata = False is a legacy-only parameter which is unsupported, so we might be able to drop it in this process as well.
  • extract_font_properties has always been without any documentation, see usage in https://github.com/tesseract-ocr/tesseract/blob/4.1/src/training/tesstrain.sh#L18-L51 for example.
  • overwrite has been used in make__traineddata only (see https://github.com/tesseract-ocr/tesseract/blob/4.1/src/training/tesstrain_utils.sh#L622-L624). As this method is not available any more, we can probably drop it.

stefan6419846 avatar May 24 '22 09:05 stefan6419846

Font properties (bold, italic, ...) are also from the legacy training and still unsupported with the LSTM recognizer. That's one of the reasons why there remains a certain need for legacy models. The old tesstrain.sh supported training of legacy models, and I think that it would be good to support it in the Python code, too. That should be done separately, not in this pull request here, but maybe you can keep the corresponding parameters with appropriate TODO comments.
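As an illustration of keeping the parameters with TODO comments, here is a minimal sketch. The class TrainingArguments, the helper warn_about_legacy_arguments, and the field lang_code are hypothetical names chosen for this example, not the actual tesstrain API. The default for extract_font_properties is an assumption; the False defaults for linedata and overwrite follow what was mentioned in this thread.

```python
import warnings
from dataclasses import dataclass
from typing import List


@dataclass
class TrainingArguments:
    """Hypothetical argument container, for illustration only."""

    lang_code: str
    # TODO: legacy-only parameter; keep until legacy training is ported.
    linedata: bool = False
    # TODO: was only used by make__traineddata in the old bash scripts.
    overwrite: bool = False
    # TODO: font properties belong to legacy training; needs documentation.
    # The default value here is an assumption made for this sketch.
    extract_font_properties: bool = False


# Names of the parameters that are kept only for future legacy support.
LEGACY_ONLY = ("linedata", "overwrite", "extract_font_properties")


def warn_about_legacy_arguments(args: TrainingArguments) -> List[str]:
    """Warn about each legacy-only parameter that is set and return their names."""
    used = [name for name in LEGACY_ONLY if getattr(args, name)]
    for name in used:
        warnings.warn(
            f"'{name}' is kept for legacy training, which is not implemented yet",
            stacklevel=2,
        )
    return used
```

Keeping the fields in place means existing callers do not break, while the warning makes it visible that the values currently have no effect.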

stweil avatar May 24 '22 09:05 stweil

@stweil Do you mean that we should keep the existing parameters for now when you are talking about the legacy support? Or does this refer to the linedata parameter only?

stefan6419846 avatar May 24 '22 10:05 stefan6419846

> @stweil Do you mean that we should keep the existing parameters for now when you are talking about the legacy support? Or does this refer to the linedata parameter only?

I'd keep all existing parameters for now (with comments).

stweil avatar May 24 '22 10:05 stweil

I have updated the requirements inside the README and fixed the parameters for tesstrain.wrapper.run().

stefan6419846 avatar May 24 '22 11:05 stefan6419846

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Jul 10 '22 11:07 stale[bot]

Is there anything to be changed here to get this PR merged?

stefan6419846 avatar Jul 11 '22 06:07 stefan6419846

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Sep 21 '22 06:09 stale[bot]

Any further update on this?

stefan6419846 avatar Sep 21 '22 07:09 stefan6419846

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Nov 02 '22 00:11 stale[bot]

Can this be merged?

zdenop avatar Jan 05 '23 12:01 zdenop

At least from my side, yes. @stweil had some objections about the .gitignore file, but I have not yet heard back after my latest request about which further changes might be required.

stefan6419846 avatar Jan 08 '23 15:01 stefan6419846

I'd remove any reference to Python 3.6 (see my two comments). .gitignore still contains lots of entries which are not strictly necessary, but that is not critical for merging.

stweil avatar Jan 08 '23 16:01 stweil

I have removed the references to Python 3.6 and updated the README to make clear that Tesseract version 5 is supported as well.

stefan6419846 avatar Jan 08 '23 16:01 stefan6419846

@stweil: What about tagging the previous code/commit as version 1.0? That way we could do more reorganization of the code without breaking somebody's workflow. I am interested in making training work on Windows using only Python. If Python is required anyway, then the auxiliaries are not really needed.

zdenop avatar Jan 08 '23 17:01 zdenop