Add --vertical_fontlist option to tesstrain.py

Open nagadomi opened this issue 3 years ago • 5 comments

Ported from https://github.com/tesseract-ocr/tesseract/pull/3434 (not merged).

This pull request adds a --vertical_fontlist option to tesstrain.sh for specifying a list of font names used to render vertical text. The format of the list is the same as for the --fontlist option. If --vertical_fontlist <FONTS> is specified, it overrides the VERTICAL_FONTS variable (defined in language-specific.sh) with the specified list of font names.

In the current version, the VERTICAL_FONTS variable is hardcoded in language-specific.sh. So, when creating training data for vertical text such as Japanese, users need to edit the source code even if they specify a list of fontnames with --fontlist and --font_dir.
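The override described above can be sketched in Python roughly as follows. This is a minimal illustration, not the code from the pull request: the `DEFAULT_VERTICAL_FONTS` list, its placeholder font names, and the `resolve_vertical_fonts` helper are hypothetical, and the use of `nargs="+"` for the list format is an assumption.

```python
import argparse

# Hypothetical stand-in for the hardcoded VERTICAL_FONTS list.
# The font names are placeholders, not the real defaults.
DEFAULT_VERTICAL_FONTS = ["Example Gothic", "Example Mincho"]


def build_parser():
    """Build a parser with only the two font options discussed here."""
    parser = argparse.ArgumentParser(description="Sketch of the font options")
    # Assumption: both options accept one or more quoted font names.
    parser.add_argument("--fontlist", nargs="+", default=[],
                        help="Font names to train on.")
    parser.add_argument("--vertical_fontlist", nargs="+", default=None,
                        help="Font names used to render vertical text.")
    return parser


def resolve_vertical_fonts(args):
    """Return the user-supplied list if given, else the hardcoded default."""
    if args.vertical_fontlist is not None:
        return list(args.vertical_fontlist)
    return list(DEFAULT_VERTICAL_FONTS)
```

With this sketch, parsing `--vertical_fontlist "Font A" "Font B"` yields those two names, while omitting the option falls back to the default list, so no source edit is needed to change the vertical fonts.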

nagadomi avatar May 19 '21 01:05 nagadomi

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Jun 18 '21 02:06 stale[bot]

Looks like this was closed accidentally.

bertsky avatar Jul 15 '21 12:07 bertsky

For vertical text, it is hard to use without this option. However, I have developed a more powerful script to replace text2image/tesstrain.py, so I don't use tesstrain.py anymore.

nagadomi avatar Jul 15 '21 15:07 nagadomi

If I understand "Makefile training" correctly, this "src/tesstrain" is not used for training at the moment. AFAIK these Python scripts are based on the old shell training scripts. I suggest keeping them, and maybe we can find a way to integrate them into the current training process.

zdenop avatar Jan 09 '23 15:01 zdenop

@nagadomi: can you update this PR to the recent git code?

zdenop avatar Jan 09 '23 15:01 zdenop

As noted in my request in #307, there are basically two ways to train Tesseract inside this repository:

  • The Makefile-based approach to train on real data.
  • The Python-based approach to train on artificial data, corresponding to the old Bash-based approach.

My suggestion had been to document this somewhere, as it is not always clear, but due to the harsh stale automation and (at least in the past) rather limited responses, it has been buried in closed issues.

stefan6419846 avatar Jan 09 '23 16:01 stefan6419846

As I mentioned in another PR, I am interested in Python-based training, as "make training" is difficult to run on Windows and requires tools that could easily be replaced by Python (unzip, wget, bc, ...). I would suggest merging/reworking the currently open PRs and then moving forward (reviewing issues/PRs marked as "stale", ...).
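As a rough illustration of that point, the following sketch shows how the Python standard library could stand in for the three tools named above. The helper names (`fetch`, `extract`, `ratio`) are hypothetical and do not come from this repository.

```python
import shutil
import urllib.request
import zipfile
from pathlib import Path


def fetch(url, dest):
    """Download url to dest (stands in for wget)."""
    dest = Path(dest)
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest


def extract(archive, target_dir):
    """Unpack a zip archive into target_dir (stands in for unzip)."""
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(target_dir)
    return target_dir


def ratio(numerator, denominator, digits=2):
    """Plain floating-point arithmetic (stands in for bc)."""
    return round(numerator / denominator, digits)
```

Because these helpers rely only on the standard library, they behave the same on Windows and Linux without extra command-line tools.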

zdenop avatar Jan 09 '23 16:01 zdenop

Committed in https://github.com/tesseract-ocr/tesstrain/commit/2c7c6e8feaf8aa1f2d1750b689fa46473453885e

zdenop avatar Jan 25 '23 15:01 zdenop