Robert Sachunsky comments

Results: 735 comments of Robert Sachunsky


GetUTF8Text produces differently on RIL.WORD

Yes there is. For one, on the line level, Tesseract has a longer context for deciding on individual characters and words. It uses a language model for this (i.e. a...

Create Character Count from training text

Fixes #221 IINM

Create Character Count from training text

> We cannot use GPL code in tesstrain with Apache license. I don't think this short script meets the threshold of originality, though. After all, it just counts characters of...

Create Character Count from training text

> Also, it would help to offer a rule for the makefile already.
>
> Besides, in [314e799](https://github.com/tesseract-ocr/tesstrain/commit/314e799e8d5457d9792ee7e6b62b60b93036e64f) I proposed a similar functionality (only using shell means, i.e. `grep -o...
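For readers who have not seen the script under discussion, counting the characters of a training text could look roughly like the sketch below. This is not the code from the pull request or from the commit linked above; the function names `count_characters` and `format_counts` are made up for illustration, and skipping whitespace is an assumption.

```python
from collections import Counter


def count_characters(text: str) -> Counter:
    """Count every non-whitespace character in the given text.

    Hypothetical helper, not taken from tesstrain.
    """
    return Counter(ch for ch in text if not ch.isspace())


def format_counts(counts: Counter) -> str:
    """Render one 'character<TAB>count' row per character.

    Rows are sorted by descending count; ties are broken by code point
    so that the output is stable across runs.
    """
    rows = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return "\n".join(f"{ch}\t{n}" for ch, n in rows)


if __name__ == "__main__":
    # Illustrative input only; a real run would read the training text file.
    sample = "Tesseract training text\n"
    print(format_counts(count_characters(sample)))
```

A table like this makes it easy to spot characters that are rare or missing in the training text before starting a training run.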

ProcessPage() generates a corrupt file

> Indeed, `ProcessPage` simply converts the given PIL image to pix and calls tesseract's `ProcessPage` API so we shouldn't make such changes to the API. Perhaps providing separate API methods...

ProcessPage() generates a corrupt file

@zdenop thanks for your explanation, I had completely overlooked that aspect. Indeed, `ProcessPage` is the only function that allows us to hand in a `Pix` instance in-memory. (Tesseract's `ProcessPages` also...

Setting variable "tessedit_write_images" has no effect

This parameter is only effective when using `ProcessPage` or `ProcessPages`, as the CLI does.

explicate .lstm-unicharset and my.unicharset prereqs for finetuning

Wait. What if the original code did not target `*.lstm-unicharset`, but `*.unicharset` (or both)?

explicate .lstm-unicharset and my.unicharset prereqs for finetuning

> Wait. What if the original code did not target `*.lstm-unicharset`, but `*.unicharset` (or both)?

Not relevant: we can only fine-tune from LSTM models, not (purely) Omnifont models.

explicate .lstm-unicharset and my.unicharset prereqs for finetuning

But there's an **additional issue** previously unnoticed: `PROTO_MODEL`'s `combine_lang_model` recipe expects to see `$(DATA_DIR)/*.unicharset` for every `get_script_from_script_id` in the unicharset table, i.e. `{Common,Latin,Greek,Cyrillic,Hebrew}.unicharset`, and an obscure `Inherited.unicharset`. But the current...
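To check up front whether a data directory has the per-script files named in the comment above, something like the following sketch might help. It only tests for the presence of `<script>.unicharset` files; the function name `missing_script_unicharsets` is hypothetical, the script list is copied from the comment, and nothing here reproduces what `combine_lang_model` itself does.

```python
from pathlib import Path

# Script names as listed in the comment; a real unicharset may need others.
SCRIPTS = ["Common", "Latin", "Greek", "Cyrillic", "Hebrew", "Inherited"]


def missing_script_unicharsets(data_dir, scripts=SCRIPTS):
    """Return the script names whose '<script>.unicharset' file is absent.

    Hypothetical helper for illustration, not part of tesstrain.
    """
    data_dir = Path(data_dir)
    return [s for s in scripts if not (data_dir / f"{s}.unicharset").is_file()]


if __name__ == "__main__":
    # 'data' is a placeholder; pass whatever DATA_DIR points to.
    for script in missing_script_unicharsets("data"):
        print(f"missing: {script}.unicharset")
```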