Robert Sachunsky

Results 735 comments of Robert Sachunsky

Yes, there is. For one, at the line level, Tesseract has more context for deciding on individual characters and words. It uses a language model for this (i.e. a...

> We cannot use GPL code in tesstrain with Apache license.

I don't think this short script meets the threshold of originality, though. After all, it just counts characters of...

> Also, it would help to offer a rule for the makefile already.
>
> Besides, in [314e799](https://github.com/tesseract-ocr/tesstrain/commit/314e799e8d5457d9792ee7e6b62b60b93036e64f) I proposed a similar functionality (only using shell means, i.e. `grep -o...

> Indeed, `ProcessPage` simply converts the given PIL image to pix and calls tesseract's `ProcessPage` API so we shouldn't make such changes to the API. Perhaps providing separate API methods...

@zdenop thanks for your explanation, I had completely overlooked that aspect. Indeed, `ProcessPage` is the only function that allows us to hand in a `Pix` instance in-memory. (Tesseract's `ProcessPages` also...

This parameter is only effective when using `ProcessPage` or `ProcessPages`, as the CLI does.

Wait. What if the original code did not target `*.lstm-unicharset`, but `*.unicharset` (or both)?

> Wait. What if the original code did not target `*.lstm-unicharset`, but `*.unicharset` (or both)?

Not relevant: we can only fine-tune from LSTM models, not (purely) Omnifont models.

But there's an **additional issue** previously unnoticed: `PROTO_MODEL`'s `combine_lang_model` recipe expects to see `$(DATA_DIR)/*.unicharset` for every `get_script_from_script_id` in the unicharset table, i.e. `{Common,Latin,Greek,Cyrillic,Hebrew}.unicharset`, and an obscure `Inherited.unicharset`. But the current...
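To make that expectation concrete, here is a minimal sketch of a pre-flight check that lists which per-script unicharset files are absent from a data directory. It is not part of tesstrain: the function name `missing_script_unicharsets` and the constant `EXPECTED_SCRIPTS` are hypothetical, and the script list is simply the set of names mentioned above, which may not be exhaustive.

```python
from pathlib import Path

# Script names as mentioned in the comment above (assumption: illustrative,
# not necessarily the complete set that combine_lang_model may ask for).
EXPECTED_SCRIPTS = ["Common", "Latin", "Greek", "Cyrillic", "Hebrew", "Inherited"]


def missing_script_unicharsets(data_dir, scripts=EXPECTED_SCRIPTS):
    """Return the script names whose <script>.unicharset is absent from data_dir."""
    data_dir = Path(data_dir)
    # Keep the input order so the report is stable and easy to read.
    return [s for s in scripts if not (data_dir / f"{s}.unicharset").is_file()]
```

Calling `missing_script_unicharsets("data")` (with whatever directory `DATA_DIR` points to) before training would surface the gap early instead of failing inside the recipe.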