rmast

Results 184 comments of rmast

I'm now on track to find the cause of the doubled 'wis - clear'. The second row of block 28 gives 4 words: a blob of the full row,...

Setting `-c edges_use_new_outline_complexity=1` doesn't solve these issues.

There appears to be something wrong with the decision-making around good and bad (rejected) blobs: ``` diff --git a/src/textord/tordmain.cpp b/src/textord/tordmain.cpp index a7f2a168f..97952f1bd 100644 --- a/src/textord/tordmain.cpp +++ b/src/textord/tordmain.cpp @@ -668,12 +668,33...

At this stage of processing, good blobs are still good and rejected blobs are still rejected (the parents are rejected, the children are coming through): ![In the top part the good ones are still good](https://user-images.githubusercontent.com/3341558/185348899-ebc4165b-66be-4c87-ae6c-02c03a1443ab.png)...

Just killing the non-inverted parents in stepblob.cpp solves the issue for both lines: ``` > print > wis - clear diff --git a/src/ccstruct/stepblob.cpp b/src/ccstruct/stepblob.cpp index 4c61b6c65..aac639747 100644 --- a/src/ccstruct/stepblob.cpp +++...

When the parents are kept, as in the original code, far too many blobs per row remain during make_prop words. For the '> print' row there are blobs...

I tested the effect of killing the parents on 5.1.0 with the full page, using ocrmypdf: `ocrmypdf --image-dpi 300 --pdfa-image-compression lossless -O0 ../rmast/175789293-f39ddfdb-6f3e-4598-8d16-80a1f4a88b36.jpg formulierhocrjpgmetpatch5.1.0.pdf` For some reason the resulting selection...

With the 5.2.0 default settings the inverted 'Toelichting 2.1' is read correctly; however, in none of the versions is the bottom line with the ® sign complete.

Yes, Zathura makes a mess of the selection; it doesn't clearly show which lines are selected and which are not.

Please let us know if you find an open-source automatic segmenter that, run unattended, generally does a better job than Tesseract itself. I guess that would be a hit....