Robert Sachunsky issues

Results: 272 issues of Robert Sachunsky

fix/discuss recommended workflows

I am surprised to see the following in our current recommendations: - Ocropy nlbin instead of one of the Olena algorithms - slow `skimage` binarize/denoise processors instead of Olena/Ocropy -...

Setup guide: better recommendations for Docker usage

The current wording of the [setup guide](https://ocr-d.github.io/docs/setup#translating-native-commands-to-docker-calls) recommends running the Docker image separately for each processor CLI (_translating native commands to docker calls_). This is one possibility, but I...

workflows: recommend parameters and recipes

Sometimes a word on parameter choices would be helpful. For example, - `threshold` (ocrd-cis-ocropy-binarize) or `k` (ocrd-olena-binarize) parameter for binarization, - `maxskew` (`ocrd-cis-ocropy-deskew`) angle, - `find_tables` (`ocrd-tesserocr-segment-region`) - `padding` (in...

OOM in cropper

On a workspace with >500 pages, running the cropper yields a

```
OSError: [Errno 12] Cannot allocate memory
```

This happens after VSZ (virtual memory) exceeds 32 GB. In contrast,...

deskew: respect PAGE coordinate consistency principle

In https://github.com/kba/ocrd_anybaseocr/blob/c65f67e3afc740d70acca18dc3d2c2b778d54d18/ocrd_anybaseocr/cli/ocrd_anybaseocr_deskew.py#L159, the rotation is applied without also enlarging the image accordingly. This not only loses information (in the corners), but also violates our consistency principle. Subsequent processors will inevitably...
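To illustrate why a rotation needs a larger canvas, here is a minimal sketch of the geometry, not the processor's actual code. The helper name `rotated_canvas_size` is hypothetical; it computes the smallest axis-aligned canvas that still contains every corner of an image rotated about its center.

```python
import math

def rotated_canvas_size(width, height, angle_degrees):
    """Return (width, height) of the smallest axis-aligned canvas that
    contains a width x height image rotated by angle_degrees.

    Hypothetical helper for illustration only.
    """
    theta = math.radians(angle_degrees)
    cos_t = abs(math.cos(theta))
    sin_t = abs(math.sin(theta))
    # Bounding box of the rotated rectangle.
    new_width = width * cos_t + height * sin_t
    new_height = width * sin_t + height * cos_t
    # Round up so no corner falls outside the canvas; the small epsilon
    # absorbs floating-point noise at exact multiples of 90 degrees.
    eps = 1e-9
    return math.ceil(new_width - eps), math.ceil(new_height - eps)

# A small deskewing angle already requires a larger canvas than the original.
print(rotated_canvas_size(1000, 800, 2))
```

Keeping the original canvas size instead crops the corners, which is the information loss described above.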

block segmentation: overlaps and quality of prebuilt models

Once I got the block segmentation to actually run, I was puzzled over the extremely bad results of the provided model. Here's how I gradually worked to isolate the problem....

block segmentation: non-text classes and prebuilt models

In e941321a507ce9f4f6d6416117e441124605748a it seems 3 non-text classes were added: ImageRegion, TableRegion and GraphicsRegion. However, the `Config.NUM_CLASSES` remained the same, and likewise the provided `block_segmentation_weights.h5` still has only 1+14 classes: ``` >>>...

tiseg results not usable

The way in which the trained pixel classifier for text-image segmentation is integrated here makes these predictions completely unusable: - original: ![FILE_0001_ORIGINAL](https://user-images.githubusercontent.com/38561704/106412518-3c7a4100-6448-11eb-9e3c-612eb6251e3b.jpg) - results: | *image part* | *text part*...

periodicals: toc sort order of years different than calendar list view

![boersenblatt-1864-01-01_02](https://user-images.githubusercontent.com/38561704/186683435-0f8822c0-04bd-4ef5-b36f-d587eb6efd06.png) [Link for this example](https://www.boersenblatt-digital.de/pageview?tx_dlf%5Bdouble%5D=0&tx_dlf%5Bid%5D=https%3A%2F%2Fdigital.slub-dresden.de%2Fdata%2Fkitodo%2FBrsfded_39946221X-1864010101_01-t%2FBrsfded_39946221X-1864010101_01-t_year.xml&tx_dlf%5Bpage%5D=1&cHash=2416a44bda547cd465a311f8c090146a) In the list of issues for a year on the left side (table of contents), the order of issues is wrong: In the above example,...

☇ bug

efficient parallel list-list solver

In rapidfuzz there's a [cdist](https://maxbachmann.github.io/RapidFuzz/Usage/process.html#rapidfuzz.process.cdist) function that computes a matrix of alignment scores between each pair of two collections [in parallel](https://github.com/maxbachmann/RapidFuzz/blob/aa6a88fae4ab331d9c05831ec80af8306eb8b6cd/src/rapidfuzz/process_cpp.hpp#L476). Is there something similar in pyalign, too?
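For readers unfamiliar with the idea, the following is a rough standard-library sketch of what such a list-list solver computes: a score matrix with one row per query and one column per choice. It uses neither rapidfuzz nor pyalign; the names `similarity` and `score_matrix` are hypothetical, and `difflib.SequenceMatcher` stands in for a real alignment scorer.

```python
from concurrent.futures import ThreadPoolExecutor
from difflib import SequenceMatcher

def similarity(a, b):
    """Stand-in scorer: similarity ratio in [0, 1] from difflib."""
    return SequenceMatcher(None, a, b).ratio()

def score_matrix(queries, choices, scorer=similarity, workers=4):
    """Score every query against every choice.

    Returns a list of rows, one per query, each holding one score per choice.
    Rows are distributed over a thread pool. Hypothetical helper for
    illustration only.
    """
    def score_row(query):
        return [scorer(query, choice) for choice in choices]

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves the order of the queries.
        return list(pool.map(score_row, queries))

matrix = score_matrix(["abc", "xyz"], ["abc", "abd"])
for row in matrix:
    print(row)
```

Note that threads in CPython do not speed up a pure-Python scorer because of the global interpreter lock; the sketch only shows the shape of the computation. A real speed-up requires the scoring loop to run in native code, as in the linked C++ implementation.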