ocrd_cis

Segmentation takes hours for a single newspaper page

Open stweil opened this issue 1 year ago • 2 comments

While running the QuiVer benchmark tests, the segmentation of a single newspaper page takes several hours. It is still unfinished after 3:30 hours.

Benchmark log:

Launching `/app/workflows/workspaces/reichsanzeiger_random_selected_pages_ocr/data/reichsanzeiger_random/selected_pages_ocr.txt.nf` [nice_kare] DSL2 - revision: 8ad3dbf42c
[...]
executor >  local (6)
[6c/1653c5] process > ocrd_cis_ocropy_binarize_0 [100%] 1 of 1 ✔
[51/f7d705] process > ocrd_tesserocr_crop_1      [100%] 1 of 1 ✔
[0e/64d9d0] process > ocrd_skimage_binarize_2    [100%] 1 of 1 ✔
[d1/127ed7] process > ocrd_skimage_denoise_3     [100%] 1 of 1 ✔
[80/9b6f02] process > ocrd_tesserocr_deskew_4    [100%] 1 of 1 ✔
[8f/17f8eb] process > ocrd_cis_ocropy_segment_5  [  0%] 0 of 1
[-        ] process > ocrd_cis_ocropy_dewarp_6   -
[-        ] process > ocrd_calamari_recognize_7  -

Task log:

04:04:15.301 INFO processor.OcropySegment - INPUT FILE 0 / P_1879_45_0344
04:04:17.330 INFO processor.OcropySegment - computing line segmentation for page "OCR-D-BIN-DENOISE-DESKEW_1879_45_0344"
04:04:17.330 ERROR processor.OcropySegment - Cannot line-segment page "OCR-D-BIN-DENOISE-DESKEW_1879_45_0344": image too wide for a page image (7086, 10777)
04:04:17.335 INFO processor.OcropySegment - created file ID: OCR-D-SEG_1879_45_0344, file_grp: OCR-D-SEG, path: OCR-D-SEG/OCR-D-SEG_1879_45_0344.xml
04:04:17.335 INFO processor.OcropySegment - INPUT FILE 1 / P_1885_5_0055
04:04:19.555 INFO processor.OcropySegment - computing line segmentation for page "OCR-D-BIN-DENOISE-DESKEW_1885_5_0055"
[...]
07:42:36.641 WARNING processor.OcropyResegment - baseline part crosses existing x in region "OCR-D-BIN-DENOISE-DESKEW_1885_5_0055"
07:42:37.321 WARNING processor.OcropySegment - Label 188 contour 1 is too small (131/19460) in region "OCR-D-BIN-DENOISE-DESKEW_1885_5_0055"
07:42:37.395 WARNING processor.OcropyResegment - baseline part crosses existing x in region "OCR-D-BIN-DENOISE-DESKEW_1885_5_0055"
07:42:37.772 WARNING processor.OcropyResegment - baseline part crosses existing x in region "OCR-D-BIN-DENOISE-DESKEW_1885_5_0055"
07:42:37.773 WARNING processor.OcropyResegment - baseline part crosses existing x in region "OCR-D-BIN-DENOISE-DESKEW_1885_5_0055"
07:42:39.257 WARNING processor.OcropyResegment - baseline part component crosses existing x in region "OCR-D-BIN-DENOISE-DESKEW_1885_5_0055"
07:42:39.356 INFO processor.OcropySegment - Added region "OCR-D-BIN-DENOISE-DESKEW_1885_5_0055_region0500" with 34 lines for page "OCR-D-BIN-DENOISE-DESKEW_1885_5_0055"

stweil, Jan 14 '24 07:01

Meanwhile, ocrd_cis_ocropy_segment has been running for more than 4 hours and uses more than 4 GiB of RAM:

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                                                                             
2871287 stweil    20   0 9972556   4.7g  90404 R 100.0   3.7 255:36.63 ocrd-cis-ocropy                                                                                                                     

Update: it is still running, apparently without making any progress, so this looks like an endless loop:

2871287 stweil    20   0 9898648   4.6g  90404 R 100.0   3.7 322:08.17 ocrd-cis-ocropy                                                                                                                     

Update:

2871287 stweil    20   0 9972556   4.7g  90404 R 100.0   3.7      7,34 ocrd-cis-ocropy                                                                                                                     

I killed the process after 7:34 h.

stweil, Jan 14 '24 08:01

Please post the original image (and the exact workflow leading up to it).

Generally speaking, high image resolution (large pages such as newspapers, especially in combination with high pixel density) is always a problem for image preprocessors. That's why I proposed an architecture for annotating scaled-down AlternativeImages in a reusable way for OCR-D, with a well-defined scale factor so that coordinates can still be calculated. Unfortunately, no one picked up on that (and PRImA let us down), and no one had time to add internal downscaling to each individual processor. (I did it for ocrd_detectron2, reducing to 300 DPI when the image is ≥600 DPI.)
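
The downscaling idea above could look roughly like the following sketch. It is only an illustration of the arithmetic, not the code used in ocrd_detectron2 or any other OCR-D processor. The function names are hypothetical, and the 600 DPI input in the usage lines is an assumption (the DPI of the page in question is not stated here). Only the 300/600 DPI values come from the description above.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def downscale_factor(dpi: float,
                     target_dpi: float = 300.0,
                     threshold_dpi: float = 600.0) -> float:
    """Return the factor to multiply the image size by (1.0 = leave as is)."""
    if dpi <= 0:
        raise ValueError("dpi must be positive")
    # Only reduce images at or above the threshold density.
    if dpi >= threshold_dpi:
        return target_dpi / dpi
    return 1.0


def scaled_size(size: Tuple[int, int], factor: float) -> Tuple[int, int]:
    """Pixel size of the downscaled image (at least 1x1)."""
    width, height = size
    return max(1, round(width * factor)), max(1, round(height * factor))


def to_original_coords(points: List[Point], factor: float) -> List[Point]:
    """Map coordinates found on the downscaled image back to the original."""
    return [(x / factor, y / factor) for x, y in points]


# Hypothetical usage: a page assumed to be scanned at 600 DPI.
factor = downscale_factor(600)
small = scaled_size((7086, 10777), factor)
# A polygon detected on the small image, expressed in original pixels:
polygon = to_original_coords([(100, 200), (150, 200), (150, 260)], factor)
```

Because the factor is stored explicitly, any result computed on the smaller image can be converted back to the coordinate system of the full-resolution page.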

bertsky, Jan 18 '24 01:01