Unable to segment specific images

Open sixtyfive opened this issue 2 years ago • 27 comments

This problem occurs with 11 out of a set of ~646 PNGs, all of which plopped out of the exact same processing pipeline, scanned on exactly the same hardware.

Both models (seg & rec) trained from binary_datasets branch about a week ago.

$ pip show scikit-image
Name: scikit-image
Version: 0.17.2

$ kraken --version # master branch
kraken, version 3.0.7

$ kraken -d cuda:0 -i vol02_page0002_f002.png vol02_page0002_f002.xml -a segment -bl -i ~/.../seg_best.mlmodel ocr -m ~/.../rec_best.mlmodel
[0.0086] Baseline model (~/.../seg_best.mlmodel) given but legacy segmenter selected. Forcing to -bl. 
WARNING:root:Torch version 1.10.1+cu102 has not been tested with coremltools. You may run into unexpected errors. Torch 1.9.1 is the most recent version that has been tested.
Loading ANN ~/.../seg_best.mlmodel     ✓
Loading ANN default     ✓
Segmenting      [19.1791] Polygonizer failed on line 0: all the input arrays must have same number of dimensions, but the array at index 0 has 1 dimension(s) and the array at index 1 has 2 dimension(s) 
[19.2210] Polygonizer failed on line 0: all the input arrays must have same number of dimensions, but the array at index 0 has 1 dimension(s) and the array at index 1 has 2 dimension(s) 
✓
Processing  [####################################]  100%          
Writing recognition results for vol02_page0002_f002.png TopologyException: side location conflict at 2232 4431.6923076923076
[38.5192] Failed processing vol02_page0002_f002.png: No Shapely geometry can be created from null value 
Traceback (most recent call last):
  File "/home/escriptorium/escriptorium/env/bin/kraken", line 8, in <module>
    sys.exit(cli())
  File "/home/escriptorium/escriptorium/env/lib/python3.7/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/home/escriptorium/escriptorium/env/lib/python3.7/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/home/escriptorium/escriptorium/env/lib/python3.7/site-packages/click/core.py", line 1691, in invoke
    return _process_result(rv)
  File "/home/escriptorium/escriptorium/env/lib/python3.7/site-packages/click/core.py", line 1628, in _process_result
    value = ctx.invoke(self._result_callback, value, **ctx.params)
  File "/home/escriptorium/escriptorium/env/lib/python3.7/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/home/escriptorium/escriptorium/env/lib/python3.7/site-packages/kraken/kraken.py", line 380, in process_pipeline
    task(input=input, output=output)
  File "/home/escriptorium/escriptorium/env/lib/python3.7/site-packages/kraken/kraken.py", line 252, in recognizer
    ctx.meta['output_mode']))
  File "/home/escriptorium/escriptorium/env/lib/python3.7/site-packages/kraken/serialization.py", line 204, in serialize
    pols = unary_union(pols)
  File "/home/escriptorium/escriptorium/env/lib/python3.7/site-packages/shapely/ops.py", line 161, in unary_union
    return geom_factory(lgeos.methods['unary_union'](collection))
  File "/home/escriptorium/escriptorium/env/lib/python3.7/site-packages/shapely/geometry/base.py", line 73, in geom_factory
    raise ValueError("No Shapely geometry can be created from null value")
ValueError: No Shapely geometry can be created from null value

But other images from the same set work fine:

$ kraken -d cuda:0 -i vol02_page0212_f001.png vol02_page0212_f001.xml -a segment -bl -i ~/.../seg_best.mlmodel ocr -m ~/.../rec_best.mlmodel
[0.0123] Baseline model (~/.../seg_best.mlmodel) given but legacy segmenter selected. Forcing to -bl. 
WARNING:root:Torch version 1.10.1+cu102 has not been tested with coremltools. You may run into unexpected errors. Torch 1.9.1 is the most recent version that has been tested.
Loading ANN ~/.../seg_best.mlmodel     ✓
Loading ANN default     ✓
Segmenting      ✓
Processing  [####################################]  100%          
Writing recognition results for vol02_page0212_f001.png ✓

sixtyfive avatar Jan 27 '22 15:01 sixtyfive

Hm, I can't reproduce the error. Both commands run through both on CPU and GPU. Could you give me the full dump of installed packages (pip list)?

mittagessen avatar Jan 31 '22 15:01 mittagessen

CPU vs. GPU makes no difference here, either.

Requested list for that env:

$ pip list
Package                Version
---------------------- ----------
aioredis               1.3.1
albumentations         1.1.0
amqp                   2.5.2
asgiref                3.4.1
async-timeout          4.0.2
attrs                  21.4.0
autobahn               21.11.1
Automat                20.2.0
beautifulsoup4         4.7.1
billiard               3.6.4.0
bleach                 3.1.5
celery                 4.4.0rc4
certifi                2021.10.8
cffi                   1.15.0
channels               2.4.0
channels-redis         2.4.2
chardet                3.0.4
click                  8.0.3
constantly             15.1.0
coremltools            5.1.0
cryptography           36.0.1
cycler                 0.11.0
daphne                 2.5.0
Django                 2.2.26
django-cleanup         5.1.0
django-ordered-model   3.1.1
django-prometheus      2.2.0
django-ranged-response 0.2.0
django-redis           4.10.0
django-simple-captcha  0.5.12
djangorestframework    3.9.2
drf-nested-routers     0.91
easy-thumbnails        2.5
fonttools              4.28.5
hiredis                2.0.0
hyperlink              21.0.0
idna                   2.8
imageio                2.13.5
importlib-metadata     4.10.0
importlib-resources    5.4.0
incremental            21.3.0
Jinja2                 3.0.3
jsonschema             4.3.3
kiwisolver             1.3.2
kombu                  4.6.6
kraken                 3.0.7
lxml                   4.7.1
MarkupSafe             2.0.1
matplotlib             3.5.1
mpmath                 1.2.1
msgpack                0.6.2
networkx               2.6.3
numpy                  1.21.5
opencv-python-headless 4.5.5.62
packaging              21.3
Pillow                 9.0.0
pip                    21.3.1
pkg_resources          0.0.0
prometheus-client      0.12.0
protobuf               3.19.1
psycopg2-binary        2.7.6
pyasn1                 0.4.8
pyasn1-modules         0.2.8
pycparser              2.21
pyOpenSSL              21.0.0
pyparsing              3.0.6
pyrsistent             0.18.0
python-bidi            0.4.2
python-dateutil        2.8.2
pytz                   2021.3
pyvips                 2.1.12
PyWavelets             1.2.0
PyYAML                 6.0
qudida                 0.0.4
redis                  3.2.1
regex                  2021.11.10
requests               2.21.0
scikit-image           0.17.2
scikit-learn           0.19.2
scipy                  1.5.2
service-identity       21.1.0
setuptools             44.0.0
Shapely                1.8.0
six                    1.16.0
soupsieve              2.3.1
sqlparse               0.4.2
sympy                  1.9
tifffile               2021.11.2
torch                  1.10.1
torchvision            0.11.2
tqdm                   4.62.3
Twisted                21.7.0
txaio                  21.2.1
typing_extensions      4.0.1
urllib3                1.24.3
uWSGI                  2.0.18
vine                   1.3.0
webencodings           0.5.1
wheel                  0.34.2
zipp                   3.7.0
zope.interface         5.4.0
WARNING: You are using pip version 21.3.1; however, version 22.0.2 is available.
You should consider upgrading via the '/home/escriptorium/escriptorium/env/bin/python -m pip install --upgrade pip' command.

sixtyfive avatar Jan 31 '22 16:01 sixtyfive

Not sure if it's related: I wanted to recognize some other images and did so with the 3.0.7 that comes with eScript (same as above), which worked fine.

Then I got curious and tried again with another, freshly installed, 3.0.7 with its own pyenv, and got this:

WARNING:root:Torch version 1.10.2+cu102 has not been tested with coremltools. You may run into unexpected errors. Torch 1.9.1 is the most recent version that has been tested.
Loading ANN /home/escriptorium/escriptorium/media/documents/tavo_99/segmodels/tavo_seg_best.mlmodel     ✓
Loading ANN default     ✓
Segmenting      ✗                                                                              
[28.3720] Failed processing vol03_page0045_f001.png: ndim
...

That one has a much shorter package list:

$ pip list
Package             Version  
------------------- ---------
attrs               21.4.0   
certifi             2021.10.8
charset-normalizer  2.0.11   
click               8.0.3    
coremltools         5.1.0    
idna                3.3      
imageio             2.14.1   
importlib-metadata  4.10.1   
importlib-resources 5.4.0    
Jinja2              3.0.3    
jsonschema          4.4.0    
kraken              3.0.7    
lxml                4.7.1    
MarkupSafe          2.0.1    
mpmath              1.2.1    
networkx            2.6.3    
numpy               1.21.5   
packaging           21.3     
Pillow              9.0.0    
pip                 20.0.2   
pkg-resources       0.0.0    
protobuf            3.19.4   
pyparsing           3.0.7    
pyrsistent          0.18.1   
python-bidi         0.4.2    
PyWavelets          1.2.0    
regex               2022.1.18
requests            2.27.1   
scikit-image        0.19.1   
scipy               1.7.3    
setuptools          44.0.0   
Shapely             1.8.0    
six                 1.16.0   
sympy               1.9      
tifffile            2021.11.2
torch               1.10.2   
torchvision         0.11.3   
tqdm                4.62.3   
typing-extensions   4.0.1    
urllib3             1.26.8   
wheel               0.34.2   
zipp                3.7.0
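
Comparing two long `pip list` dumps by eye is error-prone. The following is a minimal sketch, not part of kraken, that parses two such dumps and prints only the packages whose versions differ. The helper names `parse_pip_list` and `diff_versions` are made up for this illustration, and the sample input is a shortened excerpt of the two lists above.

```python
def parse_pip_list(text):
    """Parse the table printed by `pip list` into {package_name: version}."""
    versions = {}
    for line in text.splitlines():
        parts = line.split()
        # Skip the header, the dashed separator, blank lines and pip warnings.
        if len(parts) != 2 or parts[0] == "Package" or set(parts[0]) == {"-"}:
            continue
        # pip treats '-' and '_' (and upper/lower case) as equivalent in names.
        name = parts[0].lower().replace("_", "-")
        versions[name] = parts[1]
    return versions


def diff_versions(env_a, env_b):
    """Return {name: (version_a, version_b)} for shared packages that differ."""
    return {
        name: (env_a[name], env_b[name])
        for name in sorted(env_a.keys() & env_b.keys())
        if env_a[name] != env_b[name]
    }


# Shortened excerpts of the two package lists posted in this thread.
escriptorium_env = parse_pip_list("""
Package                Version
---------------------- ----------
kraken                 3.0.7
scikit-image           0.17.2
scipy                  1.5.2
Shapely                1.8.0
typing_extensions      4.0.1
""")
fresh_env = parse_pip_list("""
Package             Version
------------------- ---------
kraken              3.0.7
scikit-image        0.19.1
scipy               1.7.3
Shapely             1.8.0
typing-extensions   4.0.1
""")

for name, (old, new) in diff_versions(escriptorium_env, fresh_env).items():
    print(f"{name}: {old} -> {new}")
```

With the full lists, the same approach would also surface the other differences between the two environments (for example torch and torchvision).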

sixtyfive avatar Feb 02 '22 19:02 sixtyfive

On 22/02/02 11:11AM, J. R. Schmid wrote:

Not sure if it's related: I wanted to recognize some other images and did so with the 3.0.7 that comes with eScript (same as above), which worked fine. Then I got curious and tried again with another, freshly installed, 3.0.7 with its own pyenv, and got this:

Yeah, that's the scikit-image fix that wasn't in a stable release yet. I've cherry-picked it from the binary_dataset branch into master and tagged a new release 3.0.8. That should at least deal with the crash below.

mittagessen avatar Feb 03 '22 10:02 mittagessen

Ah woops, sorry, had forgotten about that one!

sixtyfive avatar Feb 03 '22 11:02 sixtyfive

I get different kinds of TopologyException (67 so far while processing 6800 images from UB Tübingen) with the latest kraken release. Two examples from the first and second log files:

Segmenting      ✓
Processing ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 37/37 0:00:00 0:00:23
Writing recognition results for baur1834/baur1834_00298.jpg     ✓
Segmenting      TopologyException: unable to assign free hole to a shell at 728 1436
[08/01/22 09:39:15] WARNING  Polygonizer failed on line 0:   segmentation.py:706
                             No Shapely geometry can be                         
                             created from null value                            
✓
Processing ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 37/37 0:00:00 0:00:22
Writing recognition results for baur1834/baur1834_00299.jpg     ✓
[...]
Processing ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 101/101 0:00:00 0:00:11
Writing recognition results for dzcw_1857/dzcw_1857_0420.jpg    ✓
Segmenting      TopologyException: side location conflict at 1292.5833333333333 1371.6666666666667. This can occur if the input geometry is invalid.
[08/03/22 07:21:50] WARNING  Polygonizer failed on line 0:   segmentation.py:710
                             No Shapely geometry can be                         
                             created from null value                            
✓
Processing ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 99/99 0:00:00 0:00:12
Writing recognition results for dzcw_1857/dzcw_1857_0421.jpg    ✓
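
To see which kinds of TopologyException dominate across many log files, the messages can be tallied with a few lines of Python. This is only a sketch: the regular expression is an assumption derived from the log lines quoted in this thread, and `count_topology_exceptions` is a hypothetical helper, not a kraken function.

```python
import re
from collections import Counter

# Matches e.g. "TopologyException: side location conflict at 1292.58 1371.66".
# The pattern is inferred from the log excerpts in this thread.
TOPOLOGY_RE = re.compile(r"TopologyException: (?P<kind>.+?) at [-\d.]+ [-\d.]+")


def count_topology_exceptions(log_text):
    """Count TopologyException messages in a log, grouped by kind."""
    return Counter(m.group("kind") for m in TOPOLOGY_RE.finditer(log_text))


sample_log = """\
Segmenting      TopologyException: unable to assign free hole to a shell at 728 1436
Writing recognition results for baur1834/baur1834_00299.jpg
Segmenting      TopologyException: side location conflict at 1292.5833333333333 1371.6666666666667. This can occur if the input geometry is invalid.
Writing recognition results for dzcw_1857/dzcw_1857_0421.jpg
"""

for kind, n in count_topology_exceptions(sample_log).most_common():
    print(f"{n:4d}  {kind}")
```

For real use, the text of each log file would be read and passed to the same function, and the resulting counters added together.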

All images and the ALTO results are available online. The OCR process is documented here. List of installed packages: https://ub-backup.bib.uni-mannheim.de/~stweil/digitue/requirements.txt.

stweil avatar Aug 03 '22 05:08 stweil

This issue is nagging, as it creates an empty result. In one mass production run 11 of 10146 ALTO files were empty because of it, in another one 308 of 43011, so it can occur rather often. In some cases processing the same image with a different model helps. I have now examined it more closely with instrumented / modified kraken code.

This is the test case which fails with unmodified kraken git master in a fresh virtual Python environment (see log file):
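
Empty results are easy to miss in a large batch. The sketch below lists suspicious ALTO files after a run. It assumes that "empty" means a zero-length file, a file that cannot be parsed, or a file without any `String` element; that definition and the helper names are assumptions made for this illustration.

```python
import sys
import xml.etree.ElementTree as ET
from pathlib import Path


def is_empty_alto(path):
    """True if the file is zero-length, unparsable, or has no String elements."""
    path = Path(path)
    if path.stat().st_size == 0:
        return True
    try:
        root = ET.parse(path).getroot()
    except ET.ParseError:
        return True
    # Compare local names so the check works regardless of the ALTO namespace.
    return not any(el.tag.rsplit("}", 1)[-1] == "String" for el in root.iter())


def find_empty_alto(directory):
    """List all *.xml files below `directory` that contain no recognized text."""
    return sorted(p for p in Path(directory).rglob("*.xml") if is_empty_alto(p))


if __name__ == "__main__":
    # Usage: python find_empty_alto.py OUTPUT_DIR [OUTPUT_DIR ...]
    for directory in sys.argv[1:]:
        for path in find_empty_alto(directory):
            print(path)
```

Pages that genuinely contain no text would also be reported, so the list is a starting point for re-processing rather than a definitive error count.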

Segmenting      ✓
Processing ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 37/37 0:00:00 0:00:22
Writing recognition results for feilmoser1830/feilmoser1830_00519.jpg   TopologyException: side location conflict at 129.6338329764454 397.07922912205566. This can occur if the input geometry is invalid.
[08/03/22 15:43:03] ERROR    Failed processing                     kraken.py:396
                             feilmoser1830/feilmoser1830_00519.jpg              
                             : No Shapely geometry can be created               
                             from null value                                    

Now kraken was patched:

diff --git a/kraken/serialization.py b/kraken/serialization.py
index f36a23c..908a842 100644
--- a/kraken/serialization.py
+++ b/kraken/serialization.py
@@ -209,9 +209,14 @@ def serialize(records: Sequence[ocr_record],
                     if pol.area == 0.0:
                         pol = geom.Point(x[0]).buffer(0.5)
                     pols.append(pol)
-                pols = unary_union(pols)
-                coords = np.array(pols.convex_hull.exterior.coords, dtype=np.uint).tolist()
-                seg_struct['boundary'] = coords
+                try:
+                    pols = unary_union(pols)
+                    coords = np.array(pols.convex_hull.exterior.coords, dtype=np.uint).tolist()
+                    seg_struct['boundary'] = coords
+                except Exception as e:
+                    print("pols =", str(pols))
+                    for pol in pols:
+                        print("pol =", str(pol))
             line['recognition'].append(seg_struct)
             char_idx += len(segment)
             seg_idx += 1

With this patch, the log output is

[...]
Writing recognition results for /data1/stweil/digitue/theo/Monographien/feilmoser1830/feilmoser1830_00519.jpg   TopologyException: side location conflict at 129.6338329764454 397.07922912205566. This can occur if the input geometry is invalid.
pols = [<shapely.geometry.polygon.Polygon object at 0x7fa9baeb3850>, <shapely.geometry.polygon.Polygon object at 0x7fa9baeb37f0>, <shapely.geometry.polygon.Polygon object at 0x7fa9baeb38b0>, <shapely.geometry.polygon.Polygon object at 0x7fa9baeb38e0>, <shapely.geometry.polygon.Polygon object at 0x7fa9baeb35e0>, <shapely.geometry.polygon.Polygon object at 0x7fa9baeb3880>, <shapely.geometry.polygon.Polygon object at 0x7fa9baeb3820>]
pol = POLYGON ((74 390, 69 417, 68 443, 78 390, 74 390))
pol = POLYGON ((101 390, 90 445, 94 445, 105 390, 101 390))
pol = POLYGON ((122.49094168581655 390.0947431323506, 122.4978641183102 390.04616621816007, 122.49999184718432 389.99714469805167, 122.49730438124669 389.94815067606186, 122.48982760226345 389.8996559914053, 122.47
763351570339 389.85212767440515, 122.46083955728606 389.80602344873574, 122.43960746201155 389.76178732329333, 122.41414170656398 389.7198453161479, 122.38468754008932 389.6806013517561, 122.351528622312 389.644433
370948, 122.31498429173672 389.6116896911501, 122.27540649024408 389.58268565189843, 122.2331763736979 389.5577005779462, 122.18870064120583 389.53697508921385, 122.14240761838468 389.5207087834875, 122.09474313235
057 389.50905831418345, 122.04616621816007 389.5021358816898, 121.99714469805164 389.50000815281567, 121.94815067606186 389.5026956187533, 121.89965599140533 389.51017239773654, 121.85212767440517 389.5223664842966
4, 121.80602344873571 389.5391604427139, 121.76178732329332 389.56039253798843, 121.7198453161479 389.585858293436, 121.68060135175611 389.6153124599107, 121.64443337094797 389.648471377688, 121.61168969115009 389.
6850157082633, 121.58268565189843 389.7245935097559, 121.55770057794624 389.7668236263021, 121.53697508921387 389.81129935879414, 121.52070878348754 389.8575923816153, 121.50905831418345 389.9052568676494, 110.5090
5831418345 446.9052568676494, 110.5021358816898 446.95383378183993, 110.50000815281568 447.00285530194833, 110.50269561875331 447.05184932393814, 110.51017239773655 447.1003440085947, 110.52236648429661 447.1478723
2559485, 110.53916044271394 447.19397655126426, 110.56039253798845 447.23821267670667, 110.58585829343602 447.2801546838521, 110.61531245991068 447.3193986482439, 110.648471377688 447.355566629052, 110.685015708263
28 447.3883103088499, 110.72459350975592 447.41731434810157, 110.7668236263021 447.4422994220538, 110.81129935879417 447.46302491078615, 110.85759238161532 447.4792912165125, 110.90525686764943 447.49094168581655, 
110.95383378183993 447.4978641183102, 111.00285530194836 447.49999184718433, 111.05184932393814 447.4973043812467, 111.10034400859467 447.48982760226346, 111.14787232559483 447.47763351570336, 111.19397655126429 44
7.4608395572861, 111.23821267670668 447.43960746201157, 111.2801546838521 447.414141706564, 111.31939864824389 447.3846875400893, 111.35556662905203 447.351528622312, 111.38831030884991 447.3149842917367, 111.41731
434810157 447.2754064902441, 111.44229942205376 447.2331763736979, 111.46302491078613 447.18870064120586, 111.47929121651246 447.1424076183847, 111.49094168581655 447.0947431323506, 122.49094168581655 390.094743132
3506))
pol = POLYGON ((131 390, 120 447, 127 448, 130 390, 131 390))
pol = POLYGON ((147.49935488468583 390.02539092634, 147.49943910041867 389.9763233242831, 147.49471344459326 389.9274837415564, 147.48522342785856 389.8793425300399, 147.4710604442635 389.8323633159172, 147.4523608
9108232 389.7869985347017, 147.4293048552344 389.7436850740355, 147.40211437894868 389.7028400662224, 147.37105132137557 389.6648568710156, 147.33641483674037 389.6301012873489, 147.2985384933243 389.59890803049217, 147.25778706101974 389.571577508561, 147.21455299839687 389.54837292942193, 147.1692526731128 389.529517765857, 147.12232235206375 389.51519360339864, 147.0742139998961 389.5055383915617, 147.02539092633998 389.50064511531417, 146.9763233242831 389.50056089958133, 146.92748374155636 389.5052865554067, 146.87934253003988 389.51477657214144, 146.83236331591715 389.52893955573654, 146.78699853470172 389.5476391089177, 146.7436850740355 389.5706951447656, 146.70284006622236 389.5978856210513, 146.66485687101564 389.6289486786244, 146.63010128734888 389.66358516325965, 146.59890803049217 389.7014615066757, 146.571577508561 389.74221293898023, 146.54837292942193 389.78544700160313, 146.529517765857 389.8307473268872, 146.51519360339864 389.87767764793625, 146.50553839156166 389.92578600010387, 146.50064511531417 389.97460907366, 143.50064511531417 448.97460907366, 143.50056089958133 449.0236766757169, 143.50528655540674 449.0725162584436, 143.51477657214144 449.1206574699601, 143.5289395557365 449.1676366840828, 143.54763910891768 449.2130014652983, 143.5706951447656 449.2563149259645, 143.59788562105132 449.2971599337776, 143.62894867862443 449.3351431289844, 143.66358516325963 449.3698987126511, 143.7014615066757 449.40109196950783, 143.74221293898026 449.428422491439, 143.78544700160313 449.45162707057807, 143.8307473268872 449.470482234143, 143.87767764793625 449.48480639660136, 143.9257860001039 449.4944616084383, 143.97460907366002 449.49935488468583, 144.0236766757169 449.49943910041867, 144.07251625844364 449.4947134445933, 144.12065746996012 449.48522342785856, 144.16763668408285 449.47106044426346, 144.21300146529828 449.4523608910823, 144.2563149259645 449.4293048552344, 144.29715993377764 449.4021143789487, 144.33514312898436 449.3710513213756, 
144.36989871265112 449.33641483674035, 144.40109196950783 449.2985384933243, 144.428422491439 449.25778706101977, 144.45162707057807 449.21455299839687, 144.470482234143 449.1692526731128, 144.48480639660136 449.12232235206375, 144.49446160843834 449.07421399989613, 144.49935488468583 449.02539092634, 147.49935488468583 390.02539092634))
pol = POLYGON ((160.49937616943893 390.02496880847195, 160.49941890789353 389.97590115275733, 160.494651969255 389.92706558211546, 160.48512126174825 389.87893240978786, 160.47091857129664 389.8319651845362, 160.45218067757312 389.7866162264147, 160.42908803673748 389.7433222706802, 160.4018630435447 389.70250026179343, 160.3707678895619 389.66454333801545, 160.33610203812026 389.62981704527164, 160.29819934031946 389.5986558167445, 160.25742481985904 389.5713597520991, 160.21417115766064 389.5481917273596, 160.16885491013613 389.52937486326914, 160.12191249752152 389.51509037651533, 160.0737960009116 389.5054758345142, 160.02496880847195 389.5006238305611, 159.97590115275736 389.5005810921065, 159.9270655821155 389.505348030745, 159.87893240978784 389.51487873825175, 159.83196518453622 389.52908142870336, 159.78661622641468 389.5478193224269, 159.7433222706802 389.5709119632625, 159.7025002617934 389.5981369564553, 159.66454333801545 389.6292321104381, 159.62981704527164 389.66389796187974, 159.59865581674447 389.70180065968054, 159.5713597520991 389.742575180141, 159.5481917273596 389.78582884233936, 159.52937486326917 389.83114508986387, 159.5150903765153 389.8780875024785, 159.5054758345142 389.9262039990884, 159.50062383056107 389.97503119152805, 156.50062383056107 449.97503119152805, 156.50058109210647 450.02409884724267, 156.505348030745 450.07293441788454, 156.51487873825175 450.12106759021214, 156.52908142870336 450.1680348154638, 156.54781932242688 450.2133837735853, 156.57091196326252 450.2566777293198, 156.5981369564553 450.29749973820657, 156.6292321104381 450.33545666198455, 156.66389796187974 450.37018295472836, 156.70180065968054 450.4013441832555, 156.74257518014096 450.4286402479009, 156.78582884233936 450.4518082726404, 156.83114508986387 450.47062513673086, 156.87808750247848 450.48490962348467, 156.9262039990884 450.4945241654858, 156.97503119152805 450.4993761694389, 157.02409884724264 450.4994189078935, 157.0729344178845 450.494651969255, 
157.12106759021216 450.48512126174825, 157.16803481546378 450.47091857129664, 157.21338377358532 450.4521806775731, 157.2566777293198 450.4290880367375, 157.2974997382066 450.4018630435447, 157.33545666198455 450.3707678895619, 157.37018295472836 450.33610203812026, 157.40134418325553 450.29819934031946, 157.4286402479009 450.257424819859, 157.4518082726404 450.21417115766064, 157.47062513673083 450.16885491013613, 157.4849096234847 450.1219124975215, 157.4945241654858 450.0737960009116, 157.49937616943893 450.02496880847195, 160.49937616943893 390.02496880847195))
pol = POLYGON ((176.49968142717378 393.0178457652562, 176.4990245154025 392.9687824884514, 176.49356172475336 392.920019853348, 176.48334566488657 392.87202747077845, 176.4684747220433 392.82526753362225, 176.44909211153242 392.78019036563575, 176.42538449848877 392.7372300845883, 176.39758020018584 392.6968004214709, 176.36594698721566 392.6592907360406, 176.33078950471165 392.62506226707274, 176.2924463384493 392.59444465343483, 176.2512867540803 392.5677327594834, 176.20770714090216 392.5451838353595, 176.16212719441285 392.5270150395289, 176.11498587441363 392.51340134742753, 176.0667371775861 392.50447386635227, 176.0178457652562 392.50031857282625, 175.96878248845138 392.50097548459746, 175.920019853348 392.50643827524664, 175.87202747077848 392.5166543351134, 175.82526753362225 392.5315252779567, 175.78019036563575 392.5509078884676, 175.7372300845883 392.57461550151123, 175.69680042147095 392.6024197998142, 175.65929073604056 392.6340530127843, 175.62506226707276 392.6692104952883, 175.59444465343483 392.70755366155066, 175.56773275948345 392.7487132459197, 175.54518383535955 392.79229285909787, 175.52701503952892 392.83787280558715, 175.51340134742753 392.8850141255864, 175.5044738663523 392.9332628224139, 175.50031857282622 392.9821542347438, 173.50031857282622 448.9821542347438, 173.5009754845975 449.0312175115486, 173.50643827524664 449.079980146652, 173.51665433511343 449.12797252922155, 173.5315252779567 449.17473246637775, 173.55090788846758 449.21980963436425, 173.57461550151123 449.2627699154117, 173.60241979981416 449.3031995785291, 173.63405301278434 449.3407092639594, 173.66921049528835 449.37493773292726, 173.7075536615507 449.40555534656517, 173.7487132459197 449.4322672405166, 173.79229285909784 449.4548161646405, 173.83787280558715 449.4729849604711, 173.88501412558637 449.48659865257247, 173.9332628224139 449.49552613364773, 173.9821542347438 449.49968142717375, 174.03121751154862 449.49902451540254, 174.079980146652 449.49356172475336, 
174.12797252922152 449.4833456648866, 174.17473246637775 449.4684747220433, 174.21980963436425 449.4490921115324, 174.2627699154117 449.42538449848877, 174.30319957852905 449.3975802001858, 174.34070926395944 449.3659469872157, 174.37493773292724 449.3307895047117, 174.40555534656517 449.29244633844934, 174.43226724051655 449.2512867540803, 174.45481616464045 449.20770714090213, 174.47298496047108 449.16212719441285, 174.48659865257247 449.1149858744136, 174.4955261336477 449.0667371775861, 174.49968142717378 449.0178457652562, 176.49968142717378 393.0178457652562))

So unary_union fails while processing an array of 7 polygons. Because of the patch, kraken now writes an ALTO file which includes those polygons to describe the glyphs of the word wachſen in the scanned image:

<String ID="segment_66" CONTENT="wachſen" HPOS="68" VPOS="390" WIDTH="108" HEIGHT="60" WC="1.0">
<Glyph ID="char_265" CONTENT="w" HPOS="68" VPOS="390" WIDTH="10" HEIGHT="53" GC="1.0">
<Shape>
<Polygon POINTS="74 390 69 417 68 443 78 390"/>
</Shape>
</Glyph>
<Glyph ID="char_266" CONTENT="a" HPOS="90" VPOS="390" WIDTH="15" HEIGHT="55" GC="1.0">
<Shape>
<Polygon POINTS="101 390 90 445 94 445 105 390"/>
</Shape>
</Glyph>
<Glyph ID="char_267" CONTENT="c" HPOS="111" VPOS="390" WIDTH="11" HEIGHT="57" GC="1.0">
<Shape>
<Polygon POINTS="122 390 111 447 111 447 122 390"/>
</Shape>
</Glyph>
<Glyph ID="char_268" CONTENT="h" HPOS="120" VPOS="390" WIDTH="11" HEIGHT="58" GC="1.0">
<Shape>
<Polygon POINTS="131 390 120 447 127 448 130 390"/>
</Shape>
</Glyph>
<Glyph ID="char_269" CONTENT="ſ" HPOS="144" VPOS="390" WIDTH="3" HEIGHT="59" GC="1.0">
<Shape>
<Polygon POINTS="147 390 144 449 144 449 147 390"/>
</Shape>
</Glyph>
<Glyph ID="char_270" CONTENT="e" HPOS="157" VPOS="390" WIDTH="3" HEIGHT="60" GC="1.0">
<Shape>
<Polygon POINTS="160 390 157 450 157 450 160 390"/>
</Shape>
</Glyph>
<Glyph ID="char_271" CONTENT="n" HPOS="174" VPOS="393" WIDTH="2" HEIGHT="56" GC="1.0">
<Shape>
<Polygon POINTS="176 393 174 449 174 449 176 393"/>
</Shape>
</Glyph>
</String>
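
Several of the glyph polygons above (c, ſ, e, n) list the same two points twice, so after rounding to integers they enclose no area. A quick way to spot such slivers is the shoelace formula. This is a minimal, standard-library-only sketch; `parse_points` and `polygon_area` are hypothetical helpers, and the POINTS strings are copied from the ALTO excerpt above.

```python
def parse_points(points):
    """Turn an ALTO POINTS string ("x1 y1 x2 y2 ...") into (x, y) tuples."""
    values = [float(v) for v in points.replace(",", " ").split()]
    return list(zip(values[0::2], values[1::2]))


def polygon_area(vertices):
    """Unsigned area of a simple polygon via the shoelace formula."""
    twice_area = 0.0
    # Pair every vertex with its successor, wrapping around at the end.
    for (x0, y0), (x1, y1) in zip(vertices, vertices[1:] + vertices[:1]):
        twice_area += x0 * y1 - x1 * y0
    return abs(twice_area) / 2.0


# POINTS attributes copied from the ALTO output above.
glyphs = {
    "w": "74 390 69 417 68 443 78 390",
    "c": "122 390 111 447 111 447 122 390",
    "ſ": "147 390 144 449 144 449 147 390",
}

for char, points in glyphs.items():
    area = polygon_area(parse_points(points))
    print(char, area, "degenerate" if area == 0 else "ok")
```

Whether these zero-area shapes are what triggers the TopologyException is not established here; the check only makes them visible.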

stweil avatar Aug 08 '22 19:08 stweil

Urrrgh another shapely/GEOS bug. I'll look into it. In fact the code just above your instrumentation is there to circumvent TopologyExceptions caused by corner cases in the unary_union function. Apparently, we haven't caught them all yet. Why can't geometry be simple?
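
The patched code only uses the convex hull of the union, and the convex hull of a set of polygons is the same as the convex hull of all their vertices. One conceivable workaround, shown here purely as a sketch and not as the fix that went into kraken, is therefore to skip unary_union and compute the hull from the vertices directly. The example uses Andrew's monotone chain algorithm with the standard library only; the polygons are three of the glyph outlines from the dump above.

```python
def convex_hull(points):
    """Andrew's monotone chain; returns the hull vertices in order around the hull."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        # z-component of the cross product of the vectors o->a and o->b.
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other one.
    return lower[:-1] + upper[:-1]


# Vertices of three glyph polygons from the dump (w, a, h).
polygons = [
    [(74, 390), (69, 417), (68, 443), (78, 390)],
    [(101, 390), (90, 445), (94, 445), (105, 390)],
    [(131, 390), (120, 447), (127, 448), (130, 390)],
]
all_vertices = [pt for polygon in polygons for pt in polygon]
print(convex_hull(all_vertices))
```

This sidesteps the union entirely, at the price of not detecting invalid input polygons at all.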

mittagessen avatar Aug 08 '22 21:08 mittagessen

Could it be that each error also leaks GPU memory? I have three running kraken processes which process different large sets of journal pages. Currently they use 5747 MiB, 5225 MiB and 8151 MiB of GPU memory. The process which uses most memory happens to be the one with most errors.

stweil avatar Aug 09 '22 11:08 stweil

It shouldn't really. This is in the serializer. But pytorch/cuda does some weird caching which is most likely the reason for increasing memory usage over time. In my experience it gets released after a while and doesn't cause any trouble.

mittagessen avatar Aug 10 '22 09:08 mittagessen

For these images segmentation (kraken -a -I kath_1902_025_00156.jpg -o .xml segment -bl) fails without a more detailed error message: https://ub-backup.bib.uni-mannheim.de/~stweil/digitue/theo/Zeitschriften/kath_1867_017/kath_1867_017_00434.jpg https://ub-backup.bib.uni-mannheim.de/~stweil/digitue/theo/Zeitschriften/kath_1902_025/kath_1902_025_00156.jpg

stweil avatar Aug 24 '22 14:08 stweil

Sorry, accidental auto-close. I've pushed a fix for the error in your latest message (again geometry weirdness but this time in the region vectorization itself).

mittagessen avatar Aug 24 '22 16:08 mittagessen

@stweil Could you tell me which model you used to get the serializer errors on feilmoser1830_00519.jpg? I've failed to reproduce it with your reichsanzeiger_23.mlmodel and the current blla.mlmodel but I vaguely remember it crashing on my system before the holidays.

mittagessen avatar Aug 25 '22 00:08 mittagessen

@mittagessen, please try it with my digitue_best.mlmodel. The above output was produced with an earlier model of the same series.

stweil avatar Aug 29 '22 05:08 stweil

Same with that one. No crash.

mittagessen avatar Aug 29 '22 10:08 mittagessen

I got the error messages with kraken-4.1.3.dev37 and now updated to kraken-4.1.3.dev48. It still fails during the serialization:

kraken -a -I feilmoser1830/feilmoser1830_00519.jpg -o .xml segment -bl ocr -m digitue_best.mlmodel
Loading ANN /home/stweil/src/github/mittagessen/kraken/kraken/blla.mlmodel	✓
Loading ANN digitue_best.mlmodel	✓
Segmenting	✓
Processing ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 37/37 0:00:00 0:00:21
Writing recognition results for feilmoser1830/feilmoser1830_00519.jpg	TopologyException: side location conflict at 129.6338329764454 397.07922912205566. This can occur if the input geometry is invalid.
[08/29/22 12:26:29] ERROR    Failed processing feilmoser1830/feilmoser1830_00519.jpg: No Shapely geometry can be created from null value 

stweil avatar Aug 29 '22 10:08 stweil

Same with that one. No crash.

Did you create text output? That works for me, too, without any error. Try to produce ALTO XML, PAGE XML or hOCR. Those fail for me.

stweil avatar Aug 29 '22 10:08 stweil

With the latest master and running:

kraken -a -i feilmoser1830_00519.jpg out.xml segment -bl ocr -m Downloads/digitue_best.mlmodel
Loading ANN /home/mittagessen/git/kraken/kraken/blla.mlmodel	✓
Loading ANN Downloads/digitue_best.mlmodel	✓
Segmenting	✓
Processing ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 37/37 0:00:00 0:00:17
Writing recognition results for feilmoser1830_00519.jpg	✓

it works for me. Which shapely version do you have installed (1.7.1 here)? And is it a conda install or pip?

mittagessen avatar Aug 29 '22 10:08 mittagessen

I use a pip install with Python 3.9. That installed Shapely-1.8.2 by default.

stweil avatar Aug 29 '22 10:08 stweil

OK, then I'll see if I can reproduce that environment (and the error). If it is the same error as your instrumentation showed (you can actually get a full stack trace with the --raise-on-error switch) and it is Shapely-related, it is entirely possible that it is a slight difference in the underlying GEOS library's behavior.

mittagessen avatar Aug 29 '22 10:08 mittagessen

After a downgrade to Shapely-1.7.1 it works for me, too.

stweil avatar Aug 29 '22 10:08 stweil

That's already very helpful to know.

mittagessen avatar Aug 29 '22 10:08 mittagessen

Latest version is Shapely-1.8.4. I tested that now, and it also fails.

stweil avatar Aug 29 '22 10:08 stweil

I can reproduce your bug with 1.8.2 and another unrelated TopologyException for 1.8.4. Awesome.

mittagessen avatar Aug 29 '22 11:08 mittagessen

I pinned shapely to 1.7.x until we get a handle on things. 1.8/2.0 is a fairly large rewrite on their side so kicking the can down the road until things have stabilized is probably easier than playing whack-a-mole with a moving target. I'll write some additional tests that should trigger these regressions though to make sure it won't happen again.
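
Since the behavior depends on the Shapely release series, it can help to check at runtime which series is installed. The following sketch uses `importlib.metadata` (Python 3.8+). The helper names and the default series `"1.7"` are assumptions for illustration; adjust the series to whatever the kraken version in use actually pins.

```python
from importlib import metadata  # available from Python 3.8


def in_release_series(version, series):
    """True if `version` (e.g. "1.7.1") belongs to `series` (e.g. "1.7")."""
    wanted = series.split(".")
    return version.split(".")[: len(wanted)] == wanted


def check_shapely(series="1.7"):
    """Report whether the installed Shapely matches the given release series."""
    try:
        installed = metadata.version("shapely")
    except metadata.PackageNotFoundError:
        return "Shapely is not installed"
    if in_release_series(installed, series):
        return f"Shapely {installed} is in the {series}.x series"
    return f"Shapely {installed} is outside the {series}.x series"


print(check_shapely())
```

A plain prefix comparison is used instead of a full version parser to keep the sketch dependency-free; it treats suffixes such as `.post1` as part of the same series.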

mittagessen avatar Aug 29 '22 11:08 mittagessen

Shapely is pinned to 1.8.x now.

When switching to shapely 2.0.1, I do get self-intersections again.

bertsky avatar May 25 '23 10:05 bertsky

I just got the error with latest kraken, Shapely 1.8.5.post1 and images from https://digi.bib.uni-mannheim.de/fileadmin/digi/517438313/max/:

# models available from https://ub-backup.bib.uni-mannheim.de/~stweil/tesstrain/kraken/
kraken --alto -o .xml --batch-input "max/*.jpg" segment --model ubma_segmentation.mlmodel --baseline ocr --model desbillons.mlmodel
scikit-learn version 1.2.2 is not supported. Minimum required version: 0.17. Maximum required version: 1.1.2. Disabling scikit-learn conversion API.
Torch version 2.0.1+cu117 has not been tested with coremltools. You may run into unexpected errors. Torch 2.0.0 is the most recent version that has been tested.
Loading ANN ubma_segmentation.mlmodel   ✓
Loading ANN desbillons.mlmodel  ✓
Segmenting      TopologyException: side location conflict at 751.45901639344265 1090.5081967213114. This can occur if the input geometry is invalid.
[07/07/23 07:26:19] WARNING  Polygonizer failed on line 0:   segmentation.py:735
                             No Shapely geometry can be                         
                             created from null value                            
TopologyException: side location conflict at 1196.8470588235293 731.42352941176466. This can occur if the input geometry is invalid.
[07/07/23 07:26:20] WARNING  Polygonizer failed on line 0:   segmentation.py:735
                             No Shapely geometry can be                         
                             created from null value                            
TopologyException: side location conflict at 717.9970261697066 1091.8542823156224. This can occur if the input geometry is invalid.
[07/07/23 07:26:22] WARNING  Polygonizer failed on line 0:   segmentation.py:735
                             No Shapely geometry can be                         
                             created from null value                            
TopologyException: side location conflict at 820.10526315789468 592.28070175438597. This can occur if the input geometry is invalid.
[07/07/23 07:26:23] WARNING  Polygonizer failed on line 0:   segmentation.py:735
                             No Shapely geometry can be                         
                             created from null value                            
✓
Processing ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 117/117 0:00:00 0:00:32
Writing recognition results for max/1663547289_0057.jpg ✓

stweil avatar Jul 07 '23 05:07 stweil