kraken
Training with --repolygonize raises ValueError exception
A training run similar to a previous one, but with the additional option --repolygonize, fails. The training process aborts after spending a long time on polygonizing.
(venv) stweil@ocr-02:~/src/github/ubtue/gt-fraktur$ time nice ketos train -f page -t list.train -o repolygonize/frak-ubtue -d cuda:0 --preload --threads 24 --lag 20 -r 0.0001 -B 1 -w 0 -s '[1,120,0,1 Cr3,13,32 Do0.1,2 Mp2,2 Cr3,13,32 Do0.1,2 Mp2,2 Cr3,9,64 Do0.1,2 Mp2,2 Cr3,9,64 Do0.1,2 S1(1x0)1,3 Lbx200 Do0.1,2 Lbx200 Do.1,2 Lbx200 Do]' --repolygonize
WARNING:root:Torch version 1.11.0.dev20211217+cu113 has not been tested with coremltools. You may run into unexpected errors. Torch 1.9.1 is the most recent version that has been tested.
Repolygonizing data
[222.1904] Polygonizer failed on line 0: LineStrings must have at least 2 coordinate tuples
[222.1965] Polygonizer failed on line 1: LineStrings must have at least 2 coordinate tuples
[222.2026] Polygonizer failed on line 2: LineStrings must have at least 2 coordinate tuples
[222.2088] Polygonizer failed on line 3: LineStrings must have at least 2 coordinate tuples
[222.2149] Polygonizer failed on line 4: LineStrings must have at least 2 coordinate tuples
[222.2210] Polygonizer failed on line 5: LineStrings must have at least 2 coordinate tuples
[222.2269] Polygonizer failed on line 6: LineStrings must have at least 2 coordinate tuples
[222.2330] Polygonizer failed on line 7: LineStrings must have at least 2 coordinate tuples
[222.2389] Polygonizer failed on line 8: LineStrings must have at least 2 coordinate tuples
[222.2447] Polygonizer failed on line 9: LineStrings must have at least 2 coordinate tuples
[222.2509] Polygonizer failed on line 10: LineStrings must have at least 2 coordinate tuples
[222.2569] Polygonizer failed on line 11: LineStrings must have at least 2 coordinate tuples
[222.2630] Polygonizer failed on line 12: LineStrings must have at least 2 coordinate tuples
[222.2692] Polygonizer failed on line 13: LineStrings must have at least 2 coordinate tuples
[222.2757] Polygonizer failed on line 14: LineStrings must have at least 2 coordinate tuples
[222.2817] Polygonizer failed on line 15: LineStrings must have at least 2 coordinate tuples
[222.2878] Polygonizer failed on line 16: LineStrings must have at least 2 coordinate tuples
[222.2941] Polygonizer failed on line 17: LineStrings must have at least 2 coordinate tuples
[222.3002] Polygonizer failed on line 18: LineStrings must have at least 2 coordinate tuples
[222.3063] Polygonizer failed on line 19: LineStrings must have at least 2 coordinate tuples
[222.3124] Polygonizer failed on line 20: LineStrings must have at least 2 coordinate tuples
[222.3185] Polygonizer failed on line 21: LineStrings must have at least 2 coordinate tuples
[222.3244] Polygonizer failed on line 22: LineStrings must have at least 2 coordinate tuples
[222.3303] Polygonizer failed on line 23: LineStrings must have at least 2 coordinate tuples
[222.3361] Polygonizer failed on line 24: LineStrings must have at least 2 coordinate tuples
[222.3420] Polygonizer failed on line 25: LineStrings must have at least 2 coordinate tuples
[222.3421] Polygonizer failed on line 26: LineStrings must have at least 2 coordinate tuples
[222.3479] Polygonizer failed on line 27: LineStrings must have at least 2 coordinate tuples
[222.3539] Polygonizer failed on line 28: LineStrings must have at least 2 coordinate tuples
[222.3598] Polygonizer failed on line 29: LineStrings must have at least 2 coordinate tuples
[222.3660] Polygonizer failed on line 30: LineStrings must have at least 2 coordinate tuples
[222.3661] Polygonizer failed on line 31: LineStrings must have at least 2 coordinate tuples
[222.3721] Polygonizer failed on line 32: LineStrings must have at least 2 coordinate tuples
[222.3781] Polygonizer failed on line 33: LineStrings must have at least 2 coordinate tuples
[222.3841] Polygonizer failed on line 34: LineStrings must have at least 2 coordinate tuples
[222.3903] Polygonizer failed on line 35: LineStrings must have at least 2 coordinate tuples
[222.3962] Polygonizer failed on line 36: LineStrings must have at least 2 coordinate tuples
[222.4022] Polygonizer failed on line 37: LineStrings must have at least 2 coordinate tuples
[222.4082] Polygonizer failed on line 38: LineStrings must have at least 2 coordinate tuples
[222.4142] Polygonizer failed on line 39: LineStrings must have at least 2 coordinate tuples
[222.4201] Polygonizer failed on line 40: LineStrings must have at least 2 coordinate tuples
[222.4262] Polygonizer failed on line 41: LineStrings must have at least 2 coordinate tuples
[222.4320] Polygonizer failed on line 42: LineStrings must have at least 2 coordinate tuples
[222.4373] Polygonizer failed on line 43: LineStrings must have at least 2 coordinate tuples
[222.4429] Polygonizer failed on line 44: LineStrings must have at least 2 coordinate tuples
[222.4484] Polygonizer failed on line 45: LineStrings must have at least 2 coordinate tuples
[222.4539] Polygonizer failed on line 46: LineStrings must have at least 2 coordinate tuples
[222.4596] Polygonizer failed on line 47: LineStrings must have at least 2 coordinate tuples
[222.4652] Polygonizer failed on line 48: LineStrings must have at least 2 coordinate tuples
[222.4708] Polygonizer failed on line 49: LineStrings must have at least 2 coordinate tuples
[222.4766] Polygonizer failed on line 50: LineStrings must have at least 2 coordinate tuples
[222.4824] Polygonizer failed on line 51: LineStrings must have at least 2 coordinate tuples
[222.4878] Polygonizer failed on line 52: LineStrings must have at least 2 coordinate tuples
[222.4934] Polygonizer failed on line 53: LineStrings must have at least 2 coordinate tuples
[222.4990] Polygonizer failed on line 54: LineStrings must have at least 2 coordinate tuples
[222.5047] Polygonizer failed on line 55: LineStrings must have at least 2 coordinate tuples
[222.5104] Polygonizer failed on line 56: LineStrings must have at least 2 coordinate tuples
[222.5159] Polygonizer failed on line 57: LineStrings must have at least 2 coordinate tuples
[222.5215] Polygonizer failed on line 58: LineStrings must have at least 2 coordinate tuples
[222.5273] Polygonizer failed on line 59: LineStrings must have at least 2 coordinate tuples
[222.5329] Polygonizer failed on line 60: LineStrings must have at least 2 coordinate tuples
[222.5386] Polygonizer failed on line 61: LineStrings must have at least 2 coordinate tuples
[222.5441] Polygonizer failed on line 62: LineStrings must have at least 2 coordinate tuples
[222.5496] Polygonizer failed on line 63: LineStrings must have at least 2 coordinate tuples
[222.5552] Polygonizer failed on line 64: LineStrings must have at least 2 coordinate tuples
[222.5609] Polygonizer failed on line 65: LineStrings must have at least 2 coordinate tuples
[222.5664] Polygonizer failed on line 66: LineStrings must have at least 2 coordinate tuples
[222.5721] Polygonizer failed on line 67: LineStrings must have at least 2 coordinate tuples
[222.5776] Polygonizer failed on line 68: LineStrings must have at least 2 coordinate tuples
[222.5834] Polygonizer failed on line 69: LineStrings must have at least 2 coordinate tuples
[222.5889] Polygonizer failed on line 70: LineStrings must have at least 2 coordinate tuples
[222.5945] Polygonizer failed on line 71: LineStrings must have at least 2 coordinate tuples
[222.6002] Polygonizer failed on line 72: LineStrings must have at least 2 coordinate tuples
[222.6060] Polygonizer failed on line 73: LineStrings must have at least 2 coordinate tuples
[222.6103] Polygonizer failed on line 74: LineStrings must have at least 2 coordinate tuples
[222.6131] Polygonizer failed on line 75: LineStrings must have at least 2 coordinate tuples
[222.6185] Polygonizer failed on line 76: LineStrings must have at least 2 coordinate tuples
[222.6239] Polygonizer failed on line 77: LineStrings must have at least 2 coordinate tuples
[222.6279] Polygonizer failed on line 78: LineStrings must have at least 2 coordinate tuples
[222.6328] Polygonizer failed on line 79: LineStrings must have at least 2 coordinate tuples
[222.6349] Polygonizer failed on line 80: LineStrings must have at least 2 coordinate tuples
[222.6419] Polygonizer failed on line 81: LineStrings must have at least 2 coordinate tuples
[222.6457] Polygonizer failed on line 82: LineStrings must have at least 2 coordinate tuples
[398.8291] Polygonizer failed on line 130: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (2,) + inhomogeneous part.
[2742.1722] Polygonizer failed on line 36: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (2,) + inhomogeneous part.
[3519.8444] Polygonizer failed on line 22: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (8,) + inhomogeneous part.
[3620.7844] Polygonizer failed on line 36: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (2,) + inhomogeneous part.
[3620.7855] Polygonizer failed on line 37: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (3,) + inhomogeneous part.
[3624.1895] Polygonizer failed on line 44: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (5,) + inhomogeneous part.
[4287.1528] Polygonizer failed on line 130: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (2,) + inhomogeneous part.
[4805.4429] Polygonizer failed on line 10: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (14,) + inhomogeneous part.
[5029.8292] Polygonizer failed on line 30: No intersection with boundaries. Shapely intersection object: LINESTRING EMPTY
[5034.2293] Polygonizer failed on line 38: No intersection with boundaries. Shapely intersection object: LINESTRING EMPTY
[5224.3756] Polygonizer failed on line 7: No intersection with boundaries. Shapely intersection object: LINESTRING EMPTY
[5224.4206] Polygonizer failed on line 8: No intersection with boundaries. Shapely intersection object: LINESTRING EMPTY
[5225.2441] Polygonizer failed on line 11: No intersection with boundaries. Shapely intersection object: LINESTRING EMPTY
[5226.6153] Polygonizer failed on line 13: No intersection with boundaries. Shapely intersection object: LINESTRING EMPTY
[5231.2774] Polygonizer failed on line 24: No intersection with boundaries. Shapely intersection object: LINESTRING EMPTY
[5231.3316] Polygonizer failed on line 25: No intersection with boundaries. Shapely intersection object: LINESTRING EMPTY
[5231.3779] Polygonizer failed on line 26: No intersection with boundaries. Shapely intersection object: LINESTRING EMPTY
[5231.9679] Polygonizer failed on line 28: No intersection with boundaries. Shapely intersection object: LINESTRING EMPTY
[5234.4476] Polygonizer failed on line 35: No intersection with boundaries. Shapely intersection object: LINESTRING EMPTY
[5236.1580] Polygonizer failed on line 40: No intersection with boundaries. Shapely intersection object: LINESTRING EMPTY
[5236.1973] Polygonizer failed on line 41: No intersection with boundaries. Shapely intersection object: LINESTRING EMPTY
[5236.2531] Polygonizer failed on line 42: No intersection with boundaries. Shapely intersection object: LINESTRING EMPTY
Building training set [------------------------------------] 0/12892multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/usr/lib/python3.9/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "/usr/lib/python3.9/multiprocessing/pool.py", line 48, in mapstar
return list(map(*args))
File "/home/stweil/src/github/OCR-D/ocrd_all/venv/lib/python3.9/site-packages/kraken/lib/train.py", line 50, in _star_fun
return fun(**kwargs)
File "/home/stweil/src/github/OCR-D/ocrd_all/venv/lib/python3.9/site-packages/kraken/lib/dataset.py", line 493, in parse
if not boundary:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/stweil/src/github/OCR-D/ocrd_all/venv/bin/ketos", line 8, in <module>
sys.exit(cli())
File "/home/stweil/src/github/OCR-D/ocrd_all/venv/lib/python3.9/site-packages/click/core.py", line 1128, in __call__
return self.main(*args, **kwargs)
File "/home/stweil/src/github/OCR-D/ocrd_all/venv/lib/python3.9/site-packages/click/core.py", line 1053, in main
rv = self.invoke(ctx)
File "/home/stweil/src/github/OCR-D/ocrd_all/venv/lib/python3.9/site-packages/click/core.py", line 1659, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/stweil/src/github/OCR-D/ocrd_all/venv/lib/python3.9/site-packages/click/core.py", line 1395, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/stweil/src/github/OCR-D/ocrd_all/venv/lib/python3.9/site-packages/click/core.py", line 754, in invoke
return __callback(*args, **kwargs)
File "/home/stweil/src/github/OCR-D/ocrd_all/venv/lib/python3.9/site-packages/click/decorators.py", line 26, in new_func
return f(get_current_context(), *args, **kwargs)
File "/home/stweil/src/github/OCR-D/ocrd_all/venv/lib/python3.9/site-packages/kraken/ketos.py", line 569, in train
trainer = KrakenTrainer.recognition_train_gen(hyper_params,
File "/home/stweil/src/github/OCR-D/ocrd_all/venv/lib/python3.9/site-packages/kraken/lib/train.py", line 679, in recognition_train_gen
for im in pool.imap_unordered(partial(_star_fun, gt_set.parse), training_data, 5):
File "/usr/lib/python3.9/multiprocessing/pool.py", line 448, in <genexpr>
return (item for chunk in result for item in chunk)
File "/usr/lib/python3.9/multiprocessing/pool.py", line 870, in next
raise value
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
real 110m28.678s
user 553m21.429s
sys 1440m42.823s
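The final ValueError in the traceback above is the message NumPy raises when an array with more than one element is used in a boolean context, as in `if not boundary:`. The following is only a minimal sketch of a length-based emptiness check that avoids relying on the object's truth value; the helper name `has_points` is hypothetical and this is not kraken's actual code.

```python
def has_points(boundary):
    """Return True if `boundary` holds at least one coordinate.

    Relies on len() instead of the object's truth value, so it behaves
    the same for None, lists, tuples and array-like objects.
    """
    return boundary is not None and len(boundary) > 0


# Hypothetical values standing in for parsed PAGE line boundaries.
print(has_points(None))
print(has_points([]))
print(has_points([(0, 0), (10, 0)]))
```

With such a check, an empty or missing boundary can be skipped without the truth value of the coordinate container ever being evaluated.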
The ground truth data was produced at UB Tübingen with Transkribus. The Transkribus PAGE export typically has rather bad line boxes, but usually the baselines are better. Therefore I wanted to try a 2nd training with --repolygonize to see whether that works better. The log output shows a lot of problems with the data from the PAGE files and includes line numbers, but not the filenames. Maybe that can be improved, too.
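To illustrate the suggestion about filenames, here is a hedged sketch of a log message that carries the source file next to the line index. The function `report_failure`, the logger name and the file name `page_0001.xml` are all made up for illustration; kraken's real logging code may look quite different.

```python
import logging

logger = logging.getLogger("repolygonize_sketch")  # hypothetical logger name


def report_failure(filename, line_idx, error):
    """Log a polygonizer failure together with the file it occurred in."""
    message = f"Polygonizer failed on line {line_idx} of {filename}: {error}"
    logger.warning(message)
    return message


# Hypothetical file name and error text, modeled on the log above.
msg = report_failure("page_0001.xml", 30, "No intersection with boundaries.")
print(msg)
```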
On 21/12/18 08:20AM, Stefan Weil wrote:
> The ground truth data was produced at UB Tübingen with Transkribus. The Transkribus PAGE export typically has rather bad line boxes, but usually the baselines are better. Therefore I wanted to try a 2nd training with --repolygonize to see whether that works better. The log output shows a lot of problems with the data from the PAGE files and includes line numbers, but not the filenames. Maybe that can be improved, too.
Hm, sorry about that. A better way to repolygonize is to use the repolygonize.py script in contrib/, as that writes a new XML file. In any case, the Transkribus exports are often a bit problematic: their baselines float below the actual baseline, so repolygonizing can sometimes produce completely flat polygons. An auto-offset is already applied, but at times it isn't enough. So it is better to verify the result of the repolygonization with a viewer or the line extraction script (contrib/extract_lines.py).
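As a complement to visual inspection, completely flat polygons could also be flagged programmatically. The sketch below computes the polygon area with the shoelace formula and marks anything below a threshold as suspicious. The function names, the `min_area` threshold and the sample coordinates are assumptions for illustration and are not part of kraken or its contrib scripts.

```python
def polygon_area(points):
    """Absolute area of a simple polygon via the shoelace formula."""
    total = 0.0
    n = len(points)
    for i in range(n):
        x0, y0 = points[i]
        x1, y1 = points[(i + 1) % n]  # wrap around to close the polygon
        total += x0 * y1 - x1 * y0
    return abs(total) / 2.0


def is_flat(points, min_area=1.0):
    """Flag polygons with fewer than 3 points or (almost) no area."""
    return len(points) < 3 or polygon_area(points) < min_area


# Hypothetical line polygons in pixel coordinates.
flat = [(0, 100), (500, 100), (500, 100), (0, 100)]
box = [(0, 80), (500, 80), (500, 120), (0, 120)]
print(is_flat(flat), is_flat(box))
```

Lines flagged this way would still need a look in a viewer, since a small but non-zero area can also belong to a legitimately short line.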
Apart from that, I'm currently rewriting the training code and have introduced a binary dataset format which vastly accelerates training (100% GPU utilization without loader processes). It isn't finished yet, but the basics are working, so if you want, check out the feature/binary_dataset branch and run:
$ ketos compile -f xml --workers 8 -o gt-fraktur **/*.xml
and run ketos train as usual, just with the -f binary flag and the binary file(s) instead. You can skip the loader thread argument; it usually slows things down.
Curiously, I've noticed that I get far fewer segmentation errors (TopologyException and Polygonizer failed on line X) when I perform inference on the GPU.
Would it be possible to pad the images in order to run inference on the GPU in batches? It would be a lot faster...
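For illustration only, padding images of different sizes to a common shape could look roughly like the sketch below. It uses nested lists as stand-ins for image tensors, and the helper `pad_batch` is hypothetical. Whether kraken's inference code can consume such a batch is not something this thread establishes.

```python
def pad_batch(images, fill=0):
    """Pad 2D images (lists of rows) to a common height and width."""
    height = max(len(img) for img in images)
    width = max(len(row) for img in images for row in img)
    batch = []
    for img in images:
        # Pad every row on the right, then add full rows at the bottom.
        rows = [list(row) + [fill] * (width - len(row)) for row in img]
        rows += [[fill] * width for _ in range(height - len(rows))]
        batch.append(rows)
    return batch


# Two toy "images" of different sizes.
small = [[1, 2], [3, 4]]
wide = [[5, 6, 7]]
batch = pad_batch([small, wide])
print(batch)
```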
The --repolygonize option has now been removed in main, so the issue has become obsolete.