
layout-analysis: (re)train

bertsky opened this issue 7 months ago

Since there is no documentation here for the training process and the training data, we have to make guesses.

The current model for (logical / whole-page) layout-analysis contains 21 classes:

['annotation', 'binding', 'chapter', 'colour_checker', 'contained_work', 'contents', 'cover', 'edge', 'endsheet', 'epicedia', 'illustration', 'index', 'musical_notation', 'page', 'paste_down', 'preface', 'provenance', 'section', 'sermon', 'table', 'title_page']

This is clearly inadequate: it mixes very specialised, rare types (sermon) with coarse, frequent ones (page). It is also very unlikely that such a fine differentiation is feasible from the visual classification of pages alone, i.e. independently of each other and without sequence context. For example, how could the hierarchy levels chapter and section be distinguished reliably?

So IMO we should re-train this on a coarser set of types, say:

  • empty (covering all non-text divs like binding, colour_checker, cover, endsheet)
  • title_page
  • contents (also including index)
  • page.

Perhaps additionally discerning table, illustration and musical_notation pages is doable, but that may well be considered part of physical / structural layout analysis (as these region types rarely occur alone on a page).
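For concreteness, a relabelling of the existing 21 classes onto such a coarse set might look like the sketch below. The grouping of the fine classes not named above (and whether table, illustration and musical_notation get classes of their own) is only my guess and open for discussion:

```python
# Hypothetical fine-to-coarse mapping for relabelling existing ground truth.
# Only the examples given above (binding, colour_checker, cover, endsheet -> empty;
# index -> contents) are from the proposal; the rest is an assumption.
COARSE_CLASSES = {
    'empty':      ['binding', 'colour_checker', 'cover', 'edge', 'endsheet', 'paste_down'],
    'title_page': ['title_page'],
    'contents':   ['contents', 'index'],
    'page':       ['annotation', 'chapter', 'contained_work', 'epicedia',
                   'illustration', 'musical_notation', 'page', 'preface',
                   'provenance', 'section', 'sermon', 'table'],
}

# invert for lookup when converting existing annotations
FINE_TO_COARSE = {fine: coarse
                  for coarse, fines in COARSE_CLASSES.items()
                  for fine in fines}

assert len(FINE_TO_COARSE) == 21  # every current class gets exactly one coarse label
```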

Going back through the git history, it is evident that the model has been trained on (an older version of) keras.applications.InceptionV3:

https://github.com/OCR-D/ocrd_anybaseocr/blob/3e897af5fde12a3b1a2cd701c3d66e1f9cc74e78/ocrd_anybaseocr/cli/ocrd_anybaseocr_layout_analysis.py#L62-L66

https://github.com/OCR-D/ocrd_anybaseocr/blob/3e897af5fde12a3b1a2cd701c3d66e1f9cc74e78/ocrd_anybaseocr/cli/ocrd_anybaseocr_layout_analysis.py#L73

https://github.com/OCR-D/ocrd_anybaseocr/blob/3e897af5fde12a3b1a2cd701c3d66e1f9cc74e78/ocrd_anybaseocr/cli/ocrd_anybaseocr_layout_analysis.py#L81-L82

https://github.com/OCR-D/ocrd_anybaseocr/blob/3e897af5fde12a3b1a2cd701c3d66e1f9cc74e78/ocrd_anybaseocr/cli/ocrd_anybaseocr_layout_analysis.py#L161-L165

https://github.com/OCR-D/ocrd_anybaseocr/blob/3e897af5fde12a3b1a2cd701c3d66e1f9cc74e78/ocrd_anybaseocr/cli/ocrd_anybaseocr_layout_analysis.py#L85-L88

So input seems to be 600x500px grayscale (1-channel), with a batch dimension in front.
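Putting the linked snippets together, the model construction presumably looks roughly like the following sketch. Only the backbone (InceptionV3) and the input contract (600x500, 1 channel, leading batch dimension) are taken from the code above; the classification head, weights initialisation and compile settings are my assumptions:

```python
# Sketch only: reproduces the apparent input contract on top of
# keras.applications.InceptionV3. Head and compile settings are assumptions.
import numpy as np
from tensorflow.keras import layers, models
from tensorflow.keras.applications import InceptionV3

NUM_CLASSES = 21  # size of the current label set quoted above

backbone = InceptionV3(weights=None, include_top=False,
                       input_shape=(600, 500, 1),  # 600x500 px, grayscale
                       pooling='avg')
outputs = layers.Dense(NUM_CLASSES, activation='softmax')(backbone.output)
model = models.Model(inputs=backbone.input, outputs=outputs)
model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])

# inference on a single page: note the leading batch dimension
page = np.zeros((1, 600, 500, 1), dtype='float32')
probs = model.predict(page)  # -> shape (1, NUM_CLASSES)
```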

It would help to know what training data was previously used, though.

@n00blet could you please comment?

bertsky, May 07 '25 14:05