ocrd-website docs/ocrd-training: export from OCR-D toolchain

I am not sure I have a good grasp of what is ultimately intended by docs/ocrd-training.md, but as it stands, I think the page should at least link to (or better, describe) the 2 options we currently have for extracting line images and their metadata from PAGE-XML annotations:

- page2img.py: minimal dependencies (no ocrd, only lxml), but also minimal functionality
- ocrd-segment-extract-lines: normal OCR-D processor, capable of utilising/respecting all information the workflow provides...
  - AlternativeImage anywhere along the hierarchy (e.g. binarization or dewarping)
  - @orientation on page or region level (i.e. cropping the minimal bounding box after deskewing)
  - Coords/@points as polygon not just bounding box (masking the pixels outside; optionally with alpha channel)
  - provide line text (.gt.txt) and meta-data (IDs among PAGE hierarchy and METS, script/language features etc, region @type, page @type, image preprocessing features, image DPI value)
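To illustrate the "minimal functionality" end of that spectrum, here is a rough sketch of reading TextLine polygons and line text from PAGE-XML and reducing each polygon to a bounding box. It is not the actual code of page2img.py or ocrd-segment-extract-lines. The sample XML, the function names and the returned dict layout are all made up for illustration. It uses the standard library's xml.etree instead of lxml, and it ignores AlternativeImage and @orientation entirely.

```python
import xml.etree.ElementTree as ET

# Hand-written PAGE-XML fragment for illustration only (hypothetical data).
SAMPLE = """
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="page_0001.png" imageWidth="1000" imageHeight="800">
    <TextRegion id="r1">
      <Coords points="90,90 910,90 910,210 90,210"/>
      <TextLine id="r1_l1">
        <Coords points="100,100 900,110 900,150 100,140"/>
        <TextEquiv><Unicode>first line</Unicode></TextEquiv>
      </TextLine>
      <TextLine id="r1_l2">
        <Coords points="100,160 900,160 900,200 100,200"/>
        <TextEquiv><Unicode>second line</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>
"""


def parse_points(points):
    """Turn a PAGE 'x1,y1 x2,y2 ...' string into a list of (x, y) int tuples."""
    return [tuple(int(v) for v in pair.split(",")) for pair in points.split()]


def bounding_box(polygon):
    """Return (x_min, y_min, x_max, y_max) of a polygon."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return min(xs), min(ys), max(xs), max(ys)


def extract_lines(page_xml):
    """Collect ID, polygon, bounding box and text of every TextLine."""
    root = ET.fromstring(page_xml)
    # The PAGE namespace differs between schema versions, so read it off
    # the root tag (assumes the document is namespaced at all).
    ns = {"pc": root.tag[1:].split("}")[0]}
    lines = []
    for line in root.iterfind(".//pc:TextLine", ns):
        coords = line.find("pc:Coords", ns)
        if coords is None or not coords.get("points"):
            continue  # nothing to crop without coordinates
        polygon = parse_points(coords.get("points"))
        # Only the line-level TextEquiv (direct child), not Word-level ones.
        text = line.findtext("pc:TextEquiv/pc:Unicode", default="", namespaces=ns)
        lines.append({
            "id": line.get("id"),
            "polygon": polygon,
            "bbox": bounding_box(polygon),
            "text": text,
        })
    return lines


if __name__ == "__main__":
    for entry in extract_lines(SAMPLE):
        print(entry["id"], entry["bbox"], repr(entry["text"]))
```

The bounding box is what a simple cropper would cut out. Keeping the polygon around is what allows masking the pixels outside the line, as described in the list above.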

Mar 28 '20 01:03 bertsky

That page was supposed to provide a "running start" for @Doreenruirui when she started working on what would become okralact. It is true though that we should provide an actual guide on training and your suggestions are welcome.

May 11 '20 12:05 kba

Understood.

Another thing that this page or guide should mention is converters for page segmentation training data. With ocrd-segment-from-masks and ocrd-segment-from-coco we have 2 importers and with the debug images and coco output of ocrd-segment-extract-pages we have 2 exporters for commonly used non-PAGE formats.
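For readers unfamiliar with the COCO side of such a conversion, here is a rough sketch of how region polygons map onto the general COCO object-detection JSON layout (images, categories, annotations). It is not the exact output of ocrd-segment-extract-pages. The function names, category names and file name are hypothetical, and a single image per file is assumed.

```python
import json


def polygon_area(polygon):
    """Area of a simple polygon via the shoelace formula."""
    total = 0.0
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def to_coco(image_name, width, height, regions, categories):
    """Build a COCO-style dict for one image.

    regions: list of (category_name, [(x, y), ...]) tuples
    categories: list of category names; IDs are assigned from 1 upwards
    """
    cat_ids = {name: i + 1 for i, name in enumerate(categories)}
    coco = {
        "images": [{"id": 1, "file_name": image_name,
                    "width": width, "height": height}],
        "categories": [{"id": cid, "name": name}
                       for name, cid in cat_ids.items()],
        "annotations": [],
    }
    for ann_id, (category, polygon) in enumerate(regions, start=1):
        xs = [x for x, _ in polygon]
        ys = [y for _, y in polygon]
        x0, y0 = min(xs), min(ys)
        coco["annotations"].append({
            "id": ann_id,
            "image_id": 1,
            "category_id": cat_ids[category],
            # COCO stores each polygon as a flat [x1, y1, x2, y2, ...] list
            "segmentation": [[c for point in polygon for c in point]],
            # COCO boxes are [x, y, width, height], not corner pairs
            "bbox": [x0, y0, max(xs) - x0, max(ys) - y0],
            "area": polygon_area(polygon),
            "iscrowd": 0,
        })
    return coco


if __name__ == "__main__":
    example = to_coco(
        "page_0001.png", 1000, 800,
        [("TextRegion", [(90, 90), (910, 90), (910, 210), (90, 210)])],
        ["TextRegion", "ImageRegion"],
    )
    print(json.dumps(example, indent=2))
```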

May 11 '20 12:05 bertsky

Can perhaps be closed – there's a section on the ocrd_segment converters in https://ocr-d.de/en/workflows#step-19-format-conversion now. (And page2img is independent of OCR-D and most OCR tools: tesstrain will probably include its own PAGE converter and Calamari already does. If you do mention it somewhere, then please don't forget https://github.com/uniwue-zpd/PAGETools, too.)

Aug 26 '21 11:08 bertsky

I think these are now addressed and the originally referenced page removed, so closing.

Apr 25 '23 12:04 kba
