ocrd-website
docs/ocrd-training: export from OCR-D toolchain
I am not sure I have a good grasp of what is ultimately intended by docs/ocrd-training.md, but as it stands, I think the page should at least link to (or better, describe) the two options we currently have for extracting line images and the respective metadata from PAGE-XML annotations:
- `page2img.py`: minimal dependencies (no `ocrd`, only `lxml`), but also minimal functionality
- `ocrd-segment-extract-lines`: normal OCR-D processor, capable of utilising/respecting all information the workflow provides...
  - `AlternativeImage` anywhere along the hierarchy (e.g. binarization or dewarping)
  - `@orientation` on page or region level (i.e. cropping the minimal bounding box after deskewing)
  - `Coords/@points` as polygon, not just bounding box (masking the pixels outside; optionally with alpha channel)
  - provide line text (`.gt.txt`) and metadata (IDs along the PAGE hierarchy and METS, script/language features etc., region `@type`, page `@type`, image preprocessing features, image DPI value)
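To make the extraction task concrete, here is a minimal sketch of the kind of work `page2img.py` does: walking the PAGE-XML hierarchy, collecting each `TextLine`'s bounding box from `Coords/@points` and its ground-truth text from `TextEquiv/Unicode`. It uses only the stdlib `ElementTree` (real `page2img.py` uses `lxml`); the namespace version and function names are assumptions, and actual image cropping is left out.

```python
# Sketch: extract line bounding boxes and ground-truth text from PAGE-XML.
# Stdlib only; namespace version and names are illustrative assumptions.
import xml.etree.ElementTree as ET

NS = {'pc': 'http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15'}

def points_to_bbox(points):
    """Convert a PAGE Coords/@points string to (xmin, ymin, xmax, ymax)."""
    xy = [tuple(map(int, p.split(','))) for p in points.split()]
    xs, ys = zip(*xy)
    return min(xs), min(ys), max(xs), max(ys)

def extract_lines(page_xml):
    """Yield (line_id, bbox, text) for every TextLine in the document."""
    root = ET.fromstring(page_xml)
    for line in root.iter('{%s}TextLine' % NS['pc']):
        coords = line.find('pc:Coords', NS)
        bbox = points_to_bbox(coords.get('points'))
        unicode_el = line.find('pc:TextEquiv/pc:Unicode', NS)
        text = unicode_el.text if unicode_el is not None else ''
        yield line.get('id'), bbox, text
```

From there, one would open the page image (e.g. with Pillow), crop each bounding box, and write the text to a matching `.gt.txt` file; note this is exactly the "minimal functionality" level above (no polygon masking, no `AlternativeImage` or `@orientation` handling).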
That page was supposed to provide a "running start" for @Doreenruirui when she started working on what would become okralact. It is true though that we should provide an actual guide on training and your suggestions are welcome.
Understood.
Another thing that this page or guide should mention is converters for page segmentation training data. With `ocrd-segment-from-masks` and `ocrd-segment-from-coco` we have two importers, and with the debug images and COCO output of `ocrd-segment-extract-pages` we have two exporters for commonly used non-PAGE formats.
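For readers unfamiliar with the target format, here is a hedged sketch of the shape of data such a COCO exporter produces for one region: a single annotation dict following the standard COCO object-detection layout. The helper name and the region data are made up for illustration; this is not the actual `ocrd-segment-extract-pages` code.

```python
# Sketch: build one COCO-style annotation dict from a region polygon
# given as a flat [x1, y1, x2, y2, ...] list. Names are illustrative.
def region_to_coco_annotation(ann_id, image_id, category_id, polygon):
    xs, ys = polygon[0::2], polygon[1::2]
    xmin, ymin = min(xs), min(ys)
    w, h = max(xs) - xmin, max(ys) - ymin
    return {
        'id': ann_id,
        'image_id': image_id,
        'category_id': category_id,
        'segmentation': [polygon],      # COCO polygon segmentation
        'bbox': [xmin, ymin, w, h],     # COCO uses [x, y, width, height]
        'area': w * h,                  # bounding-box area as approximation
        'iscrowd': 0,
    }
```

A full export would wrap a list of such dicts (plus `images` and `categories` entries) into one JSON file per dataset, which is what mask-RCNN-style trainers consume.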
Can perhaps be closed – there's a section on the `ocrd_segment` converters in https://ocr-d.de/en/workflows#step-19-format-conversion now. (And `page2img` is independent of OCR-D and most OCR tools: tesstrain will probably include its own PAGE converter and Calamari already does. If you do mention it somewhere, then please don't forget https://github.com/uniwue-zpd/PAGETools, too.)
I think these are now addressed and the originally referenced page has been removed, so closing.