docs/ocrd-training: export from OCR-D toolchain

Open bertsky opened this issue 5 years ago • 3 comments

I am not sure I have a good grasp of what is ultimately intended by docs/ocrd-training.md, but as it stands, I think the page should at least link to (or better, describe) the 2 options we currently have to extract line images and their respective metadata from PAGE-XML annotations:

  • page2img.py: minimal dependencies (no ocrd, only lxml), but also minimal functionality
  • ocrd-segment-extract-lines: normal OCR-D processor, capable of utilising/respecting all information the workflow provides...
    • AlternativeImage anywhere along the hierarchy (e.g. binarization or dewarping)
    • @orientation on page or region level (i.e. cropping the minimal bounding box after deskewing)
    • Coords/@points as polygon not just bounding box (masking the pixels outside; optionally with alpha channel)
    • provide line text (.gt.txt) and meta-data (IDs among PAGE hierarchy and METS, script/language features etc, region @type, page @type, image preprocessing features, image DPI value)
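To make the difference between bounding boxes and polygons concrete, here is a minimal sketch of the first step either tool has to perform: reading each TextLine's Coords/@points polygon and its line-level text from a PAGE-XML document. It is not the code of page2img.py or ocrd-segment-extract-lines; it uses Python's standard-library ElementTree rather than lxml, and the function names (`parse_points`, `bounding_box`, `iter_lines`) and the `SAMPLE` document are hypothetical, made up for illustration. The PAGE namespace is read from the root element, so the sketch assumes a namespaced PcGts root.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written PAGE-XML snippet used only for illustration.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="page.png" imageWidth="100" imageHeight="50">
    <TextRegion id="r1">
      <Coords points="0,0 100,0 100,50 0,50"/>
      <TextLine id="l1">
        <Coords points="10,5 90,8 90,20 10,18"/>
        <TextEquiv><Unicode>Hello</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""


def parse_points(points):
    """Turn a Coords/@points string ("x1,y1 x2,y2 ...") into (x, y) tuples."""
    return [tuple(int(v) for v in pair.split(",")) for pair in points.split()]


def bounding_box(polygon):
    """Return the minimal axis-aligned box (x0, y0, x1, y1) around a polygon."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return min(xs), min(ys), max(xs), max(ys)


def iter_lines(page_xml):
    """Yield id, polygon, bounding box and text for every TextLine."""
    root = ET.fromstring(page_xml)
    # Assumption: the root tag looks like "{namespace}PcGts".
    ns = {"pc": root.tag[1:].split("}")[0]}
    for line in root.iterfind(".//pc:TextLine", ns):
        coords = line.find("pc:Coords", ns)
        if coords is None or not coords.get("points"):
            continue  # skip lines without usable coordinates
        polygon = parse_points(coords.get("points"))
        # Only the line-level TextEquiv is read, not word or glyph level.
        unicode_el = line.find("pc:TextEquiv/pc:Unicode", ns)
        text = (unicode_el.text or "") if unicode_el is not None else ""
        yield {
            "id": line.get("id"),
            "polygon": polygon,
            "bbox": bounding_box(polygon),
            "text": text,
        }


if __name__ == "__main__":
    for entry in iter_lines(SAMPLE):
        print(entry["id"], entry["bbox"], repr(entry["text"]))
```

Cropping to `bbox` alone would keep pixels of neighbouring lines that fall inside the box but outside the polygon; masking with `polygon` is what removes them. Actual cropping, masking, AlternativeImage selection and deskewing need an image library and the rest of the workflow information, none of which is shown here.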

bertsky avatar Mar 28 '20 01:03 bertsky

That page was supposed to provide a "running start" for @Doreenruirui when she started working on what would become okralact. It is true though that we should provide an actual guide on training and your suggestions are welcome.

kba avatar May 11 '20 12:05 kba

Understood.

Another thing that this page or guide should mention is converters for page segmentation training data. With ocrd-segment-from-masks and ocrd-segment-from-coco we have 2 importers and with the debug images and coco output of ocrd-segment-extract-pages we have 2 exporters for commonly used non-PAGE formats.

bertsky avatar May 11 '20 12:05 bertsky

Can perhaps be closed – there's a section on the ocrd_segment converters in https://ocr-d.de/en/workflows#step-19-format-conversion now. (And page2img is independent of OCR-D and most OCR tools: tesstrain will probably include its own PAGE converter and Calamari already does. If you do mention it somewhere, then please don't forget https://github.com/uniwue-zpd/PAGETools, too.)

bertsky avatar Aug 26 '21 11:08 bertsky

I think these are now addressed and the originally referenced page has been removed, so closing.

kba avatar Apr 25 '23 12:04 kba