

hOCR-to-ALTO

Convert between Tesseract hOCR and ALTO XML 2.0/2.1/3/4 using XSL stylesheets

The stylesheets use XSLT 2.0 features, so they require an XSLT 2.0-capable processor such as Saxon.

To run a conversion from the Saxon-HE command line, for example converting ALTO to hOCR:

 > java -jar saxon-he.jar -s:input-alto.xml -xsl:alto__hocr.xsl -o:output-hocr.xml
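When many files need converting, the same command can be assembled programmatically. The following is a minimal sketch, not part of this repository: it uses only the `-s:`, `-xsl:` and `-o:` options shown above, and the function names, the `saxon-he.jar` default path and the one-directory-in, one-directory-out layout are assumptions made for illustration.

```python
import subprocess
from pathlib import Path


def build_saxon_command(source, stylesheet, output, jar="saxon-he.jar"):
    """Return the argument list for one Saxon-HE transformation.

    The jar path is an assumption; point it at your local Saxon-HE jar.
    """
    return [
        "java", "-jar", str(jar),
        f"-s:{source}",
        f"-xsl:{stylesheet}",
        f"-o:{output}",
    ]


def convert_directory(input_dir, output_dir, stylesheet, suffix, jar="saxon-he.jar"):
    """Apply one stylesheet to every *.xml file in input_dir (hypothetical layout).

    Each result is written to output_dir under the input's base name plus suffix.
    """
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    for source in sorted(Path(input_dir).glob("*.xml")):
        target = output_dir / (source.stem + suffix)
        # check=True raises CalledProcessError if Saxon exits with an error.
        subprocess.run(build_saxon_command(source, stylesheet, target, jar), check=True)
```

For example, `convert_directory("alto", "hocr", "alto__hocr.xsl", ".hocr")` would run the ALTO to hOCR stylesheet over every XML file in a directory named `alto`, provided Java and the Saxon-HE jar are available.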

See ocr-fileformat for an interface to these stylesheets.

hOCR specification: https://github.com/kba/hocr-spec

File naming scheme: sourceFormatVersion__targetFormatVersion.xsl
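The naming scheme can be read mechanically, which is handy when a script has to pick a stylesheet for a given source and target format. The sketch below is illustrative only: `parse_stylesheet_name` is a hypothetical helper, and the rule that a version is a trailing run of digits and dots is an assumption inferred from the file names listed in this README.

```python
import re

# Assumption: a format label is letters followed by an optional version
# made of digits and dots, e.g. "alto2.1", "alto4", "hocr", "text".
_FORMAT_RE = re.compile(r"^([A-Za-z]+)([0-9.]*)$")


def _split_label(label):
    """Split 'alto2.1' into ('alto', '2.1'); the version may be empty."""
    match = _FORMAT_RE.match(label)
    if match is None:
        raise ValueError(f"unrecognised format label: {label!r}")
    return match.group(1), match.group(2)


def parse_stylesheet_name(filename):
    """Return (source, source_version, target, target_version) for a stylesheet name."""
    if not filename.endswith(".xsl"):
        raise ValueError(f"not an .xsl file: {filename!r}")
    stem = filename[: -len(".xsl")]
    # The double underscore separates the source format from the target format.
    source, separator, target = stem.partition("__")
    if not separator:
        raise ValueError(f"missing '__' separator: {filename!r}")
    return _split_label(source) + _split_label(target)


print(parse_stylesheet_name("hocr__alto2.1.xsl"))  # ('hocr', '', 'alto', '2.1')
print(parse_stylesheet_name("alto__text.xsl"))     # ('alto', '', 'text', '')
```

An empty version string means the file name carries no version for that side, as with `alto__hocr.xsl`.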

CONTENTS

  • Convert ALTO to hOCR
    • alto__hocr.xsl
  • Convert hOCR to ALTO
    • hocr__alto4.xsl
    • hocr__alto3.xsl
    • hocr__alto2.1.xsl
    • hocr__alto2.0.xsl
  • Convert ALTO to plain text
    • alto__text.xsl
  • Convert hOCR to plain text
    • hocr__text.xsl
  • Language codes
    • codes_lookup.xml - generated with https://github.com/filak/iso-language-codes