hOCR-to-ALTO
Convert between Tesseract hOCR and ALTO XML 2.0/2.1/3/4 using XSL stylesheets
The stylesheets use XSLT 2.0 features, so they require an XSLT 2.0-capable processor such as Saxon.
Example of running a conversion with the Saxon-HE command line, here converting ALTO to hOCR:
> java -jar saxon-he.jar -s:input-alto.xml -xsl:alto__hocr.xsl -o:output-hocr.xml
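The other stylesheets can presumably be invoked the same way, changing only the `-s:`, `-xsl:` and `-o:` arguments. The following is a minimal sketch of a wrapper, not part of this repository: the function name `convert_ocr`, the `SAXON_JAR` variable and the `DRY_RUN` switch are all hypothetical, and it assumes the Saxon-HE jar is available as `saxon-he.jar` unless `SAXON_JAR` says otherwise.

```shell
# Assumed location of the Saxon-HE jar; override by exporting SAXON_JAR.
SAXON_JAR="${SAXON_JAR:-saxon-he.jar}"

# Hypothetical helper: run one stylesheet over one input file.
#   $1 = input file, $2 = stylesheet, $3 = output file
convert_ocr() {
    if [ "$#" -ne 3 ]; then
        echo "usage: convert_ocr INPUT STYLESHEET OUTPUT" >&2
        return 2
    fi
    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Only print the command that would be executed.
        echo java -jar "$SAXON_JAR" "-s:$1" "-xsl:$2" "-o:$3"
    else
        java -jar "$SAXON_JAR" "-s:$1" "-xsl:$2" "-o:$3"
    fi
}
```

For instance, `convert_ocr input-hocr.xml hocr__alto4.xsl output-alto.xml` would apply the hOCR-to-ALTO 4 stylesheet (the input and output file names are placeholders); with `DRY_RUN=1` set, the command is printed instead of run.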
See ocr-fileformat for an interface to these stylesheets.
hOCR specification: https://github.com/kba/hocr-spec
File naming scheme: sourceFormatVersion__targetFormatVersion.xsl
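As an illustration of the naming scheme, the sketch below builds a stylesheet file name from a source and target format and splits one back apart. Both function names are hypothetical helpers, not part of this repository, and the sketch assumes the two parts are always separated by a double underscore, as in the files listed under CONTENTS.

```shell
# Build a stylesheet file name from source and target format (with version),
# e.g. "hocr" and "alto4". Illustrative helper only.
stylesheet_name() {
    printf '%s__%s.xsl\n' "$1" "$2"
}

# Split a stylesheet file name back into "source target".
stylesheet_formats() {
    base=${1%.xsl}    # drop the .xsl extension
    # Text before the first "__" is the source, text after it is the target.
    printf '%s %s\n' "${base%%__*}" "${base#*__}"
}
```

With these definitions, `stylesheet_name hocr alto4` yields `hocr__alto4.xsl`, and `stylesheet_formats hocr__alto2.1.xsl` yields `hocr alto2.1`.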
CONTENTS
- Convert ALTO to hOCR
  - alto__hocr.xsl
- Convert hOCR to ALTO
  - hocr__alto4.xsl
  - hocr__alto3.xsl
  - hocr__alto2.1.xsl
  - hocr__alto2.0.xsl
- Convert ALTO to plain text
  - alto__text.xsl
- Convert hOCR to plain text
  - hocr__text.xsl
- Language codes
  - codes_lookup.xml (generated with https://github.com/filak/iso-language-codes)