
ALTO version with latest release

Open ghost opened this issue 5 years ago • 6 comments

Previously, we used pdfalto to generate ALTO XML from a PDF, and then https://github.com/filak/hOCR-to-ALTO to convert the ALTO XML to an hOCR file. With the newest release of pdfalto this no longer works, since the ALTO version seems to have changed. Can you share which version of ALTO pdfalto currently produces?

ghost avatar Mar 04 '19 14:03 ghost

Could you provide more information? Are there any logs or an error stack trace?

Aazhar avatar Mar 05 '19 12:03 Aazhar

The ALTO schema version didn't change. Version 3.1 has been used since the first pdfalto release: https://github.com/kermitt2/pdfalto/blob/master/schema/alto.xsd

Aazhar avatar Mar 05 '19 13:03 Aazhar

Earlier, the namespace declaration in the ALTO XML was xmlns="http://www.loc.gov/standards/alto/ns-v3#", but now I get xmlns="http://www.loc.gov/standards/alto/v3/alto.xsd"

ghost avatar Mar 05 '19 14:03 ghost

This was updated because the first link is wrong: it doesn't point to the schema.

Aazhar avatar Mar 05 '19 14:03 Aazhar

@Aazhar The schema location and the namespace URI don't have to be identical. xmlns should be http://www.loc.gov/standards/alto/ns-v3# (see targetNamespace="http://www.loc.gov/standards/alto/ns-v3#" in http://www.loc.gov/standards/alto/v3/alto.xsd)

For the schema location, you can use something like:

xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.loc.gov/standards/alto/v3/alto.xsd"
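
To illustrate why the distinction matters for downstream converters, here is a minimal Python sketch (standard library only) showing that a parser identifies ALTO elements by the namespace URI, while xsi:schemaLocation is just an attribute carrying a hint. The sample document and the helper name describe_root are made up for illustration and are not pdfalto output. The sketch also assumes the pair form of xsi:schemaLocation (namespace URI followed by schema URL), which is how the XML Schema specification defines that attribute.

```python
import xml.etree.ElementTree as ET

ALTO_NS = "http://www.loc.gov/standards/alto/ns-v3#"
XSI_NS = "http://www.w3.org/2001/XMLSchema-instance"
SCHEMA_URL = "http://www.loc.gov/standards/alto/v3/alto.xsd"

# Hypothetical minimal document, not real pdfalto output.
# Assumption: schemaLocation is written as "namespace-URI schema-URL".
SAMPLE = f'''<alto xmlns="{ALTO_NS}"
      xmlns:xsi="{XSI_NS}"
      xsi:schemaLocation="{ALTO_NS} {SCHEMA_URL}">
  <Layout/>
</alto>'''


def describe_root(xml_text):
    """Return (namespace, local name, schemaLocation) of the root element."""
    root = ET.fromstring(xml_text)
    # ElementTree spells a namespaced tag as "{namespace}localname".
    if root.tag.startswith("{"):
        namespace, _, local_name = root.tag[1:].partition("}")
    else:
        namespace, local_name = "", root.tag
    # The schema location is an ordinary attribute in the xsi namespace.
    schema_location = root.get(f"{{{XSI_NS}}}schemaLocation")
    return namespace, local_name, schema_location


namespace, local_name, schema_location = describe_root(SAMPLE)
print("namespace:     ", namespace)
print("root element:  ", local_name)
print("schemaLocation:", schema_location)

# Element lookups depend on the namespace URI, not on the schema URL.
root = ET.fromstring(SAMPLE)
print("Layout found:  ", root.find(f"{{{ALTO_NS}}}Layout") is not None)
```

A tool that looks elements up under the ns-v3# namespace would find nothing if xmlns were set to the schema URL instead, which is consistent with the breakage reported above.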

burki avatar Jul 01 '19 14:07 burki

Added xsi:schemaLocation with d49bf77204d1700b7263cb2641aa508c33058c9c

<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#" xsi:schemaLocation="http://www.loc.gov/standards/alto/v3/alto.xsd">

kermitt2 avatar Apr 07 '21 14:04 kermitt2