ocr-fileformat
Cannot convert hOCR with xhtml namespace to ALTO 2.1
$ ocr-transform hocr alto2.1 in.html out.xml
Error
SXXP0005: The source document is in namespace http://www.w3.org/1999/xhtml, but all the
template rules match elements in no namespace (Use --suppressXsltNamespaceCheck:on to
avoid this warning)
hOCR:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.w3.org/1999/xhtml" xsi:schemaLocation="http://www.w3.org/1999/xhtml http://www.w3.org/MarkUp/SCHEMA/xhtml11.xsd">
<head>
<title>some-id</title>
<meta http-equiv="content-type" content="text/html; charset=utf-8"/>
<meta http-equiv="content-style-type" content="text/css"/>
<meta name="ocr-capabilities" content="ocr_page ocr_par ocr_block ocrx_block ocrx_word ocr_line"/>
</head>
<!-- .... -->
</html>
When I suppress the namespace check, the resulting ALTO file contains no elements other than the root <alto>, which holds the OCR text as plain text without any structuring elements.
Hello JBaiter, it sounds like the XHTML namespace is declared only on the root element and the other XHTML elements inherit it, because they do not declare it explicitly. It would be great if you could upload an example file so that I can confirm my theory. Which program did you use to create the source file?
The file comes from a partner's custom OCR engine. Unfortunately, I don't think I'm allowed to share a sample file, but here's a small excerpt that should suffice to confirm your theory. The elements indeed don't explicitly state the namespace but inherit it from the root element:
<span class="ocr_line" title="bbox 394 1972 1419 2019;x_wconf 86">
<span class="ocrx_word" title="bbox 394 1972 447 2019;x_wconf 95">In</span> <span class="ocrx_word" title="bbox 459 1972 561 2019;x_wconf 95">Paris</span> <span class="ocrx_word" title="bbox 573 1972 656 2019;x_wconf 78">trägt</span> <span class="ocrx_word" title="bbox 669 1972 745 2019;x_wconf 95">man</span> <span class="ocrx_word" title="bbox 752 1972 1125 2019;x_wconf 61">Busen-Hemdknöpfchen</span> <span class="ocrx_word" title="bbox 1139 1972 1200 2019;x_wconf 95">mit</span> <span class="ocrx_word" title="bbox 1208 1972 1369 2019;x_wconf 95">Brillanten</span> <span class="ocrx_word" title="bbox 1376 1972 1419 2019;x_wconf 78">ein</span>
</span>
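The inheritance described above can be checked with a small Python sketch. It uses only the standard library's `xml.etree.ElementTree` rather than the Saxon/XSLT toolchain from this issue, and the shortened `HOCR` document is a hypothetical stand-in for the real file. It is meant to illustrate why a name test without a namespace finds nothing, not to reproduce the transformation itself.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical, shortened hOCR document: only the root declares the namespace.
HOCR = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <span class="ocr_line" title="bbox 394 1972 1419 2019;x_wconf 86">
      <span class="ocrx_word" title="bbox 394 1972 447 2019;x_wconf 95">In</span>
      <span class="ocrx_word" title="bbox 459 1972 561 2019;x_wconf 95">Paris</span>
    </span>
  </body>
</html>"""

root = ET.fromstring(HOCR)

# Unqualified name: only matches elements in no namespace.
unqualified = root.findall(".//span")

# Qualified name: matches the spans that inherited the XHTML namespace.
qualified = root.findall(".//{%s}span" % XHTML_NS)

print(root.tag)
print(len(unqualified), len(qualified))
```

Every `span` ends up in the XHTML namespace even though none of them declares it, so only the qualified lookup finds them.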
The XSLT scripts match elements by local name only, without a namespace; cf. https://github.com/filak/hOCR-to-ALTO/blob/master/hocr2alto2.1.xsl. I think I ran into this before: https://github.com/filak/hOCR-to-ALTO/commit/9f8026cd2b61bd842aa40dff5598f2d0bbd19b07 .
Sorry for the late response. I am still trying to fix the problem. Your code snippet runs without any problems on my local machine but not on the server, and it seems that I have another parsing issue. For now, I encourage you to update the program and give it another try.
For the example file https://raw.githubusercontent.com/kba/ocr-fileformat-samples/master/samples/hocr/1.1/417576986_0013.hocr I can run hocr2alto2.0 and the result looks fine. However, when I run hocr2alto2.1, the result does not look right, but only when I use the web GUI. Is this a bug on our side?
With the updated SAXON and updated hocr__alto scripts, I can no longer reproduce this issue. The file I linked above works fine for all transformations in v0.3.0. @jbaiter Can you test your example files again and let us know if there is still a problem? Otherwise I suggest closing this issue.
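Since the stylesheets match elements in no namespace, one possible workaround on older versions is to strip the namespace from the hOCR input before transforming it. The following is a minimal sketch under that assumption. `strip_namespaces` is a hypothetical helper, not part of ocr-fileformat, and it uses Python's standard `xml.etree.ElementTree`, which discards comments and the DOCTYPE on parsing.

```python
import xml.etree.ElementTree as ET


def strip_namespaces(xml_text: str) -> str:
    """Return the document with all element names moved to no namespace."""
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        # ElementTree stores namespaced tags as "{uri}local"; keep only "local".
        if isinstance(elem.tag, str) and elem.tag.startswith("{"):
            elem.tag = elem.tag.split("}", 1)[1]
    return ET.tostring(root, encoding="unicode")


# Hypothetical one-line sample; a real hOCR file would be read from disk.
sample = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    '<span class="ocrx_word">Paris</span></body></html>'
)
stripped = strip_namespaces(sample)
print(stripped)
```

The stripped output could then be written to a file and passed to `ocr-transform` in place of the original. Only element names are rewritten here; namespaced attributes such as `xsi:schemaLocation` are left as they are.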