tesseract
The hocr output is not displayed as xhtml in Chrome
Current Behavior
Tesseract's hocr output is not displayed as xhtml in Chrome.
Firefox displays the hocr file as expected.
Expected Behavior
No response
Suggested Fix
If I change the file extension from hocr to xhtml the file is displayed as expected in Chrome.
tesseract -v
tesseract 5.3.1
leptonica-1.83.1
libgif 5.1.9 : libjpeg 8d (libjpeg-turbo 2.1.2) : libpng 1.6.37 : libtiff 4.3.0 : zlib 1.2.11 : libwebp 1.2.2 : libopenjp2 2.4.0
Found AVX
Found SSE4.1
Found OpenMP 201511
Operating System
Ubuntu 22.04 Jammy
Other Operating System
No response
uname -a
Linux amit-desktop 5.19.0-38-generic #39~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Fri Mar 17 21:16:15 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
Compiler
gcc version 11.3.0 (Ubuntu 11.3.0-1ubuntu1~22.04)
CPU
No response
Virtualization / Containers
No response
Other Information
Chrome Version: 111.0.5563.146 (Official Build) snap (64-bit)
I think using either .html or _hocr.html as a suffix would probably be more recognizable to users than .xhtml. @MerlijnWajer seems to be using _hocr.html so presumably he would be in favor of that.
I agree with the suggestion to change the file extension from .hocr to _hocr.html.
@stweil, @zdenop, your opinion?
CC: @kba, @bertsky
Thanks for the mention @tfmorris. I would like to add that I went for _hocr.html purely for archive.org internal reasons, to match archive.org's naming scheme (to indicate the contents of the document rather than just the scheme). I would be fine if Tesseract would like to use that here (it's clear), but I also see the virtues of just .html.
Notably, the hOCR spec says the following:
File extension(s):
*.html, *.hocr
So that would probably rule out *.xhtml if one wanted to conform to the spec.
So that would probably rule out *.xhtml if one wanted to conform to the spec.
The .hocr extension is a problem when working with local files. When serving hOCR from a web server, setting the Content-Type to something browsers interpret as (X)HTML works, but if it has to be a local file, I tend to use .hocr.html myself.
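For the web server case, a minimal sketch using Python's standard `http.server` could look like the following. The class name `HocrHandler`, the helper `make_server`, and the choice of `application/xhtml+xml` as the Content-Type are assumptions made for illustration; `text/html` may be an equally valid choice depending on how the browser should parse the file.

```python
from functools import partial
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer


class HocrHandler(SimpleHTTPRequestHandler):
    """Static file handler that serves *.hocr files as XHTML (assumed MIME type)."""

    # Keep the default mappings and add one for the .hocr extension.
    extensions_map = {
        **SimpleHTTPRequestHandler.extensions_map,
        ".hocr": "application/xhtml+xml",
    }


def make_server(directory=".", port=8000):
    """Create (but do not start) a local server for the given directory."""
    handler = partial(HocrHandler, directory=directory)
    return ThreadingHTTPServer(("127.0.0.1", port), handler)


# Usage (blocks until interrupted):
#   make_server("ocr-output").serve_forever()
```

This only helps when the files are fetched over HTTP; a file opened directly from disk is still typed by its extension.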
In other words, should we change the spec to recommend the following file extensions, in order of preference: *.hocr.html, *.hocr.xhtml, *.html, *.xhtml, *.hocr?
Personally I'd prefer .html because a duplicate extension like .hocr.html is not very common and I am not sure how libraries which can split filenames would handle this special case if they are asked to get the extension.
We have an (unrelated) problem with the extension for ALTO XML as soon as we add support for PAGE XML (see https://github.com/JKamlah/tesseract/tree/PageExport) because both formats use .xml by default.
Maybe we should keep the current defaults, but add new parameters file_extension_hocr, file_extension_alto and so on which can override the default extension including the leading dot. Then everybody is free to choose any desired extension. With additional support for system wide and personal Tesseract configuration files this would also be easy to use.
I share @stweil's concerns about multiple dots confusing things, which is why I suggested _hocr.html, but I'd be fine with just .html (IA has both _chocr.html and _hocr.html, which is not an issue if we're just generating a single file). As for adding additional naming parameters, it's pretty simple for a script / workflow to rename output files, so this is more about getting the default to be user friendly, whether those parameters exist or not.
Having said all that, I think it's a low priority issue because there is an easy workaround (renaming the file yourself).
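As a rough illustration of that workaround, a script along these lines could rename Tesseract's output after the fact. The function name `rename_hocr` and the default target suffix `.html` are assumptions for this sketch, not anything Tesseract provides.

```python
from pathlib import Path


def rename_hocr(directory, new_suffix=".html"):
    """Rename every *.hocr file in `directory` and return the new paths.

    `new_suffix` must start with a dot (a requirement of Path.with_suffix).
    """
    renamed = []
    for src in sorted(Path(directory).glob("*.hocr")):
        dst = src.with_suffix(new_suffix)
        if dst.exists():
            # Skip rather than overwrite an existing file.
            continue
        src.rename(dst)
        renamed.append(dst)
    return renamed
```

Calling `rename_hocr("ocr-output")` would turn `page.hocr` into `page.html` and leave all other files untouched.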
So I suggest keeping .hocr for now because that requires the least effort and does not break compatibility with existing Tesseract releases.
I am not sure how libraries which can split filenames would handle this special case if they are asked to get the extension.
>>> from pathlib import Path
>>> Path('foo.hocr.html').suffix
'.html'
>>> Path('foo.hocr.html').suffixes
['.hocr', '.html']
So it is easy to get this wrong. Even if we recommend *.html as the preferred extension, users are still free to use the .hocr.html convention if they so choose.
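Code that has to cope with several of the conventions mentioned in this thread could match on known endings instead of relying on `suffix`. The following is only a sketch: the helper `split_hocr_name` and the list `HOCR_ENDINGS` are hypothetical names, and the set of endings is taken from this discussion rather than from the hOCR spec.

```python
from pathlib import Path

# Endings mentioned in this thread, longest first so that
# "page.hocr.html" is not treated as a plain ".html" file.
HOCR_ENDINGS = (".hocr.html", "_hocr.html", ".hocr", ".html")


def split_hocr_name(filename):
    """Return (base, ending) for a recognised hOCR file name, else (name, "")."""
    name = Path(filename).name
    for ending in HOCR_ENDINGS:
        # Require a non-empty base so a bare ".html" is not matched.
        if name.lower().endswith(ending) and len(name) > len(ending):
            return name[: -len(ending)], name[-len(ending):]
    return name, ""
```

With this, `foo.hocr.html`, `foo_hocr.html`, `foo.hocr` and `foo.html` all yield the base name `foo`.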
Maybe we should keep the current defaults, but add new parameters file_extension_hocr, file_extension_alto and so on which can override the default extension including the leading dot. Then everybody is free to choose any desired extension. With additional support for system wide and personal Tesseract configuration files this would also be easy to use.
Making that configurable is a good idea in any case.