The hocr output is not displayed as xhtml in Chrome

Open amitdo opened this issue 2 years ago • 9 comments

Current Behavior

Tesseract's hocr output is not displayed as xhtml in Chrome.

Firefox displays the hocr file as expected.

Expected Behavior

No response

Suggested Fix

If I change the file extension from hocr to xhtml the file is displayed as expected in Chrome.

tesseract -v

tesseract 5.3.1
 leptonica-1.83.1
  libgif 5.1.9 : libjpeg 8d (libjpeg-turbo 2.1.2) : libpng 1.6.37 : libtiff 4.3.0 : zlib 1.2.11 : libwebp 1.2.2 : libopenjp2 2.4.0
 Found AVX
 Found SSE4.1
 Found OpenMP 201511

Operating System

Ubuntu 22.04 Jammy

Other Operating System

No response

uname -a

Linux amit-desktop 5.19.0-38-generic #39~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Fri Mar 17 21:16:15 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

Compiler

gcc version 11.3.0 (Ubuntu 11.3.0-1ubuntu1~22.04)

CPU

No response

Virtualization / Containers

No response

Other Information

Chrome Version: 111.0.5563.146 (Official Build) snap (64-bit)

amitdo avatar Apr 02 '23 10:04 amitdo

hocr.zip

amitdo avatar Apr 02 '23 10:04 amitdo

I think using either .html or _hocr.html as a suffix would probably be more recognizable to users than .xhtml. @MerlijnWajer seems to be using _hocr.html, so presumably he would be in favor of that.

tfmorris avatar Nov 13 '23 19:11 tfmorris

I agree with the suggestion to change the file extension from .hocr to _hocr.html.

@stweil, @zdenop, your opinion?

CC: @kba, @bertsky

amitdo avatar Nov 14 '23 11:11 amitdo

Thanks for the mention, @tfmorris. I would like to add that I went for _hocr.html purely for archive.org internal reasons, to match archive.org's naming scheme (to indicate the contents of the document rather than just the scheme). I would be fine if Tesseract would like to use that here (it's clear), but I also see the virtues of just .html.

Notably, the hOCR spec says the following:

File extension(s):

    *.html, *.hocr

So that would probably rule out *.xhtml if one wanted to conform to the spec.

MerlijnWajer avatar Nov 14 '23 11:11 MerlijnWajer

> So that would probably rule out *.xhtml if one wanted to conform to the spec.

The .hocr extension is a problem when working with local files. When serving hOCR from a web server, setting the Content-Type to something browsers interpret as (X)HTML works, but if it has to be a local file, I tend to use .hocr.html myself.

In other words, should we change the spec to recommend the following file extensions, in order of preference: *.hocr.html, *.hocr.xhtml, *.html, *.xhtml, *.hocr?
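
For the web-server case mentioned above, here is a minimal sketch using Python's standard http.server module. It is only an illustration, not something Tesseract ships: the class name HocrHandler, the helper make_server, and the choice of application/xhtml+xml (text/html may work just as well) are assumptions made for this example.

```python
from functools import partial
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer


class HocrHandler(SimpleHTTPRequestHandler):
    """Static file handler that serves *.hocr with an XHTML Content-Type."""

    # Copy the default mapping and add an entry for .hocr files.
    # The MIME type is an assumption; "text/html" is another candidate.
    extensions_map = {
        **SimpleHTTPRequestHandler.extensions_map,
        ".hocr": "application/xhtml+xml",
    }


def make_server(directory=".", port=8000):
    """Build (but do not start) a local server rooted at `directory`."""
    handler = partial(HocrHandler, directory=str(directory))
    return ThreadingHTTPServer(("127.0.0.1", port), handler)


# Usage (blocks until interrupted):
#     make_server("ocr-output").serve_forever()
```

With such a server running, a browser requesting a .hocr file receives a Content-Type it can render, so the file itself does not need to be renamed.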

kba avatar Nov 14 '23 11:11 kba

Personally I'd prefer .html because a duplicate extension like .hocr.html is not very common and I am not sure how libraries which can split filenames would handle this special case if they are asked to get the extension.

We have an (unrelated) problem with the extension for ALTO XML as soon as we add support for PAGE XML (see https://github.com/JKamlah/tesseract/tree/PageExport) because both formats use .xml by default.

Maybe we should keep the current defaults, but add new parameters file_extension_hocr, file_extension_alto and so on which can override the default extension including the leading dot. Then everybody is free to choose any desired extension. With additional support for system wide and personal Tesseract configuration files this would also be easy to use.

stweil avatar Nov 14 '23 13:11 stweil

I share @stweil's concerns about multiple dots confusing things, which is why I suggested _hocr.html, but I'd be fine with just .html (IA has both _chocr.html and _hocr.html, which is not an issue if we're just generating a single file). As for adding additional naming parameters, it's pretty simple for a script / workflow to rename output files, so this is more about getting the default to be user friendly, whether those parameters exist or not.

Having said all that, I think it's a low priority issue because there is an easy workaround (renaming the file yourself).
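
A minimal sketch of that renaming workaround in Python, assuming the hOCR files sit directly in one directory. The function name rename_hocr and the default target suffix .html are choices made for this example, not anything Tesseract provides.

```python
from pathlib import Path


def rename_hocr(directory, new_suffix=".html"):
    """Rename every *.hocr file in `directory` to use `new_suffix`.

    Returns the list of new paths. Existing files are never overwritten.
    """
    renamed = []
    for src in sorted(Path(directory).glob("*.hocr")):
        dst = src.with_suffix(new_suffix)  # replaces only the last suffix
        if dst.exists():
            continue  # skip rather than clobber an existing file
        src.rename(dst)
        renamed.append(dst)
    return renamed
```

Calling rename_hocr("ocr-output") would turn page1.hocr into page1.html; passing new_suffix=".xhtml" gives the variant from the original report.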

tfmorris avatar Nov 14 '23 15:11 tfmorris

So I suggest keeping the .hocr extension for now because that requires the least effort and does not break compatibility with existing Tesseract releases.

stweil avatar Nov 14 '23 15:11 stweil

> I am not sure how libraries which can split filenames would handle this special case if they are asked to get the extension.

>>> from pathlib import Path
>>> Path('foo.hocr.html').suffix
'.html'
>>> Path('foo.hocr.html').suffixes
['.hocr', '.html']

So, it is easy to get this wrong. Users are still free to use the .hocr.html convention if they so choose, even if we recommend *.html as the preferred extension.
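
One way a tool could avoid that pitfall is to match against an explicit list of known hOCR extensions, longest first, instead of relying on suffix alone. This is only a sketch; the helper name split_hocr_name and the extension list (taken from the candidates discussed in this thread) are assumptions.

```python
from pathlib import Path

# Compound extensions come first so ".hocr.html" wins over ".html".
KNOWN_HOCR_EXTENSIONS = (".hocr.html", ".hocr.xhtml", ".html", ".xhtml", ".hocr")


def split_hocr_name(filename):
    """Split a file name into (base, extension) using the known list.

    Returns (name, "") when no known extension matches.
    """
    name = Path(filename).name
    lowered = name.lower()
    for ext in KNOWN_HOCR_EXTENSIONS:
        if lowered.endswith(ext) and len(name) > len(ext):
            return name[: -len(ext)], name[-len(ext):]
    return name, ""
```

With this helper, foo.hocr.html splits into the base foo and the extension .hocr.html, whereas Path.suffix alone would report only .html.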

> Maybe we should keep the current defaults, but add new parameters file_extension_hocr, file_extension_alto and so on which can override the default extension including the leading dot. Then everybody is free to choose any desired extension. With additional support for system wide and personal Tesseract configuration files this would also be easy to use.

Making this configurable is a good idea in any case.

kba avatar Nov 14 '23 15:11 kba