
ALTO output: Missing <SP> tags between <String> tags

Open jbarth-ubhd opened this issue 6 years ago • 24 comments

Perhaps this is not an error. Kind regards, J. Barth

jbarth-ubhd avatar Dec 22 '17 08:12 jbarth-ubhd

Can you provide sample data and how you ran the tool?

kba avatar Dec 22 '17 08:12 kba

I guess you output the ALTO files directly from ABBYY, because we don't yet provide a transformation from ABBYY to ALTO. Then this should be an example: https://digi.bib.uni-mannheim.de/~stweil/ocr-praxis/Testseiten/alto/417576986_0031.xml . The <SP> stands AFAIK for space and it does validate in this form.

zuphilip avatar Dec 22 '17 09:12 zuphilip

Yes, I'll try to find out if <SP> (=space) is really necessary between <String>s in ALTO.

jbarth-ubhd avatar Dec 22 '17 09:12 jbarth-ubhd

I guess that it still validates without the SP tags. Moreover, most of the information (HPOS, WIDTH) can be calculated from the line above and below, but if the width of a space is important for some application, then it might be easier to have this data directly. I don't know what the VPOS information for a space says or whether it is also determined by some other values.

zuphilip avatar Dec 22 '17 09:12 zuphilip

In the ALTO 2.1 .xsd it looks like this:

  <xsd:sequence maxOccurs="unbounded">
    <xsd:element name="String" type="StringType"/>
    <xsd:element name="SP" minOccurs="0"> ...
    </xsd:element>
  </xsd:sequence>

So strictly speaking it seems that <SP> is not necessary, but the <sequence> seems to imply it.

jbarth-ubhd avatar Dec 22 '17 10:12 jbarth-ubhd

but the <sequence> seems to imply it.

Not sure. I only see here, that, if <SP> occurs, then it has to occur after a <String>.
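
To make that reading concrete, here is a minimal Python sketch that checks the children of a line against the quoted fragment only, i.e. one or more repetitions of a <String> optionally followed by an <SP>. The function names and the tiny sample lines are made up for illustration, and the full ALTO schema allows more than this fragment shows.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace prefix such as '{http://...}String'."""
    return tag.rsplit("}", 1)[-1]


def follows_quoted_sequence(text_line):
    """Check children against (String, SP?)+ from the quoted fragment only."""
    previous = None
    for child in text_line:
        name = local_name(child.tag)
        if name not in ("String", "SP"):
            return False  # other elements are outside the quoted fragment
        if name == "SP" and previous != "String":
            return False  # an SP may only directly follow a String
        previous = name
    return previous is not None  # the sequence must occur at least once


# Hypothetical sample lines, not taken from any real ALTO file.
with_sp = ET.fromstring('<TextLine><String/><SP/><String/></TextLine>')
without_sp = ET.fromstring('<TextLine><String/><String/></TextLine>')
leading_sp = ET.fromstring('<TextLine><SP/><String/></TextLine>')

print(follows_quoted_sequence(with_sp))     # True
print(follows_quoted_sequence(without_sp))  # True
print(follows_quoted_sequence(leading_sp))  # False
```

Under this reading, lines with and without <SP> both satisfy the fragment; only a misplaced <SP> is rejected.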

zuphilip avatar Dec 22 '17 10:12 zuphilip

Here is an ALTO file generated with Tesseract (see https://github.com/tesseract-ocr/tesseract/pull/2067). Another page was processed by ABBYY Finereader.

While ABBYY adds the <SP> tags, Tesseract (and ocr-fileformat) does not. As the <String> tags contain the surrounding box positions and the distance between two text boxes can be calculated without additional information, that looks sufficient at first glance. But without the <SP> the DFG viewer does not separate the words!

I am not sure whether this is a bug of the DFG viewer (and Kitodo Presentation) or whether ALTO requires explicit tags for the whitespace between words. Perhaps @sebastian-meyer or @cneud know the answer?

stweil avatar Nov 22 '18 21:11 stweil

The ALTO documentation says "A TextBlock is divided into lines and those are divided into strings, spaces and hyphens". I don't interpret that as a strict requirement for spaces, nor does the .xsd. It's clear that spaces are required if the strings are given without HPOS and WIDTH attributes, but I think they are redundant if those attributes are available.
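
As a rough illustration of that redundancy, the horizontal gap between neighbouring <String> elements can be derived from HPOS and WIDTH alone. This is only a sketch: the helper names and the sample coordinates are invented, and it assumes both attributes are present on every <String>.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace prefix such as '{http://...}String'."""
    return tag.rsplit("}", 1)[-1]


def inter_word_gaps(text_line):
    """Return the horizontal gap between each pair of neighbouring Strings."""
    strings = [c for c in text_line if local_name(c.tag) == "String"]
    gaps = []
    for left, right in zip(strings, strings[1:]):
        left_end = float(left.get("HPOS")) + float(left.get("WIDTH"))
        # A negative value means the two bounding boxes overlap.
        gaps.append(float(right.get("HPOS")) - left_end)
    return gaps


# Hypothetical coordinates, chosen only to show both cases.
SAMPLE = """<TextLine>
  <String CONTENT="Missing" HPOS="100" VPOS="50" WIDTH="80" HEIGHT="20"/>
  <String CONTENT="spaces" HPOS="190" VPOS="50" WIDTH="70" HEIGHT="20"/>
  <String CONTENT="here" HPOS="255" VPOS="50" WIDTH="40" HEIGHT="20"/>
</TextLine>"""

print(inter_word_gaps(ET.fromstring(SAMPLE)))  # [10.0, -5.0]
```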

stweil avatar Nov 22 '18 21:11 stweil

The ALTO spec itself needs to clarify this issue.

amitdo avatar Nov 23 '18 10:11 amitdo

Clemens has created an issue for that: https://github.com/altoxml/schema/issues/54 (thank you).

stweil avatar Nov 23 '18 11:11 stweil

Thanks for flagging this, I will put it on the agenda for our next ALTO board call which will be held November 29th.

cneud avatar Nov 23 '18 11:11 cneud

To chip in, I've interpreted the standard to mean that the <SP><String> alternation is mandatory (sequence definition of <TextLine> contents) and that whitespace should never occur inside a <String>, and this is how I implemented it.

mittagessen avatar Nov 24 '18 19:11 mittagessen

If a <String> never contains whitespace, then <SP> is completely redundant. Does ALTO allow overlapping words in a row? If yes, does that require a separating space with negative width? :-)

stweil avatar Nov 24 '18 20:11 stweil

If a <String> never contains whitespace, then <SP> is completely redundant.

Why? Whitespace is a character like any other and personally I would've taken the decision to encode it explicitly using <String> if the standard didn't heavily imply that you shouldn't do that. Of course, you can throw away the data and let people compute inter-word spacing implicitly provided through word bounding boxes, but it isn't like tesseract, kraken or any other sequence classification based OCR engine doesn't output a label for whitespace (and the boundaries of that activation can almost certainly differ from the boundary of the activations of the adjacent letters). I'd rather not throw away metadata that some weird subdiscipline in the humanities that only the 8 people participating in it have ever heard about might need.

Does ALTO allow overlapping words in a row?

ALTO luckily allows overlapping elements in contrast to PageXML.

mittagessen avatar Nov 24 '18 20:11 mittagessen

Then how would you encode two overlapping words if you are forced to put a <SP> between them?

stweil avatar Nov 24 '18 21:11 stweil

Just have overlapping bounding boxes? Presumably there is still a reading order that determines the ordering of the <String> tags. But yeah it helps that I decided a long time ago that words are a waaay too squishy concept and arbitrarily defined anything bounded by whitespace as a separate word/segment for serialization purposes (not only for ALTO). Of course, if you want to encode a proper tokenization, this data model shouldn't be used. On the other hand, I'm of firm conviction that starting to do that in a raw OCR serialization format is only going to lead to madness.

mittagessen avatar Nov 24 '18 23:11 mittagessen

Just to follow up - I'm afraid a quick resolution is not really around the corner... the issue was discussed in the last ALTO board call, with the core elements of the discussion summarized here.

While the general feeling was that the use of <SP> is not mandatory, some more research into ALTO's history is required to determine the original authors' exact intentions.

An expansion of the <SP> tag with a width attribute has been identified among board members as a possibility to create more useful future applications for the <SP> tag.

If one really wants to be on the safe side, the quick solution right now would be to indeed include <SP> in the output of any ALTO export implementation as it is also straightforward to remove in post-processing.
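
For the removal side, a post-processing step could look roughly like the following Python sketch. It is only an illustration: `strip_sp` is a made-up helper name, elements are matched by local name so that any ALTO namespace is ignored, and the sample line is invented.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace prefix such as '{http://...}String'."""
    return tag.rsplit("}", 1)[-1]


def strip_sp(root):
    """Remove every SP element below root and return how many were removed."""
    removed = 0
    # Materialise the traversal first so the tree can be modified safely.
    for parent in list(root.iter()):
        for child in list(parent):
            if local_name(child.tag) == "SP":
                parent.remove(child)
                removed += 1
    return removed


# Hypothetical sample line.
line = ET.fromstring(
    '<TextLine><String CONTENT="a"/><SP WIDTH="10"/>'
    '<String CONTENT="b"/><SP WIDTH="12"/><String CONTENT="c"/></TextLine>'
)
print(strip_sp(line))                      # 2
print([local_name(c.tag) for c in line])   # ['String', 'String', 'String']
```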

cneud avatar Dec 13 '18 16:12 cneud

ALTO's history is required to determine the original authors' exact intentions.

As a note, most of the character-based classification systems common at the time ALTO was originally specified didn't treat whitespace as a proper glyph, i.e. whitespace is just something bordered by other glyphs and is never seen by the classifier as such. This at least explains the existence of a separate <SP> tag.

mittagessen avatar Dec 13 '18 16:12 mittagessen

Thank you, @cneud, @mittagessen and the ALTO board.

As the current DFG viewer expects the <SP> tags, I think that programs like ocr-transform should produce them, too. Pull request https://github.com/tesseract-ocr/tesseract/pull/2117 adds the tags to Tesseract's new ALTO output, so that output is now compatible with the DFG viewer.

stweil avatar Dec 13 '18 16:12 stweil

The addition of the <SP> should be handled upstream in the corresponding transformation. Currently, we use hocr2alto and page2alto. We can keep this issue here open as a reminder.
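
What such a step might do, independent of the actual hocr2alto or page2alto code, is sketched below in Python. It is a geometric heuristic under stated assumptions: every <String> carries HPOS, WIDTH and VPOS, the <SP> simply fills the horizontal gap, and its VPOS is copied from the preceding <String>. The function name `insert_sp` and the sample coordinates are hypothetical.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace prefix such as '{http://...}String'."""
    return tag.rsplit("}", 1)[-1]


def insert_sp(text_line):
    """Insert an SP between directly adjacent Strings; return the count."""
    inserted = 0
    children = list(text_line)
    for left, right in zip(children, children[1:]):
        if local_name(left.tag) != "String" or local_name(right.tag) != "String":
            continue  # only fill gaps between two neighbouring Strings
        hpos = float(left.get("HPOS")) + float(left.get("WIDTH"))
        width = float(right.get("HPOS")) - hpos
        # Reuse the namespace part of the String tag for the new SP element.
        sp = ET.Element(left.tag[: -len("String")] + "SP")
        sp.set("HPOS", str(hpos))
        sp.set("VPOS", left.get("VPOS"))  # assumption: same top edge as the left word
        sp.set("WIDTH", str(width))
        text_line.insert(list(text_line).index(right), sp)
        inserted += 1
    return inserted


# Hypothetical sample line.
line = ET.fromstring(
    '<TextLine>'
    '<String CONTENT="Missing" HPOS="100" VPOS="50" WIDTH="80"/>'
    '<String CONTENT="spaces" HPOS="190" VPOS="50" WIDTH="70"/>'
    '</TextLine>'
)
print(insert_sp(line))                     # 1
print([local_name(c.tag) for c in line])   # ['String', 'SP', 'String']
```

Overlapping boxes would yield a negative WIDTH here; whether that is acceptable is exactly the kind of question the spec leaves open.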

zuphilip avatar Dec 30 '19 13:12 zuphilip

According to the ALTO XSD the SP tag is optional - minOccurs="0"

And I do not see a way to reliably calculate HEIGHT/WIDTH/VPOS/HPOS attributes from the hOCR data for the SP tag.

IMHO, proper handling of the optional SP tag should be fixed in the DFG viewer.

filak avatar Jan 02 '20 16:01 filak

If the <SP> is not mandatory, we have to "ignore" it in the styles of the fulltext view and always insert a space after a <String>.

This is what I've done in the DFG-Viewer styles now. Please have a look at the current master of the DFG-Viewer at test.dfg-viewer.de.

Please compare the example from above in current master and in version 5.0 of DFG-Viewer and report change requests.
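
The same idea, expressed outside the viewer's styles, might look like this Python sketch. It is not the DFG-Viewer implementation, just an illustration of ignoring <SP> and always separating <String> contents with a space; `line_text` and the sample lines are invented.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace prefix such as '{http://...}String'."""
    return tag.rsplit("}", 1)[-1]


def line_text(text_line):
    """Join the CONTENT of all Strings with single spaces, ignoring SP."""
    words = [
        child.get("CONTENT", "")
        for child in text_line
        if local_name(child.tag) == "String"
    ]
    return " ".join(words)


# Hypothetical sample lines: one with SP elements, one without.
with_sp = ET.fromstring(
    '<TextLine><String CONTENT="Missing"/><SP/>'
    '<String CONTENT="spaces"/><SP/><String CONTENT="here"/></TextLine>'
)
without_sp = ET.fromstring(
    '<TextLine><String CONTENT="Missing"/>'
    '<String CONTENT="spaces"/><String CONTENT="here"/></TextLine>'
)

print(line_text(with_sp))     # Missing spaces here
print(line_text(without_sp))  # Missing spaces here
```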

albig avatar Jan 03 '20 14:01 albig

@albig IMHO the second one seems better from a user perspective - it is more readable/compact.

filak avatar Jan 03 '20 15:01 filak

@albig IMHO the spacing looks better now (in master), but the linebreaks seem a bit random...

sebastian-meyer avatar Jan 06 '20 15:01 sebastian-meyer