
Always do serialization in little endian order

Open amitdo opened this issue 2 years ago • 13 comments

https://github.com/tesseract-ocr/tesseract/issues/518#issuecomment-277514434

@stweil commented on 5 Feb 2017

There are different approaches possible to get support for big endian machines:

  1. Write training data files in native endian byte order. When reading that data, Tesseract must automatically detect the endianness used and convert it to native byte order if necessary, that is, whenever the training (writing) machine and the OCR (reading) machine use different byte orders.
  2. Write training data files with a fixed endianness. When reading that data, Tesseract must only convert it if the host uses a different byte order.
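As an editorial illustration, the first (detect-and-swap) approach might look like the sketch below; the magic value and helper names are assumptions, not Tesseract's actual file format:

```cpp
#include <cstdint>

// Hypothetical sketch of variant 1: detect the writer's byte order from a
// known magic value and set a `swap` flag for all subsequent reads.
constexpr uint32_t kMagic = 0x54455353;  // "TESS", an assumed marker

inline uint32_t byteswap32(uint32_t v) {
  return (v >> 24) | ((v >> 8) & 0xff00) | ((v << 8) & 0xff0000) | (v << 24);
}

// Returns true and sets `swap` if `raw` matches the magic in either byte order.
inline bool detect_swap(uint32_t raw, bool &swap) {
  if (raw == kMagic) { swap = false; return true; }
  if (byteswap32(raw) == kMagic) { swap = true; return true; }
  return false;  // not a recognizable file
}

// Every integer read must then consult the flag:
inline uint32_t read32(uint32_t raw, bool swap) {
  return swap ? byteswap32(raw) : raw;
}
```

The cost of this variant is that the `swap` flag has to be threaded through every read function, which is exactly the complexity the second variant avoids.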

The current code obviously tries to implement the first variant: the functions which read data take a `swap` parameter, and its value is set based on the training data.

I prefer the second variant and suggest always using little endian training data files. Then the most common little endian platforms can use the data without any byte swaps, and big endian hosts can use fixed code to convert the data when reading or writing it. This results in less complex code: the `swap` parameters are no longer needed.

amitdo avatar Sep 06 '22 15:09 amitdo

https://github.com/tesseract-ocr/tesseract/pull/1784#issue-342279018

@stweil commented on 18 Jul 2018

With the new API it will be very easy to switch to a fixed little endian file format as soon as it is used for all serialization code.

amitdo avatar Sep 06 '22 15:09 amitdo

The only big endian architecture that is supported by LTS Linux distros is s390x (IBM Mainframe).

amitdo avatar Sep 06 '22 15:09 amitdo

IMO, we can just drop the big endian support.

amitdo avatar Sep 08 '22 17:09 amitdo

Tesseract has no SIMD support for s390x, so it's a huge waste of time and money to run Tesseract on IBM mainframes.

amitdo avatar Sep 13 '22 07:09 amitdo

@stweil, can you comment about this issue?

amitdo avatar Sep 13 '22 10:09 amitdo

@stweil Does this mean (nearly) all existing training files are in little endian order? Thus no problems with existing training files except on IBM mainframes?

wollmers avatar Sep 13 '22 13:09 wollmers

> @stweil, can you comment about this issue?

Yes, sure.

> Does this mean (nearly) all existing training files are in little endian order? Thus no problems with existing training files except on IBM mainframes?

Yes, all Tesseract training data which I know of uses little endian order. There are other architectures which use big endian, for example Sparc, which I used for my tests in the past, but also OpenRISC. And other hardware supports both big and little endianness, for example ARM and MIPS.

> Tesseract has no SIMD support for s390x, so it's a huge waste of time and money to run Tesseract on IBM mainframes.

That's only part of the story. There is no special code for s390x, but src/arch/dotproduct.cpp is pure C(++) code which can be nearly as fast as hand-optimized special code. Depending on the compiler, it will use SIMD instructions, too.
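For illustration, a loop in this style (an editorial sketch, not the actual dotproduct.cpp code) is the kind that modern compilers auto-vectorize at higher optimization levels such as -O2/-O3, which is why the plain C++ path can still use SIMD instructions on s390x:

```cpp
#include <cstddef>

// Plain C++ dot product in the spirit of src/arch/dotproduct.cpp: a simple
// accumulation loop with no data dependencies between iterations beyond the
// sum, which compilers can turn into vector instructions automatically.
double dot_product(const double *u, const double *v, std::size_t n) {
  double total = 0.0;
  for (std::size_t i = 0; i < n; ++i) {
    total += u[i] * v[i];
  }
  return total;
}
```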

> IMO, we can just drop the big endian support.

In my opinion it should be possible to run Tesseract on big endian hosts. I'd prefer the solution which I already explained in https://github.com/tesseract-ocr/tesseract/issues/518#issuecomment-277514434 and use little endian data everywhere. That can be achieved in several steps:

  1. Drop the endianness check and swapping when reading data on LE machines. That makes the code smaller and (very slightly) faster for the majority of platforms. We only lose support for (non-existing) models which were created on a big endian machine. Such models must be rejected with an error message.
  2. Use unconditional swapping when reading data on BE machines. That also makes the code smaller, and again we only lose support for (non-existing) models which were created on a big endian machine.
  3. Use unconditional swapping when writing data on BE machines. Then those machines will write LE data, and everything is fine again.
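The steps above amount to replacing the per-call swap flag with a compile-time decision. A rough sketch of the idea (names are illustrative, not Tesseract's API):

```cpp
#include <cstdint>

// Sketch: files are always little endian, so LE hosts read and write values
// unchanged, while BE hosts swap unconditionally. The decision depends only
// on the host, so the per-call `swap` parameter disappears.

inline bool host_is_little_endian() {
  const uint16_t probe = 1;
  return *reinterpret_cast<const uint8_t *>(&probe) == 1;
}

inline uint32_t byteswap32(uint32_t v) {
  return (v >> 24) | ((v >> 8) & 0xff00) | ((v << 8) & 0xff0000) | (v << 24);
}

// Convert between host order and the fixed little endian file order. The
// same function serves reading and writing, since swapping is an involution.
inline uint32_t to_from_le32(uint32_t v) {
  return host_is_little_endian() ? v : byteswap32(v);
}
```

On C++20 and later, `std::endian` from `<bit>` could make the host check a constant expression instead of the pointer probe used here.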

stweil avatar Sep 13 '22 13:09 stweil

Since we are going to keep the BE support, maybe we should test it regularly with GitHub Actions using QEMU.

https://github.com/uraimo/run-on-arch-action

amitdo avatar Sep 13 '22 14:09 amitdo

QEMU. 😄 That's how I did my first tests. QEMU uses emulation, and a test requires a full build inside the emulation, which takes a lot of time (and energy!). Isn't testing it randomly every few years and relying on user reports enough? I doubt that there are critical applications using Tesseract OCR on big endian machines which would justify more effort.

stweil avatar Sep 13 '22 14:09 stweil

> I doubt that there are critical applications using Tesseract OCR on big endian machines which would justify more effort.

That's why I thought we could drop it...

amitdo avatar Sep 13 '22 14:09 amitdo

> Drop endianness check and swapping when reading data on LE machines.... models which were created on a big endian machine... must be rejected with an error message.

How can you reject them without checking the endianness first?

amitdo avatar Sep 20 '22 12:09 amitdo

OK, I think I found the answer. The endianness only needs to be checked once, during model loading.

https://github.com/tesseract-ocr/tesseract/blob/d1912d70100d6e13ff0241c478735bbca7256d1e/src/ccutil/tessdatamanager.cpp#L120
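A hypothetical sketch of such a one-time check at load time, using an assumed header field and plausibility bound (not Tesseract's actual format; the real check lives in tessdatamanager.cpp):

```cpp
#include <cstdint>

// Validate a header field once (here, an entry count assumed to be small)
// and decide up front whether the whole file is native order, needs
// swapping, or must be rejected. Field names and limits are illustrative.
constexpr uint32_t kMaxEntries = 512;  // assumed plausibility bound

inline uint32_t byteswap32(uint32_t v) {
  return (v >> 24) | ((v >> 8) & 0xff00) | ((v << 8) & 0xff0000) | (v << 24);
}

enum class LoadDecision { kNative, kSwapped, kReject };

inline LoadDecision check_header(uint32_t entry_count) {
  if (entry_count > 0 && entry_count <= kMaxEntries) {
    return LoadDecision::kNative;
  }
  const uint32_t swapped = byteswap32(entry_count);
  if (swapped > 0 && swapped <= kMaxEntries) {
    return LoadDecision::kSwapped;  // under step 1, rejected with an error instead
  }
  return LoadDecision::kReject;  // not a valid model file
}
```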

amitdo avatar Sep 20 '22 14:09 amitdo

@stweil,

I don't think we should bother with steps 2 and 3. What about doing just step 1?

Another option is to keep the current code without doing any change related to endianness.

amitdo avatar Sep 28 '22 09:09 amitdo