Always do serialization in little endian order
https://github.com/tesseract-ocr/tesseract/issues/518#issuecomment-277514434
@stweil commented on 5 Feb 2017
There are different approaches possible to get support for big endian machines:
- Write training data files in native endian byte order. When reading that data, Tesseract must automatically detect the endianness used and convert it to native byte order if necessary, i.e. whenever the training (writing) machine and the OCR (reading) machine use different byte orders.
- Write training data files with fixed endianness. When reading that data, Tesseract must only convert it if the host uses a different byte order.
The current code obviously tries to implement the first variant: it uses swap parameters for the functions which read data and sets the value of swap based on the training data.
I prefer the second variant and suggest always using little endian training data files. Then the most common little endian platforms can use the data without any byte swaps. Big endian hosts can use fixed code to convert the data when reading or writing. This results in less complex code: the swap parameters are no longer needed.
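To make the contrast concrete, here is a minimal sketch of what the two variants look like in code. The helper names are hypothetical, not Tesseract's actual API, and the byte swap uses the GCC/Clang builtin:

```cpp
#include <cstdint>
#include <cstdio>

// Variant 1 (current code): a runtime swap flag, derived from the
// training data itself, has to be threaded through every reader.
static bool ReadInt32Swappable(FILE *f, bool swap, int32_t *value) {
  if (fread(value, sizeof *value, 1, f) != 1) return false;
  if (swap) *value = __builtin_bswap32(*value);  // GCC/Clang builtin
  return true;
}

// Variant 2 (proposed): the file format is fixed to little endian, so
// whether to swap is a compile-time property of the host and no flag
// is needed anywhere in the reading code.
static bool ReadInt32LE(FILE *f, int32_t *value) {
  if (fread(value, sizeof *value, 1, f) != 1) return false;
#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
  *value = __builtin_bswap32(*value);
#endif
  return true;
}
```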
https://github.com/tesseract-ocr/tesseract/pull/1784#issue-342279018
@stweil commented on 18 Jul 2018
With the new API it will be very easy to switch to a fixed little endian file format as soon as it is used for all serialization code.
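The reason this becomes easy is that a single serialization entry point concentrates the endianness decision in one place. A rough sketch of the idea, with hypothetical names (Tesseract's real serialization funnels through its TFile class):

```cpp
#include <algorithm>
#include <cstdio>

// Hypothetical funnel: if every scalar read in the code base goes
// through one template like this, fixing the file format to little
// endian means changing only this function.
template <typename T>
bool DeSerializeLE(FILE *f, T *value) {
  if (fread(value, sizeof(T), 1, f) != 1) return false;
#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
  // File data is little endian; reverse the bytes on big endian hosts.
  char *bytes = reinterpret_cast<char *>(value);
  std::reverse(bytes, bytes + sizeof(T));
#endif
  return true;
}
```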
The only big endian architecture that is supported by LTS Linux distros is s390x (IBM Mainframe).
IMO, we can just drop the big endian support.
Tesseract has no SIMD support for s390x, so it's a huge waste of time and money to run Tesseract on IBM mainframes.
@stweil, can you comment about this issue?
@stweil Does this mean (nearly) all existing training files are in little endian order? Thus no problems with existing training files, except for IBM mainframes?
> @stweil, can you comment about this issue?
Yes, sure.
> Does this mean (nearly) all existing training files are in little endian order? Thus no problems with existing training files, except for IBM mainframes?
Yes, all Tesseract training data which I know of uses little endian order. There are some more architectures which use big endian, for example Sparc which I used for my tests in the past, but also OpenRISC. And other hardware supports both big and little endianness, for example ARM and MIPS.
> Tesseract has no SIMD support for s390x, so it's a huge waste of time and money to run Tesseract on IBM mainframes.
That's only a part of the story. There is no special code for s390x, but src/arch/dotproduct.cpp is pure C(++) code which can be nearly as fast as optimized special code. Depending on the compiler, it will use SIMD instructions, too.
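For reference, the function in question is essentially a plain loop; this is simplified from src/arch/dotproduct.cpp, so the real file may differ in details:

```cpp
// Computes the dot product of the two n-vectors u and v. With
// optimization enabled (and -ffast-math to allow reordering the
// floating-point reduction), GCC and Clang can auto-vectorize this
// loop using whatever SIMD instructions the target provides.
double DotProductNative(const double *u, const double *v, int n) {
  double total = 0.0;
  for (int k = 0; k < n; ++k) {
    total += u[k] * v[k];
  }
  return total;
}
```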
> IMO, we can just drop the big endian support.
In my opinion it should be possible to run Tesseract on big endian hosts. I'd prefer the solution which I already explained in https://github.com/tesseract-ocr/tesseract/issues/518#issuecomment-277514434 and use little endian data everywhere. That can be achieved in several steps (a sketch follows the list):
- Drop endianness check and swapping when reading data on LE machines. That makes the code smaller and (very slightly) faster for the majority of platforms. We only lose support for (non-existing) models which were created on a big endian machine. Such models must be rejected with an error message.
- Use unconditional swapping when reading data on BE machines. That also makes the code smaller, and we lose support for (non-existing) models which were created on a big endian machine.
- Use unconditional swapping when writing data on BE machines. Then those machines will write LE data, and everything is fine again.
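As an illustration of the last step, a byte-wise writer sidesteps having to detect the host's endianness at all. This is a hypothetical helper, not Tesseract code:

```cpp
#include <cstdint>
#include <cstdio>

// Writes a 32-bit value in little endian order on any host, byte by
// byte, so no endianness detection or conditional compilation is
// needed.
static bool WriteInt32LE(FILE *f, uint32_t value) {
  unsigned char bytes[4] = {
      static_cast<unsigned char>(value & 0xFFu),
      static_cast<unsigned char>((value >> 8) & 0xFFu),
      static_cast<unsigned char>((value >> 16) & 0xFFu),
      static_cast<unsigned char>((value >> 24) & 0xFFu),
  };
  return fwrite(bytes, 1, sizeof bytes, f) == sizeof bytes;
}
```

On little endian hosts an optimizing compiler typically collapses this to a single 4-byte store, so the common platforms pay nothing for the portability.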
Since we are going to keep the BE support, maybe we should test it regularly with GitHub Actions using QEMU.
https://github.com/uraimo/run-on-arch-action
QEMU. 😄 That's how I did my first tests. QEMU uses emulation, and a test requires a full build inside the emulation which takes a lot of time (and energy!). Isn't testing it randomly every few years and relying on user reports enough? I doubt that there are critical applications using Tesseract OCR on big endian machines which would justify more efforts.
> I doubt that there are critical applications using Tesseract OCR on big endian machines which would justify more efforts.
That's why I thought we can drop it...
> Drop endianness check and swapping when reading data on LE machines. ... models which were created on a big endian machine ... must be rejected with an error message.
How can you reject them without checking the endianness first?
OK, I think I found the answer. Only check once during model loading.
https://github.com/tesseract-ocr/tesseract/blob/d1912d70100d6e13ff0241c478735bbca7256d1e/src/ccutil/tessdatamanager.cpp#L120
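The linked code illustrates the trick: the first field of a traineddata file is a small entry count, so a value that is only plausible after byte swapping betrays a file written in the opposite byte order. A rough sketch of that one-time check, with a hypothetical bound and function name (see the linked tessdatamanager.cpp for the real logic):

```cpp
#include <cstdint>
#include <cstdio>

constexpr int32_t kMaxEntries = 512;  // hypothetical sanity bound

// Called once when a model is loaded. Returns 0 if the leading entry
// count is plausible in the host's byte order, 1 if it is plausible
// only after byte swapping (i.e. the file was written with the
// opposite endianness), and -1 if the file is corrupt or unreadable.
static int CheckModelEndianness(FILE *f) {
  int32_t num_entries;
  if (fread(&num_entries, sizeof num_entries, 1, f) != 1) return -1;
  if (num_entries >= 0 && num_entries <= kMaxEntries) return 0;
  int32_t swapped = __builtin_bswap32(num_entries);  // GCC/Clang builtin
  if (swapped >= 0 && swapped <= kMaxEntries) return 1;
  return -1;
}
```

On a little endian host, a return value of 1 identifies a big-endian-written model, which step 1 above would reject with an error message instead of setting a swap flag for every subsequent read.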
@stweil, I don't think we should bother with steps 2 and 3. What about doing just step 1?
Another option is to keep the current code without doing any change related to endianness.