tesseract.js
Huge performance gap between C++ version and WASM version
Hello,
I recently ran a performance comparison between the latest alpha release of Tesseract, v5.0.0-alpha (C++), and the latest release of tesseract.js v2 (Node/TypeScript).
The JS/WASM port appears to be about 30 times slower than the compiled C++ executable when processing the attached image.
On the same platform (Windows 10, 64-bit), with 1 worker in both cases:
- tesseract input_test_3.png out -l fra txt hocr => 4 secs
- tesseract.js (run with Node.js 12.4.1) => 120 secs
Both timings include the initialization and model loading steps, but most of the gap seems to occur during OCR processing.
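To check where the time actually goes, each phase can be timed separately. The following is a minimal sketch, not the benchmark used above: `timePhases` and `Phase` are hypothetical helper names introduced here for illustration, and the commented usage assumes the tesseract.js v2 worker calls (`load`, `loadLanguage`, `initialize`, `recognize`, `terminate`).

```typescript
import { performance } from "perf_hooks";

// Hypothetical helper type: a named async step to be timed.
type Phase = { name: string; run: () => Promise<unknown> };

// Runs the phases one after another and returns the elapsed
// wall-clock time of each phase in milliseconds, keyed by name.
async function timePhases(phases: Phase[]): Promise<Record<string, number>> {
  const timings: Record<string, number> = {};
  for (const phase of phases) {
    const start = performance.now();
    await phase.run();
    timings[phase.name] = performance.now() - start;
  }
  return timings;
}

// Usage sketch, assuming the tesseract.js v2 worker API:
//
//   import { createWorker } from "tesseract.js";
//   const worker = createWorker();
//   const timings = await timePhases([
//     { name: "load", run: () => worker.load() },
//     { name: "loadLanguage", run: () => worker.loadLanguage("fra") },
//     { name: "initialize", run: () => worker.initialize("fra") },
//     { name: "recognize", run: () => worker.recognize("input_test_3.png") },
//   ]);
//   await worker.terminate();
//   console.log(timings);
```

Comparing the `recognize` entry alone against the native run would show whether the gap really comes from OCR processing rather than from startup.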
This test image is intentionally a poor-quality scan, meant to test how resilient OCR quality is, but the 30x performance ratio seems to hold across all the tests I performed.
Is a gap this large expected with this WASM port? According to published reports (e.g. https://arxiv.org/pdf/1901.09056.pdf), the WASM performance penalty versus native is expected to be up to 3x, not 30x.
Best Regards
Does it work in both cases though? How do you manage to associate the extracted data of each row with the right column?
Closing as the significant performance disparity between C++ and wasm was resolved in the latest release 3.0.0. Recognition speeds using the .wasm version should be in the same ballpark as the desktop version now.
Note: anything run on Safari will continue to be significantly slower as Apple has not yet implemented SIMD support. If you're reading this in the future, the following link should show if that has changed yet.
https://webassembly.org/roadmap/
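Whether a given runtime supports WASM SIMD can also be checked directly. This is a minimal sketch, not necessarily how tesseract.js itself decides: `hasWasmSimd` is a hypothetical name, and the bytes are a hand-assembled minimal module whose only function uses SIMD instructions, so `WebAssembly.validate` should accept it only where SIMD is implemented.

```typescript
// Minimal WebAssembly module containing one function that returns a v128
// value built with SIMD instructions (i8x16.splat, i8x16.popcnt).
const SIMD_PROBE = new Uint8Array([
  0, 97, 115, 109, 1, 0, 0, 0,            // "\0asm" magic + version 1
  1, 5, 1, 96, 0, 1, 123,                 // type section: () -> v128
  3, 2, 1, 0,                             // function section: one function of type 0
  10, 10, 1, 8, 0, 65, 0, 253, 15, 253, 98, 11, // code: i32.const 0, i8x16.splat, i8x16.popcnt, end
]);

// Hypothetical helper: true when the runtime validates the SIMD probe module.
function hasWasmSimd(): boolean {
  return typeof WebAssembly === "object" && WebAssembly.validate(SIMD_PROBE);
}

console.log(`WASM SIMD supported: ${hasWasmSimd()}`);
```

Running this in the browser console or in Node shows whether the faster SIMD path can apply in that environment.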