tesserocr
does not compile against libtesseract anymore
With the current master, I cannot pip install anymore:
Building wheel for tesserocr (setup.py) ... error
ERROR: Command errored out with exit status 1:
command: /data/venv/bin/python3 -u -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-req-build-yyb85qtw/setup.py'"'"'; __file__='"'"'/tmp/pip-req-build-yyb85qtw/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d /tmp/pip-wheel-e_hk2brj
cwd: /tmp/pip-req-build-yyb85qtw/
Complete output (418 lines):
Supporting tesseract v5.0.0-alpha-622-g7d94
Tesseract major version 5
Configs from pkg-config: {'library_dirs': ['/usr/local/lib', '/usr/local/lib'], 'include_dirs': ['/usr/local/include', '/usr/local/include', '/usr/local/include'], 'libraries': ['tesseract', 'archive', 'curl', 'lept'], 'compile_time_env': {'TESSERACT_MAJOR_VERSION': 5, 'TESSERACT_VERSION': 1234798114}}
running bdist_wheel
running build
running build_ext
Detected compiler: unix
building 'tesserocr' extension
x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -I/usr/local/include -I/usr/local/include -I/usr/local/include -I/data/venv/include -I/usr/include/python3.6m -c tesserocr.cpp -o build/temp.linux-x86_64-3.6/tesserocr.o -std=c++11 -DUSE_STD_NAMESPACE
tesserocr.cpp:1905:91: error: ‘PolyBlockType’ does not name an enumeration in ‘tesseract’
static CYTHON_INLINE PyObject* __Pyx_PyInt_From_enum__tesseract_3a__3a_PolyBlockType(enum tesseract::PolyBlockType value);
^~~~~~~~~
tesserocr.cpp:1905:102: error: ‘PolyBlockType’ in namespace ‘tesseract’ does not name a type
static CYTHON_INLINE PyObject* __Pyx_PyInt_From_enum__tesseract_3a__3a_PolyBlockType(enum tesseract::PolyBlockType value);
^~~~~~~~~~~~~
tesserocr.cpp:1920:99: error: ‘StrongScriptDirection’ does not name an enumeration in ‘tesseract’
static CYTHON_INLINE PyObject* __Pyx_PyInt_From_enum__tesseract_3a__3a_StrongScriptDirection(enum tesseract::StrongScriptDirection value);
^~~~~~~~~
tesserocr.cpp:1920:110: error: ‘StrongScriptDirection’ in namespace ‘tesseract’ does not name a type
static CYTHON_INLINE PyObject* __Pyx_PyInt_From_enum__tesseract_3a__3a_StrongScriptDirection(enum tesseract::StrongScriptDirection value);
^~~~~~~~~~~~~~~~~~~~~
tesserocr.cpp: In function ‘int __pyx_f_9tesserocr_13PyTessBaseAPI__init_api(__pyx_obj_9tesserocr_PyTessBaseAPI*, __pyx_t_10tesseract5_cchar_t*, __pyx_t_10tesseract5_cchar_t*, tesseract::OcrEngineMode, char**, int, const std::vector<std::__cxx11::basic_string<char> >*, const std::vector<std::__cxx11::basic_string<char> >*, bool, tesseract::PageSegMode)’:
tesserocr.cpp:14592:197: error: no matching function for call to ‘tesseract::TessBaseAPI::Init(__pyx_t_10tesseract5_cchar_t*&, __pyx_t_10tesseract5_cchar_t*&, tesseract::OcrEngineMode&, char**&, int&, const std::vector<std::__cxx11::basic_string<char> >*&, const std::vector<std::__cxx11::basic_string<char> >*&, bool&)’
__pyx_v_ret = __pyx_v_self->_baseapi.Init(__pyx_v_path, __pyx_v_lang, __pyx_v_oem, __pyx_v_configs, __pyx_v_configs_size, __pyx_v_vars_vec, __pyx_v_vars_vals, __pyx_v_set_only_non_debug_params);
^
In file included from tesserocr.cpp:694:0:
/usr/local/include/tesseract/baseapi.h:219:7: note: candidate: int tesseract::TessBaseAPI::Init(const char*, const char*, tesseract::OcrEngineMode, char**, int, const GenericVector<STRING>*, const GenericVector<STRING>*, bool)
int Init(const char* datapath, const char* language, OcrEngineMode mode,
^~~~
/usr/local/include/tesseract/baseapi.h:219:7: note: no known conversion for argument 6 from ‘const std::vector<std::__cxx11::basic_string<char> >*’ to ‘const GenericVector<STRING>*’
/usr/local/include/tesseract/baseapi.h:224:7: note: candidate: int tesseract::TessBaseAPI::Init(const char*, const char*, tesseract::OcrEngineMode)
int Init(const char* datapath, const char* language, OcrEngineMode oem) {
^~~~
/usr/local/include/tesseract/baseapi.h:224:7: note: candidate expects 3 arguments, 8 provided
/usr/local/include/tesseract/baseapi.h:227:7: note: candidate: int tesseract::TessBaseAPI::Init(const char*, const char*)
int Init(const char* datapath, const char* language) {
^~~~
/usr/local/include/tesseract/baseapi.h:227:7: note: candidate expects 2 arguments, 8 provided
/usr/local/include/tesseract/baseapi.h:233:7: note: candidate: int tesseract::TessBaseAPI::Init(const char*, int, const char*, tesseract::OcrEngineMode, char**, int, const GenericVector<STRING>*, const GenericVector<STRING>*, bool, tesseract::FileReader)
int Init(const char* data, int data_size, const char* language,
^~~~
...
Is my Tesseract build perhaps too old, i.e. have there been breaking API changes in Tesseract 5 recently?
The above was Python 3.6 / Tesseract v5.0.0-alpha-622-g7d94 / gcc 7.5.0. I get the same on Python 3.7 / Tesseract v5.0.0-alpha-626-gddb6 / gcc 8.3.0. Cython is the latest release, 0.29.23.
Bisection revealed that this was introduced by 8a98bf4421307c4b019a696b6ecbda95be6b7a08. The error also goes away with the most recent git version of Tesseract, v5.0.0-alpha-20210401.
That's a regression: tesserocr used to be backwards compatible and flexible. @stweil?
Backwards compatible here means that it must work with the official releases (4.1.1). And it must work with the latest releases of Tesseract 5.0.
> Backwards compatible here means that it must work with the official releases (4.1.1). And it must work with the latest releases of Tesseract 5.0.
No, tesserocr used to be compatible with a wide range of Tesseract versions, where necessary distinguishing them with ifdefs to hide the differences from the Python user. But 8a98bf4 introduced a blanket condition TESSERACT_MAJOR_VERSION >= 5, which apparently conflates several API changes, and it brought the unfortunate situation that there are now two source files to keep synchronized, tesseract.pxd and tesseract5.pxd.
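One way to avoid a single major-version switch is to detect each API difference on its own. The following is only a minimal sketch, assuming the build script can read the installed `baseapi.h`. The helper `init_takes_generic_vector` and the flag name `TESSERACT_INIT_GENERIC_VECTOR` are hypothetical and not part of tesserocr. The "old" declaration follows the compiler notes in the log above (parameter names are illustrative); the "new" one is inferred from the call that tesserocr generates.

```python
import re

# Matches an Init(...) declaration, possibly spanning several lines,
# and captures its parameter list up to the closing parenthesis.
_INIT_DECL = re.compile(r"\bint\s+Init\s*\(([^)]*)\)", re.DOTALL)


def init_takes_generic_vector(header_text):
    """Return True if any Init() overload takes GenericVector<STRING>."""
    return any(
        "GenericVector<STRING>" in params
        for params in _INIT_DECL.findall(header_text)
    )


def feature_flags(header_text, major_version):
    """Build a compile-time environment with one flag per API difference.

    TESSERACT_INIT_GENERIC_VECTOR is a hypothetical flag name.
    """
    return {
        "TESSERACT_MAJOR_VERSION": major_version,
        "TESSERACT_INIT_GENERIC_VECTOR": init_takes_generic_vector(header_text),
    }


# Shape of the declaration reported by the compiler in the build log
# (parameter names are illustrative).
old_header = """
int Init(const char* datapath, const char* language, OcrEngineMode mode,
         char** configs, int configs_size,
         const GenericVector<STRING>* vars_vec,
         const GenericVector<STRING>* vars_values,
         bool set_only_non_debug_params);
"""

# Assumed shape of the newer declaration, inferred from the generated call.
new_header = """
int Init(const char* datapath, const char* language, OcrEngineMode mode,
         char** configs, int configs_size,
         const std::vector<std::string>* vars_vec,
         const std::vector<std::string>* vars_values,
         bool set_only_non_debug_params);
"""

flags_old = feature_flags(old_header, 5)
flags_new = feature_flags(new_header, 5)
```

With flags like this, both headers report major version 5 but differ in the per-feature flag, so a single declaration file could branch on the feature instead of on the major version.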
I don't think tesserocr needs to support old or intermediate revisions of Tesseract that are completely unsupported (and buggy).
Tesseract 5 is still in development, so tesserocr cannot guarantee compatibility with it, since its API can break at any moment. All stable releases >= 3.04 are supported, and version 5 will be too once it is released.
Tesseract master seems to have been supported by tesserocr for a long time, though, despite the extra effort. Especially during the long time after LSTMs had been (hastily) integrated. And at least trying to support the alpha is not just a matter of convenience: many projects depend on the Python bindings to test and advance new features. Why is this being turned down so lightly? (It should be easy for those who made the respective changes in Tesseract recently to differentiate APIs by exact version.)
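To illustrate what differentiating by exact version could look like, here is a rough sketch that turns the version strings quoted in this thread into sortable keys. The names `version_key` and `is_at_least` are hypothetical. Treating the number after `alpha` as a single increasing serial is an assumption: it happens to order the versions mentioned here (622 < 626 < 20210401), but commit counts and date tags are different schemes. The thread also does not establish the exact revision at which the API changed.

```python
import re

_VERSION = re.compile(
    r"v?(\d+)\.(\d+)\.(\d+)"      # major.minor.patch
    r"(?:-(alpha|beta|rc)"        # optional pre-release stage
    r"(?:-(\d+))?)?"              # optional serial after the stage
)

# Pre-releases sort before the final release of the same version.
_STAGE_RANK = {"alpha": 0, "beta": 1, "rc": 2, None: 3}


def version_key(version):
    """Map a version string to a tuple that sorts in release order."""
    match = _VERSION.match(version.strip())
    if match is None:
        raise ValueError("unrecognized version string: %r" % version)
    major, minor, patch, stage, serial = match.groups()
    return (
        int(major),
        int(minor),
        int(patch),
        _STAGE_RANK[stage],
        int(serial or 0),
    )


def is_at_least(version, minimum):
    """Return True if `version` is not older than `minimum`."""
    return version_key(version) >= version_key(minimum)


installed = "5.0.0-alpha-622-g7d94"
uses_newer_api = is_at_least(installed, "5.0.0-alpha-20210401")
```

A build script could feed such a result into its compile-time environment, so that intermediate alpha builds select the older declarations even though their major version is already 5.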
Also, I still see this as the most pressing problem here:
> and it brought the unfortunate situation that there are now two source files to keep synchronized, tesseract.pxd and tesseract5.pxd.