does not compile against libtesseract anymore

Open bertsky opened this issue 4 years ago • 7 comments

With the current master, I cannot pip install anymore:

  Building wheel for tesserocr (setup.py) ... error
  ERROR: Command errored out with exit status 1:
   command: /data/venv/bin/python3 -u -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-req-build-yyb85qtw/setup.py'"'"'; __file__='"'"'/tmp/pip-req-build-yyb85qtw/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d /tmp/pip-wheel-e_hk2brj
       cwd: /tmp/pip-req-build-yyb85qtw/
  Complete output (418 lines):
  Supporting tesseract v5.0.0-alpha-622-g7d94
  Tesseract major version 5
  Configs from pkg-config: {'library_dirs': ['/usr/local/lib', '/usr/local/lib'], 'include_dirs': ['/usr/local/include', '/usr/local/include', '/usr/local/include'], 'libraries': ['tesseract', 'archive', 'curl', 'lept'], 'compile_time_env': {'TESSERACT_MAJOR_VERSION': 5, 'TESSERACT_VERSION': 1234798114}}
  running bdist_wheel
  running build
  running build_ext
  Detected compiler: unix
  building 'tesserocr' extension
  x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -I/usr/local/include -I/usr/local/include -I/usr/local/include -I/data/venv/include -I/usr/include/python3.6m -c tesserocr.cpp -o build/temp.linux-x86_64-3.6/tesserocr.o -std=c++11 -DUSE_STD_NAMESPACE
  tesserocr.cpp:1905:91: error: ‘PolyBlockType’ does not name an enumeration in ‘tesseract’
   static CYTHON_INLINE PyObject* __Pyx_PyInt_From_enum__tesseract_3a__3a_PolyBlockType(enum tesseract::PolyBlockType value);
                                                                                             ^~~~~~~~~
  tesserocr.cpp:1905:102: error: ‘PolyBlockType’ in namespace ‘tesseract’ does not name a type
   static CYTHON_INLINE PyObject* __Pyx_PyInt_From_enum__tesseract_3a__3a_PolyBlockType(enum tesseract::PolyBlockType value);
                                                                                                        ^~~~~~~~~~~~~
  tesserocr.cpp:1920:99: error: ‘StrongScriptDirection’ does not name an enumeration in ‘tesseract’
   static CYTHON_INLINE PyObject* __Pyx_PyInt_From_enum__tesseract_3a__3a_StrongScriptDirection(enum tesseract::StrongScriptDirection value);
                                                                                                     ^~~~~~~~~
  tesserocr.cpp:1920:110: error: ‘StrongScriptDirection’ in namespace ‘tesseract’ does not name a type
   static CYTHON_INLINE PyObject* __Pyx_PyInt_From_enum__tesseract_3a__3a_StrongScriptDirection(enum tesseract::StrongScriptDirection value);
                                                                                                                ^~~~~~~~~~~~~~~~~~~~~
  tesserocr.cpp: In function ‘int __pyx_f_9tesserocr_13PyTessBaseAPI__init_api(__pyx_obj_9tesserocr_PyTessBaseAPI*, __pyx_t_10tesseract5_cchar_t*, __pyx_t_10tesseract5_cchar_t*, tesseract::OcrEngineMode, char**, int, const std::vector<std::__cxx11::basic_string<char> >*, const std::vector<std::__cxx11::basic_string<char> >*, bool, tesseract::PageSegMode)’:
  tesserocr.cpp:14592:197: error: no matching function for call to ‘tesseract::TessBaseAPI::Init(__pyx_t_10tesseract5_cchar_t*&, __pyx_t_10tesseract5_cchar_t*&, tesseract::OcrEngineMode&, char**&, int&, const std::vector<std::__cxx11::basic_string<char> >*&, const std::vector<std::__cxx11::basic_string<char> >*&, bool&)’
       __pyx_v_ret = __pyx_v_self->_baseapi.Init(__pyx_v_path, __pyx_v_lang, __pyx_v_oem, __pyx_v_configs, __pyx_v_configs_size, __pyx_v_vars_vec, __pyx_v_vars_vals, __pyx_v_set_only_non_debug_params);
                                                                                                                                                                                                       ^
  In file included from tesserocr.cpp:694:0:
  /usr/local/include/tesseract/baseapi.h:219:7: note: candidate: int tesseract::TessBaseAPI::Init(const char*, const char*, tesseract::OcrEngineMode, char**, int, const GenericVector<STRING>*, const GenericVector<STRING>*, bool)
     int Init(const char* datapath, const char* language, OcrEngineMode mode,
         ^~~~
  /usr/local/include/tesseract/baseapi.h:219:7: note:   no known conversion for argument 6 from ‘const std::vector<std::__cxx11::basic_string<char> >*’ to ‘const GenericVector<STRING>*’
  /usr/local/include/tesseract/baseapi.h:224:7: note: candidate: int tesseract::TessBaseAPI::Init(const char*, const char*, tesseract::OcrEngineMode)
     int Init(const char* datapath, const char* language, OcrEngineMode oem) {
         ^~~~
  /usr/local/include/tesseract/baseapi.h:224:7: note:   candidate expects 3 arguments, 8 provided
  /usr/local/include/tesseract/baseapi.h:227:7: note: candidate: int tesseract::TessBaseAPI::Init(const char*, const char*)
     int Init(const char* datapath, const char* language) {
         ^~~~
  /usr/local/include/tesseract/baseapi.h:227:7: note:   candidate expects 2 arguments, 8 provided
  /usr/local/include/tesseract/baseapi.h:233:7: note: candidate: int tesseract::TessBaseAPI::Init(const char*, int, const char*, tesseract::OcrEngineMode, char**, int, const GenericVector<STRING>*, const GenericVector<STRING>*, bool, tesseract::FileReader)
     int Init(const char* data, int data_size, const char* language,
         ^~~~
...

Is my Tesseract perhaps too old, i.e. have there been breaking API changes in Tesseract 5 recently?

Jul 02 '21 21:07 bertsky

The above was Python 3.6 / Tesseract v5.0.0-alpha-622-g7d94 / gcc 7.5.0. I get the same on Python 3.7 / Tesseract v5.0.0-alpha-626-gddb6 / gcc 8.3.0. Cython is the newest release, 0.29.23.

Jul 02 '21 21:07 bertsky

Bisection revealed that this was introduced by 8a98bf4421307c4b019a696b6ecbda95be6b7a08. The error also goes away with the most recent git version of Tesseract, v5.0.0-alpha-20210401.

That's a regression: tesserocr used to be backwards compatible and flexible. @stweil?

Jul 02 '21 22:07 bertsky

Backwards compatible here means that it must work with the official releases (4.1.1). And it must work with the latest releases of Tesseract 5.0.

Jul 03 '21 08:07 stweil

> Backwards compatible here means that it must work with the official releases (4.1.1). And it must work with the latest releases of Tesseract 5.0.

No, tesserocr used to be compatible with a wide range of Tesseract versions, where necessary differentiating them with ifdefs to hide the differences from the Python user. But 8a98bf4 introduced a blanket condition TESSERACT_MAJOR_VERSION >= 5, which apparently conflates several API changes, and it brought the unfortunate situation that there are now two source files to keep synchronized, tesseract.pxd and tesseract5.pxd.
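
To illustrate what differentiating by exact version (rather than by major version alone) could look like on the build side, here is a minimal sketch. It is not tesserocr's actual setup code. The helper names, the flag TESSERACT_HAS_STD_VECTOR_INIT, and the cut-off rule are all hypothetical; the rule is inferred only from the data points in this thread (the undated alpha builds 622 and 626 fail, the dated tag alpha-20210401 builds). The returned dict has the same shape as the compile_time_env entry in the build log above.

```python
import re
from typing import NamedTuple, Optional


class TessVersion(NamedTuple):
    major: int
    minor: int
    patch: int
    pre: Optional[str]      # "alpha", "beta", "rc", or None for a final release
    date: Optional[int]     # date suffix of a tag such as alpha-20210401
    commits: Optional[int]  # commits since the tag, as in "-622-g7d94"


# Accepts strings like "4.1.1", "v5.0.0-alpha-622-g7d94", "5.0.0-alpha-20210401".
_VERSION_RE = re.compile(
    r"^v?(?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+)"
    r"(?:-(?P<pre>alpha|beta|rc)(?:-(?P<date>\d{8}))?)?"
    r"(?:-(?P<commits>\d+)-g[0-9a-f]+)?$"
)


def parse_version(text: str) -> TessVersion:
    """Split a Tesseract version string into comparable parts."""
    match = _VERSION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized Tesseract version: {text!r}")
    groups = match.groupdict()

    def to_int(key: str) -> Optional[int]:
        return int(groups[key]) if groups[key] is not None else None

    return TessVersion(
        int(groups["major"]), int(groups["minor"]), int(groups["patch"]),
        groups["pre"], to_int("date"), to_int("commits"),
    )


def compile_time_env(version_string: str, new_api_since_date: int = 20210401) -> dict:
    """Derive per-feature flags instead of one blanket major-version check.

    ASSUMPTION (hypothetical rule): final 5.x releases and pre-releases with a
    date tag >= new_api_since_date take std::vector<std::string> in Init();
    everything older still takes GenericVector<STRING>.
    """
    v = parse_version(version_string)
    dated_new = v.date is not None and v.date >= new_api_since_date
    has_std_vector_init = v.major > 5 or (v.major == 5 and (v.pre is None or dated_new))
    return {
        "TESSERACT_MAJOR_VERSION": v.major,
        "TESSERACT_HAS_STD_VECTOR_INIT": has_std_vector_init,  # hypothetical flag
    }


if __name__ == "__main__":
    for s in ("4.1.1", "5.0.0-alpha-622-g7d94", "5.0.0-alpha-20210401"):
        print(s, compile_time_env(s))
```

With one flag per API difference, a single .pxd could branch on the flag it actually depends on, which is the kind of encapsulation meant above.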

Jul 03 '21 10:07 bertsky

I don't think that it is necessary that Tesserocr supports old or intermediate revisions of Tesseract which are completely unsupported (and buggy).

Jul 03 '21 11:07 stweil

Tesseract 5 is still in development, so tesserocr cannot guarantee compatibility with it, since it can break at any moment. All stable releases >= 3.04 are supported, and version 5 will be too once it's released.

Jul 03 '21 13:07 sirfz

Tesseract master seems to have been supported by tesserocr for a long time, though, despite the extra effort. Especially during the long time after LSTMs had been (hastily) integrated. And at least trying to support the alpha is not just a matter of convenience: many projects depend on the Python bindings to test and advance new features. Why is this being turned down so lightly? (It should be easy for those who made the respective changes in Tesseract recently to differentiate APIs by exact version.)

Also, I still see this as the most pressing problem here:

> and it brought the unfortunate situation that there are now two source files to keep synchronized, tesseract.pxd and tesseract5.pxd.

Sep 13 '21 12:09 bertsky