Latest tesseract (4.0) failed to build tesserocr

Open cheermao opened this issue 7 years ago • 27 comments

On Ubuntu 14.04, Python 2.7.

When I cloned tesserocr with git and ran pip install, the following errors appeared:

.....
{'TESSERACT_VERSION': 262144}, 

 In file included from tesserocr.cpp:309:0:
    /usr/local/include/tesseract/unichar.h:164:10: error: ‘string’ does not name a type
       static string UTF32ToUTF8(const std::vector<char32>& str32);

In file included from /usr/local/include/tesseract/osdetect.h:24:0,
                     from tesserocr.cpp:321:
    /usr/local/include/tesseract/unicharset.h:241:10: error: ‘string’ does not name a type
       static string CleanupString(const char* utf8_str) {
.....

cheermao avatar Aug 01 '17 01:08 cheermao

This particular problem arises when Tesseract is not compiled with the standard library. That was added to the default makefile 2 weeks ago, so you might want to rebuild Tesseract from source with a clean pull.

Belval avatar Aug 01 '17 02:08 Belval

I hit the same problem during pip install. I pulled the latest tesseract at revision tesseract-ocr/tesseract@5f5e85e4a0827cf44f57bc87344d3fa15b067a75. @Belval, which revision should I pull to get tesserocr to compile?

zqy2084 avatar Aug 08 '17 02:08 zqy2084

tesseract-ocr/tesseract@b0ead95d64a3667339775b2f99ac37e97e90c2a0 should be good, but did you sudo make uninstall before reinstalling Tesseract 4?

Belval avatar Aug 08 '17 03:08 Belval

@Belval Yes, I did uninstall before reinstalling. Anyway, I will try reinstalling again with the revision you gave.

zqy2084 avatar Aug 08 '17 03:08 zqy2084

@zqy2084 If you still experience an issue with that revision, please post the error, it'll be easier to debug.

Belval avatar Aug 08 '17 03:08 Belval

I still get the same compile error after reinstalling tesseract at revision: tesseract-ocr/tesseract@b0ead95d64a3667339775b2f99ac37e97e90c2a0

Debian 8.7    Python 2.7

(ocr_env_py2.7)jeff@debian88:~/Projects/tesseract$ CPPFLAGS=-I/usr/local/include pip install tesserocr
Downloading/unpacking tesserocr
  http://mirrors.aliyun.com/pypi/simple/tesserocr/ uses an insecure transport scheme (http). Consider using https if mirrors.aliyun.com has it available
  Downloading tesserocr-2.2.2.tar.gz (53kB): 53kB downloaded
  Running setup.py (path:/tmp/pip-build-fMBaIA/tesserocr/setup.py) egg_info for package tesserocr
    Supporting tesseract v4.00.00dev
    Configs from pkg-config: {'libraries': ['lept', 'tesseract'], 'cython_compile_time_env': {'TESSERACT_VERSION': 262144}, 'library_dirs': ['/usr/local/lib'], 'include_dirs': ['/usr/local/include']}
    Compiling tesserocr.pyx because it changed.
    [1/1] Cythonizing tesserocr.pyx
    warning: no previously-included files found matching '*.so'
Installing collected packages: tesserocr
  Running setup.py install for tesserocr
    Supporting tesseract v4.00.00dev
    Configs from pkg-config: {'libraries': ['lept', 'tesseract'], 'cython_compile_time_env': {'TESSERACT_VERSION': 262144}, 'library_dirs': ['/usr/local/lib'], 'include_dirs': ['/usr/local/include']}
    building 'tesserocr' extension
    x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fno-strict-aliasing -D_FORTIFY_SOURCE=2 -g -fstack-protector-strong -Wformat -Werror=format-security -I/usr/local/include -fPIC -I/usr/local/include -I/usr/include/python2.7 -c tesserocr.cpp -o build/temp.linux-x86_64-2.7/tesserocr.o -std=c++11
    cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
    In file included from tesserocr.cpp:528:0:
    /usr/local/include/tesseract/unichar.h:164:10: error: ‘string’ does not name a type
       static string UTF32ToUTF8(const std::vector<char32>& str32);
              ^
    In file included from /usr/local/include/tesseract/osdetect.h:24:0,
                     from tesserocr.cpp:540:
    /usr/local/include/tesseract/unicharset.h:241:10: error: ‘string’ does not name a type
       static string CleanupString(const char* utf8_str) {
              ^
    /usr/local/include/tesseract/unicharset.h:244:10: error: ‘string’ does not name a type
       static string CleanupString(const char* utf8_str, int length);
              ^
    /usr/local/include/tesseract/unicharset.h: In member function ‘void UNICHARSET::unichar_insert_backwards_compatible(const char*)’:
    /usr/local/include/tesseract/unicharset.h:265:5: error: ‘string’ was not declared in this scope
         string cleaned = CleanupString(unichar_repr);
         ^
    /usr/local/include/tesseract/unicharset.h:265:5: note: suggested alternative:
    In file included from /usr/include/c++/4.9/string:39:0,
                     from /usr/local/include/tesseract/unichar.h:25,
                     from tesserocr.cpp:528:
    /usr/include/c++/4.9/bits/stringfwd.h:62:33: note:   ‘std::string’
       typedef basic_string<char>    string;
                                     ^
    In file included from /usr/local/include/tesseract/osdetect.h:24:0,
                     from tesserocr.cpp:540:
    /usr/local/include/tesseract/unicharset.h:266:9: error: ‘cleaned’ was not declared in this scope
         if (cleaned != unichar_repr) {
             ^
    error: command 'x86_64-linux-gnu-gcc' failed with exit status 1
    Complete output from command /home/jeff/Projects/ocr_env_py2.7/bin/python2 -c "import setuptools, tokenize;__file__='/tmp/pip-build-fMBaIA/tesserocr/setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record /tmp/pip-lXHKOs-record/install-record.txt --single-version-externally-managed --compile --install-headers /home/jeff/Projects/ocr_env_py2.7/include/site/python2.7:
    Supporting tesseract v4.00.00dev
Configs from pkg-config: {'libraries': ['lept', 'tesseract'], 'cython_compile_time_env': {'TESSERACT_VERSION': 262144}, 'library_dirs': ['/usr/local/lib'], 'include_dirs': ['/usr/local/include']}
running install
running build
running build_ext
building 'tesserocr' extension
creating build
creating build/temp.linux-x86_64-2.7
x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fno-strict-aliasing -D_FORTIFY_SOURCE=2 -g -fstack-protector-strong -Wformat -Werror=format-security -I/usr/local/include -fPIC -I/usr/local/include -I/usr/include/python2.7 -c tesserocr.cpp -o build/temp.linux-x86_64-2.7/tesserocr.o -std=c++11
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
In file included from tesserocr.cpp:528:0:
/usr/local/include/tesseract/unichar.h:164:10: error: ‘string’ does not name a type
   static string UTF32ToUTF8(const std::vector<char32>& str32);
          ^
In file included from /usr/local/include/tesseract/osdetect.h:24:0,
                 from tesserocr.cpp:540:
/usr/local/include/tesseract/unicharset.h:241:10: error: ‘string’ does not name a type
   static string CleanupString(const char* utf8_str) {
          ^
/usr/local/include/tesseract/unicharset.h:244:10: error: ‘string’ does not name a type
   static string CleanupString(const char* utf8_str, int length);
          ^
/usr/local/include/tesseract/unicharset.h: In member function ‘void UNICHARSET::unichar_insert_backwards_compatible(const char*)’:
/usr/local/include/tesseract/unicharset.h:265:5: error: ‘string’ was not declared in this scope
     string cleaned = CleanupString(unichar_repr);
     ^
/usr/local/include/tesseract/unicharset.h:265:5: note: suggested alternative:
In file included from /usr/include/c++/4.9/string:39:0,
                 from /usr/local/include/tesseract/unichar.h:25,
                 from tesserocr.cpp:528:
/usr/include/c++/4.9/bits/stringfwd.h:62:33: note:   ‘std::string’
   typedef basic_string<char>    string;
                                 ^
In file included from /usr/local/include/tesseract/osdetect.h:24:0,
                 from tesserocr.cpp:540:
/usr/local/include/tesseract/unicharset.h:266:9: error: ‘cleaned’ was not declared in this scope
     if (cleaned != unichar_repr) {
         ^
error: command 'x86_64-linux-gnu-gcc' failed with exit status 1

----------------------------------------
Cleaning up...
Command /home/jeff/Projects/ocr_env_py2.7/bin/python2 -c "import setuptools, tokenize;__file__='/tmp/pip-build-fMBaIA/tesserocr/setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record /tmp/pip-lXHKOs-record/install-record.txt --single-version-externally-managed --compile --install-headers /home/jeff/Projects/ocr_env_py2.7/include/site/python2.7 failed with error code 1 in /tmp/pip-build-fMBaIA/tesserocr
Traceback (most recent call last):
  File "/home/jeff/Projects/ocr_env_py2.7/bin/pip", line 11, in <module>
    sys.exit(main())
  File "/home/jeff/Projects/ocr_env_py2.7/local/lib/python2.7/site-packages/pip/__init__.py", line 248, in main
    return command.main(cmd_args)
  File "/home/jeff/Projects/ocr_env_py2.7/local/lib/python2.7/site-packages/pip/basecommand.py", line 161, in main
    text = '\n'.join(complete_log)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 42: ordinal not in range(128)

(ocr_env_py2.7)jeff@debian88:~/Projects/tesseract$ tesseract --version
tesseract 4.00.00alpha
 leptonica-1.74.4
  libgif 4.1.6(?) : libjpeg 6b (libjpeg-turbo 1.3.1) : libpng 1.2.50 : libtiff 4.0.3 : zlib 1.2.8

 Found AVX
 Found SSE
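
The traceback at the end of the log above looks like a secondary failure rather than the cause of the build error. gcc prints its diagnostics with curly quotes (‘ is U+2018, encoded as E2 80 98 in UTF-8), and the `'\n'.join(complete_log)` call in pip under Python 2 appears to decode those bytes implicitly as ASCII. A minimal Python 3 sketch of the same decode failure, using the indented warning line as it appears in the log; the variable names are illustrative only:

```python
# One line of compiler output, copied from the log above (note the curly quotes).
log_line = "    cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++"

raw = log_line.encode("utf-8")  # the bytes gcc would write in a UTF-8 locale

try:
    raw.decode("ascii")  # what an ASCII-only decode attempts
    failure = None
except UnicodeDecodeError as exc:
    failure = exc

# The first non-ASCII byte is the lead byte of the opening curly quote.
bad_byte = raw[failure.start]
print(hex(bad_byte), failure.start)
```

The compile errors above the traceback are the ones that matter; the traceback only hides them behind a less helpful message.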

zqy2084 avatar Aug 08 '17 03:08 zqy2084

~~master seems to have some api files missing (version.h for example), not sure why~~. If you want to use tesseract v4, I suggest installing one of the 4.00.00* tags (4.00.00dev is the latest right now) instead of master to get something that works.

Edit: files are missing only when installing with cmake; a regular install is fine (but tesserocr fails to compile with the same error as reported).

sirfz avatar Aug 08 '17 13:08 sirfz

I was able to install tesserocr after replacing string with std::string in /usr/local/include/tesseract/unicharset.h and /usr/local/include/tesseract/unichar.h.
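
A minimal sketch of that workaround, assuming the only bare uses of string in those headers are type names followed by an identifier (as in the error messages above). The function name `qualify_string` and the regular expression are illustrative and not part of tesseract or tesserocr. A pattern like this would also rewrite comments that happen to match, so review the diff before installing patched headers.

```python
import re

# Bare `string` used as a type: not preceded by an identifier character or ':'
# (so `std::string` and `basic_string` are skipped) and followed by whitespace
# plus an identifier (so `#include <string>` and `string;` are skipped).
BARE_STRING = re.compile(r"(?<![\w:])string(?=\s+[A-Za-z_])")

def qualify_string(header_text):
    """Return header_text with bare `string` type names rewritten to `std::string`."""
    return BARE_STRING.sub("std::string", header_text)

# Sample lines modelled on the declarations quoted in the compiler errors.
sample = "\n".join([
    "#include <string>",
    "  static string UTF32ToUTF8(const std::vector<char32>& str32);",
    "    string cleaned = CleanupString(unichar_repr);",
    "    std::string already_qualified;",
])
print(qualify_string(sample))
```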

sirfz avatar Aug 08 '17 13:08 sirfz

@sirfz We can certainly hand-edit tesseract's include files to get tesserocr to compile, but I believe it should build without any hardcoded changes, especially not to tesseract itself.

zqy2084 avatar Aug 08 '17 14:08 zqy2084

@zqy2084 Yes, of course. Since tesseract v4 is still an alpha version, installation can break at any moment; that's why you should install a release instead (i.e. 4.00.00alpha or 4.00.00dev). This hack is only for when you really want to use the latest bleeding edge.

sirfz avatar Aug 08 '17 14:08 sirfz

@sirfz This is quite unsettling because they fixed it less than 3 weeks ago, I'll check if I can reproduce the issue tonight and reopen an issue on their repo if I get a case.

@zqy2084 If you did get that error with this particular commit, the problem is with a previously installed version of Tesseract 4 on your system. I can assure you that it builds with that commit.

Belval avatar Aug 08 '17 15:08 Belval

@Belval I installed the latest pull from the master branch. These changes were added in unichar.h on 2017-07-14 (tesseract-ocr/tesseract@da03e4e9105b6262706d40ef2b4436eae4ebe19f) and unicharset.h on 2017-07-24 (tesseract-ocr/tesseract@b0ead95d64a3667339775b2f99ac37e97e90c2a0).

sirfz avatar Aug 08 '17 16:08 sirfz

@sirfz Yes but I opened this issue and the next day they changed it to add -DUSING_STD_NAMESPACE as a fix. The commit I pointed to should work.

Belval avatar Aug 08 '17 16:08 Belval

The fix didn't work for me @Belval:

$ python setup.py build_ext --inplace
Supporting tesseract v4.00.00dev
Configs from pkg-config: {'libraries': ['lept', 'tesseract'], 'cython_compile_time_env': {'TESSERACT_VERSION': 262144}, 'library_dirs': ['/usr/local/lib'], 'include_dirs': ['/usr/local/include']}
running build_ext
building 'tesserocr' extension
x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fno-strict-aliasing -Wdate-time -D_FORTIFY_SOURCE=2 -g -fdebug-prefix-map=/build/python2.7-ZZaKJ6/python2.7-2.7.13=. -fstack-protector-strong -Wformat -Werror=format-security -fPIC -I/usr/local/include -I/usr/include/python2.7 -c tesserocr.cpp -o build/temp.linux-x86_64-2.7/tesserocr.o -std=c++11 -DUSING_STD_NAMESPACE
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
In file included from tesserocr.cpp:529:0:
/usr/local/include/tesseract/unichar.h:164:10: error: ‘string’ does not name a type
   static string UTF32ToUTF8(const std::vector<char32>& str32);
          ^~~~~~
In file included from /usr/local/include/tesseract/osdetect.h:24:0,
                 from tesserocr.cpp:541:
/usr/local/include/tesseract/unicharset.h:241:10: error: ‘string’ does not name a type
   static string CleanupString(const char* utf8_str) {
          ^~~~~~
/usr/local/include/tesseract/unicharset.h:244:10: error: ‘string’ does not name a type
   static string CleanupString(const char* utf8_str, int length);
          ^~~~~~
/usr/local/include/tesseract/unicharset.h: In member function ‘void UNICHARSET::unichar_insert_backwards_compatible(const char*)’:
/usr/local/include/tesseract/unicharset.h:265:53: error: ‘CleanupString’ was not declared in this scope
     std::string cleaned = CleanupString(unichar_repr);
                                                     ^
error: command 'x86_64-linux-gnu-gcc' failed with exit status 1

As you can see -DUSING_STD_NAMESPACE is used but doesn't help. Is there something I'm missing?

sirfz avatar Aug 08 '17 16:08 sirfz

@sirfz Ok, I'll report back with results

Belval avatar Aug 08 '17 17:08 Belval

@sirfz @zqy2084 Can confirm that it is broken again... I'll report back when I find a decent fix.

Belval avatar Aug 08 '17 17:08 Belval

I had the same error for a different reason:

running install
running build
running build_ext
building 'tesserocr' extension
creating build
creating build/temp.linux-x86_64-2.7
x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fno-strict-aliasing -Wdate-time -D_FORTIFY_SOURCE=2 -g -fstack-protector-strong -Wformat -Werror=format-security -fPIC -I/usr/local/include -I/usr/include/python2.7 -c tesserocr.cpp -o build/temp.linux-x86_64-2.7/tesserocr.o -std=c++11
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
tesserocr.cpp: In function ‘tesseract::TessResultRenderer* __pyx_f_9tesserocr_13PyTessBaseAPI__get_renderer(__pyx_obj_9tesserocr_PyTessBaseAPI*, __pyx_t_9tesseract_cchar_t*)’:
tesserocr.cpp:21042:124: error: no matching function for call to ‘tesseract::TessPDFRenderer::TessPDFRenderer(__pyx_t_9tesseract_cchar_t*&, const char*, bool&)’
       __pyx_t_7 = new tesseract::TessPDFRenderer(__pyx_v_outputbase, __pyx_v_self->_baseapi.GetDatapath(), __pyx_v_textonly);
                                                                                                                            ^
In file included from tesserocr.cpp:539:0:
/usr/local/include/tesseract/renderer.h:189:3: note: candidate: tesseract::TessPDFRenderer::TessPDFRenderer(const char*, const char*)
   TessPDFRenderer(const char *outputbase, const char *datadir);
   ^
/usr/local/include/tesseract/renderer.h:189:3: note:   candidate expects 2 arguments, 3 provided
/usr/local/include/tesseract/renderer.h:185:16: note: candidate: tesseract::TessPDFRenderer::TessPDFRenderer(const tesseract::TessPDFRenderer&)
 class TESS_API TessPDFRenderer : public TessResultRenderer {
                ^
/usr/local/include/tesseract/renderer.h:185:16: note:   candidate expects 1 argument, 3 provided
/usr/local/include/tesseract/renderer.h:185:16: note: candidate: tesseract::TessPDFRenderer::TessPDFRenderer(tesseract::TessPDFRenderer&&)
/usr/local/include/tesseract/renderer.h:185:16: note:   candidate expects 1 argument, 3 provided
tesserocr.cpp: In function ‘PyObject* __pyx_pf_9tesserocr_13PyTessBaseAPI_96DetectOrientationScript(__pyx_obj_9tesserocr_PyTessBaseAPI*)’:
tesserocr.cpp:23345:39: error: ‘class tesseract::TessBaseAPI’ has no member named ‘DetectOrientationScript’
   __pyx_t_1 = (__pyx_v_self->_baseapi.DetectOrientationScript((&__pyx_v_orient_deg), (&__pyx_v_orient_conf), (&__pyx_v_script_name), (&__pyx_v_script_conf)) != 0);
                                       ^
tesserocr.cpp: In function ‘void inittesserocr()’:
tesserocr.cpp:34528:69: error: ‘OEM_TESSERACT_LSTM_COMBINED’ is not a member of ‘tesseract’
   __pyx_t_1 = __Pyx_PyInt_From_enum__tesseract_3a__3a_OcrEngineMode(tesseract::OEM_TESSERACT_LSTM_COMBINED); if (unlikely(!__pyx_t_1)) __PYX_ERR(0, 83, __pyx_L1_error)
                                                                     ^
error: command 'x86_64-linux-gnu-gcc' failed with exit status 1

TerryZH avatar Aug 13 '17 08:08 TerryZH

4.00.00alpha and 4.00.00dev releases aren't supported any more, you'll need to compile a newer version. If the issue reported here hasn't been fixed already in master then I suggest installing tesseract-ocr/tesseract@f5c18f78c09ab028791c28c638d5cc2f96c6d6fb instead (last commit before the problem was introduced).

sirfz avatar Aug 13 '17 10:08 sirfz

The correct directive is -DUSE_STD_NAMESPACE (see tesseract-ocr/tesseract#1045). You should be able to install the current tesserocr version by adding CPPFLAGS=-DUSE_STD_NAMESPACE prefix to the installation command.
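
As a rough sketch of what that prefix does, the snippet below builds the environment for the pip call in Python, appending the define to any CPPFLAGS already set (such as the -I/usr/local/include used earlier in this thread). The helper name `env_with_cppflags` is hypothetical, and the actual install call is left commented out.

```python
import os
import sys

def env_with_cppflags(extra, base_env=None):
    """Copy the environment and append `extra` to CPPFLAGS, keeping existing flags."""
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("CPPFLAGS", "").strip()
    env["CPPFLAGS"] = (existing + " " + extra).strip()
    return env

# Equivalent of: CPPFLAGS=-DUSE_STD_NAMESPACE pip install tesserocr
install_cmd = [sys.executable, "-m", "pip", "install", "tesserocr"]
install_env = env_with_cppflags("-DUSE_STD_NAMESPACE")

# Uncomment to actually run the build:
# import subprocess
# subprocess.check_call(install_cmd, env=install_env)
```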

sirfz avatar Aug 17 '17 14:08 sirfz

I fixed it using this method:

  1. git clone this source code
  2. pip install .

yumaofan avatar Sep 14 '17 02:09 yumaofan

Thanks @sirfz. At last I was able to build tesserocr correctly by replacing string with std::string in /usr/local/include/tesseract/unicharset.h and /usr/local/include/tesseract/unichar.h

antikytheraton avatar Dec 06 '17 20:12 antikytheraton

Tesseract is now at beta 1 stage.

This version uses std::string everywhere. USE_STD_NAMESPACE is not needed anymore. @sirfz, you can revert c8464d13a.

amitdo avatar Mar 14 '18 18:03 amitdo

Thanks @amitdo, reopened until this is fixed

sirfz avatar Mar 15 '18 18:03 sirfz

So I got this working with the 4.00-beta. All I ended up doing was changing 0x040000 to 0x04 in setup.py, tesseract.pxd and tesserocr.pyx, and removing the -DUSE_STD_NAMESPACE flag from setup.py. It then compiled correctly, giving only the -Wstrict-prototypes warning. pip install . was successful, and tesserocr imports and works.

The only difference seems to be that I need to pass path='/mypathtotessdata/' to the PyTessBaseAPI context manager.

edit: on second thought, this probably breaks installation for previous versions...

edit2: the version_to_int function in setup.py finds a version string of 4.0.0, which would get turned into 1024, i.e. 0x0400. So I'm guessing 0x0400 would be the correct value for all those expressions that check the version?

IntegralTriad avatar Mar 20 '18 02:03 IntegralTriad

@amitdo tesserocr compiled correctly for me without any changes against beta.1 (checked out release tesseract-ocr/tesseract@40f43111e05b3dd2f2f8aeae3aba33016523c881). Is this a change after this release?

@IntegralTriad changing the version number to 0x04 is not a solution and actually breaks compilation against 3.x. Also, version_to_int returns 262144 (for 4 beta.1 as well as alpha) which equates to 0x40000.
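
The arithmetic can be checked with a small sketch. The function below is not tesserocr's version_to_int; it is a hypothetical `encode_version` that packs major, minor and patch into one byte each, which is one encoding consistent with the numbers quoted in this thread (262144 for 4.00.00).

```python
import re

def encode_version(version):
    """Pack 'major.minor.patch' into an integer, one byte per component."""
    # Take the leading dotted number, ignoring suffixes such as 'dev' or 'alpha'.
    numbers = re.search(r"\d+(?:\.\d+)*", version).group().split(".")
    # Pad with zeros so that '4' or '4.0' are also accepted.
    major, minor, patch = (list(map(int, numbers)) + [0, 0])[:3]
    return (major << 16) | (minor << 8) | patch

v4 = encode_version("4.00.00dev")
print(v4, hex(v4))
```

Under this encoding, 0x04 and 0x0400 are both smaller than any 3.x value, so a version check written against them would no longer separate 3.x from 4.x.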

sirfz avatar Mar 22 '18 14:03 sirfz

> @amitdo tesserocr compiled correctly for me without any changes against beta.1 (checked out release tesseract-ocr/tesseract@40f4311). Is this a change after this release?

https://github.com/ropensci/tesseract/issues/24#issuecomment-373133986

amitdo avatar Mar 22 '18 15:03 amitdo

> Is this a change after this release?

Yes.

I didn't say your code will fail to compile with newer tesseract... Still, c8464d1 is not necessary anymore.

amitdo avatar Mar 22 '18 15:03 amitdo