tesserocr
Latest tesseract (4.0) failed to build tesserocr
On Ubuntu 14.04 with Python 2.7.
I cloned tesserocr with git and ran pip install. The following output appeared:
.....
{'TESSERACT_VERSION': 262144},
In file included from tesserocr.cpp:309:0:
/usr/local/include/tesseract/unichar.h:164:10: error: ‘string’ does not name a type
static string UTF32ToUTF8(const std::vector<char32>& str32);
In file included from /usr/local/include/tesseract/osdetect.h:24:0,
from tesserocr.cpp:321:
/usr/local/include/tesseract/unicharset.h:241:10: error: ‘string’ does not name a type
static string CleanupString(const char* utf8_str) {
.....
This particular problem arises when Tesseract was not compiled with the standard library. It was added to the default makefile 2 weeks ago, so you might want to rebuild Tesseract from source from a clean pull.
I hit the same problem during pip install. I pulled the latest tesseract at revision tesseract-ocr/tesseract@5f5e85e4a0827cf44f57bc87344d3fa15b067a75. @Belval, which revision should I pull to get tesserocr to compile?
tesseract-ocr/tesseract@b0ead95d64a3667339775b2f99ac37e97e90c2a0 should be good, but did you run sudo make uninstall before reinstalling Tesseract 4?
@Belval Yes, I uninstalled before reinstalling. Anyway, I will try reinstalling with the revision you gave.
@zqy2084 If you still experience an issue with that revision, please post the error, it'll be easier to debug.
I still get the same compile error after reinstalling tesseract at revision: tesseract-ocr/tesseract@b0ead95d64a3667339775b2f99ac37e97e90c2a0
Debian 8.7 Python 2.7
(ocr_env_py2.7)jeff@debian88:~/Projects/tesseract$ CPPFLAGS=-I/usr/local/include pip install tesserocr
Downloading/unpacking tesserocr
http://mirrors.aliyun.com/pypi/simple/tesserocr/ uses an insecure transport scheme (http). Consider using https if mirrors.aliyun.com has it available
Downloading tesserocr-2.2.2.tar.gz (53kB): 53kB downloaded
Running setup.py (path:/tmp/pip-build-fMBaIA/tesserocr/setup.py) egg_info for package tesserocr
Supporting tesseract v4.00.00dev
Configs from pkg-config: {'libraries': ['lept', 'tesseract'], 'cython_compile_time_env': {'TESSERACT_VERSION': 262144}, 'library_dirs': ['/usr/local/lib'], 'include_dirs': ['/usr/local/include']}
Compiling tesserocr.pyx because it changed.
[1/1] Cythonizing tesserocr.pyx
warning: no previously-included files found matching '*.so'
Installing collected packages: tesserocr
Running setup.py install for tesserocr
Supporting tesseract v4.00.00dev
Configs from pkg-config: {'libraries': ['lept', 'tesseract'], 'cython_compile_time_env': {'TESSERACT_VERSION': 262144}, 'library_dirs': ['/usr/local/lib'], 'include_dirs': ['/usr/local/include']}
building 'tesserocr' extension
x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fno-strict-aliasing -D_FORTIFY_SOURCE=2 -g -fstack-protector-strong -Wformat -Werror=format-security -I/usr/local/include -fPIC -I/usr/local/include -I/usr/include/python2.7 -c tesserocr.cpp -o build/temp.linux-x86_64-2.7/tesserocr.o -std=c++11
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
In file included from tesserocr.cpp:528:0:
/usr/local/include/tesseract/unichar.h:164:10: error: ‘string’ does not name a type
static string UTF32ToUTF8(const std::vector<char32>& str32);
^
In file included from /usr/local/include/tesseract/osdetect.h:24:0,
from tesserocr.cpp:540:
/usr/local/include/tesseract/unicharset.h:241:10: error: ‘string’ does not name a type
static string CleanupString(const char* utf8_str) {
^
/usr/local/include/tesseract/unicharset.h:244:10: error: ‘string’ does not name a type
static string CleanupString(const char* utf8_str, int length);
^
/usr/local/include/tesseract/unicharset.h: In member function ‘void UNICHARSET::unichar_insert_backwards_compatible(const char*)’:
/usr/local/include/tesseract/unicharset.h:265:5: error: ‘string’ was not declared in this scope
string cleaned = CleanupString(unichar_repr);
^
/usr/local/include/tesseract/unicharset.h:265:5: note: suggested alternative:
In file included from /usr/include/c++/4.9/string:39:0,
from /usr/local/include/tesseract/unichar.h:25,
from tesserocr.cpp:528:
/usr/include/c++/4.9/bits/stringfwd.h:62:33: note: ‘std::string’
typedef basic_string<char> string;
^
In file included from /usr/local/include/tesseract/osdetect.h:24:0,
from tesserocr.cpp:540:
/usr/local/include/tesseract/unicharset.h:266:9: error: ‘cleaned’ was not declared in this scope
if (cleaned != unichar_repr) {
^
error: command 'x86_64-linux-gnu-gcc' failed with exit status 1
Complete output from command /home/jeff/Projects/ocr_env_py2.7/bin/python2 -c "import setuptools, tokenize;__file__='/tmp/pip-build-fMBaIA/tesserocr/setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record /tmp/pip-lXHKOs-record/install-record.txt --single-version-externally-managed --compile --install-headers /home/jeff/Projects/ocr_env_py2.7/include/site/python2.7:
Supporting tesseract v4.00.00dev
Configs from pkg-config: {'libraries': ['lept', 'tesseract'], 'cython_compile_time_env': {'TESSERACT_VERSION': 262144}, 'library_dirs': ['/usr/local/lib'], 'include_dirs': ['/usr/local/include']}
running install
running build
running build_ext
building 'tesserocr' extension
creating build
creating build/temp.linux-x86_64-2.7
x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fno-strict-aliasing -D_FORTIFY_SOURCE=2 -g -fstack-protector-strong -Wformat -Werror=format-security -I/usr/local/include -fPIC -I/usr/local/include -I/usr/include/python2.7 -c tesserocr.cpp -o build/temp.linux-x86_64-2.7/tesserocr.o -std=c++11
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
In file included from tesserocr.cpp:528:0:
/usr/local/include/tesseract/unichar.h:164:10: error: ‘string’ does not name a type
static string UTF32ToUTF8(const std::vector<char32>& str32);
^
In file included from /usr/local/include/tesseract/osdetect.h:24:0,
from tesserocr.cpp:540:
/usr/local/include/tesseract/unicharset.h:241:10: error: ‘string’ does not name a type
static string CleanupString(const char* utf8_str) {
^
/usr/local/include/tesseract/unicharset.h:244:10: error: ‘string’ does not name a type
static string CleanupString(const char* utf8_str, int length);
^
/usr/local/include/tesseract/unicharset.h: In member function ‘void UNICHARSET::unichar_insert_backwards_compatible(const char*)’:
/usr/local/include/tesseract/unicharset.h:265:5: error: ‘string’ was not declared in this scope
string cleaned = CleanupString(unichar_repr);
^
/usr/local/include/tesseract/unicharset.h:265:5: note: suggested alternative:
In file included from /usr/include/c++/4.9/string:39:0,
from /usr/local/include/tesseract/unichar.h:25,
from tesserocr.cpp:528:
/usr/include/c++/4.9/bits/stringfwd.h:62:33: note: ‘std::string’
typedef basic_string<char> string;
^
In file included from /usr/local/include/tesseract/osdetect.h:24:0,
from tesserocr.cpp:540:
/usr/local/include/tesseract/unicharset.h:266:9: error: ‘cleaned’ was not declared in this scope
if (cleaned != unichar_repr) {
^
error: command 'x86_64-linux-gnu-gcc' failed with exit status 1
----------------------------------------
Cleaning up...
Command /home/jeff/Projects/ocr_env_py2.7/bin/python2 -c "import setuptools, tokenize;__file__='/tmp/pip-build-fMBaIA/tesserocr/setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record /tmp/pip-lXHKOs-record/install-record.txt --single-version-externally-managed --compile --install-headers /home/jeff/Projects/ocr_env_py2.7/include/site/python2.7 failed with error code 1 in /tmp/pip-build-fMBaIA/tesserocr
Traceback (most recent call last):
File "/home/jeff/Projects/ocr_env_py2.7/bin/pip", line 11, in <module>
sys.exit(main())
File "/home/jeff/Projects/ocr_env_py2.7/local/lib/python2.7/site-packages/pip/__init__.py", line 248, in main
return command.main(cmd_args)
File "/home/jeff/Projects/ocr_env_py2.7/local/lib/python2.7/site-packages/pip/basecommand.py", line 161, in main
text = '\n'.join(complete_log)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 42: ordinal not in range(128)
(ocr_env_py2.7)jeff@debian88:~/Projects/tesseract$ tesseract --version
tesseract 4.00.00alpha
leptonica-1.74.4
libgif 4.1.6(?) : libjpeg 6b (libjpeg-turbo 1.3.1) : libpng 1.2.50 : libtiff 4.0.3 : zlib 1.2.8
Found AVX
Found SSE
~~master seems to have some API files missing (version.h, for example), not sure why~~. If you want to use tesseract v4, I suggest installing one of the 4.00.00* tags (4.00.00dev is the latest right now) instead of master to get something that works.
Edit: missing files only when installing using cmake, regular install is fine (but tesserocr fails to compile with same error as reported).
I was able to install tesserocr after replacing string with std::string in /usr/local/include/tesseract/unicharset.h and /usr/local/include/tesseract/unichar.h.
@sirfz We can certainly hand-edit tesseract's include files to get tesserocr to compile, but I believe it should build without any hand edits, especially not to tesseract itself.
@zqy2084 Yes, of course. Since tesseract v4 is still an alpha version, installation can break at any moment, which is why you should install a release instead (i.e. 4.00.00alpha or 4.00.00dev). This hack is only for you if you really want to use the latest bleeding edge.
@sirfz This is quite unsettling because they fixed it less than 3 weeks ago, I'll check if I can reproduce the issue tonight and reopen an issue on their repo if I get a case.
@zqy2084 If you did get that error with this particular commit, the problem is with a previously installed version of Tesseract 4 on your system. I can assure you that it builds with that commit.
@Belval I installed the latest pull from the master branch. These changes were added in unichar.h on 2017-07-14 (tesseract-ocr/tesseract@da03e4e9105b6262706d40ef2b4436eae4ebe19f) and unicharset.h on 2017-07-24 (tesseract-ocr/tesseract@b0ead95d64a3667339775b2f99ac37e97e90c2a0).
@sirfz Yes, but I opened this issue and the next day they changed it to add -DUSING_STD_NAMESPACE as a fix. The commit I pointed to should work.
The fix didn't work for me @Belval:
$ python setup.py build_ext --inplace
Supporting tesseract v4.00.00dev
Configs from pkg-config: {'libraries': ['lept', 'tesseract'], 'cython_compile_time_env': {'TESSERACT_VERSION': 262144}, 'library_dirs': ['/usr/local/lib'], 'include_dirs': ['/usr/local/include']}
running build_ext
building 'tesserocr' extension
x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fno-strict-aliasing -Wdate-time -D_FORTIFY_SOURCE=2 -g -fdebug-prefix-map=/build/python2.7-ZZaKJ6/python2.7-2.7.13=. -fstack-protector-strong -Wformat -Werror=format-security -fPIC -I/usr/local/include -I/usr/include/python2.7 -c tesserocr.cpp -o build/temp.linux-x86_64-2.7/tesserocr.o -std=c++11 -DUSING_STD_NAMESPACE
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
In file included from tesserocr.cpp:529:0:
/usr/local/include/tesseract/unichar.h:164:10: error: ‘string’ does not name a type
static string UTF32ToUTF8(const std::vector<char32>& str32);
^~~~~~
In file included from /usr/local/include/tesseract/osdetect.h:24:0,
from tesserocr.cpp:541:
/usr/local/include/tesseract/unicharset.h:241:10: error: ‘string’ does not name a type
static string CleanupString(const char* utf8_str) {
^~~~~~
/usr/local/include/tesseract/unicharset.h:244:10: error: ‘string’ does not name a type
static string CleanupString(const char* utf8_str, int length);
^~~~~~
/usr/local/include/tesseract/unicharset.h: In member function ‘void UNICHARSET::unichar_insert_backwards_compatible(const char*)’:
/usr/local/include/tesseract/unicharset.h:265:53: error: ‘CleanupString’ was not declared in this scope
std::string cleaned = CleanupString(unichar_repr);
^
error: command 'x86_64-linux-gnu-gcc' failed with exit status 1
As you can see, -DUSING_STD_NAMESPACE is used but doesn't help. Is there something I'm missing?
@sirfz Ok, I'll report back with results
@sirfz @zqy2084 Can confirm that it is broken again... I'll report back when I find a decent fix.
I had the same error for a different reason:
running install
running build
running build_ext
building 'tesserocr' extension
creating build
creating build/temp.linux-x86_64-2.7
x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fno-strict-aliasing -Wdate-time -D_FORTIFY_SOURCE=2 -g -fstack-protector-strong -Wformat -Werror=format-security -fPIC -I/usr/local/include -I/usr/include/python2.7 -c tesserocr.cpp -o build/temp.linux-x86_64-2.7/tesserocr.o -std=c++11
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
tesserocr.cpp: In function ‘tesseract::TessResultRenderer* __pyx_f_9tesserocr_13PyTessBaseAPI__get_renderer(__pyx_obj_9tesserocr_PyTessBaseAPI*, __pyx_t_9tesseract_cchar_t*)’:
tesserocr.cpp:21042:124: error: no matching function for call to ‘tesseract::TessPDFRenderer::TessPDFRenderer(__pyx_t_9tesseract_cchar_t*&, const char*, bool&)’
__pyx_t_7 = new tesseract::TessPDFRenderer(__pyx_v_outputbase, __pyx_v_self->_baseapi.GetDatapath(), __pyx_v_textonly);
^
In file included from tesserocr.cpp:539:0:
/usr/local/include/tesseract/renderer.h:189:3: note: candidate: tesseract::TessPDFRenderer::TessPDFRenderer(const char*, const char*)
TessPDFRenderer(const char *outputbase, const char *datadir);
^
/usr/local/include/tesseract/renderer.h:189:3: note: candidate expects 2 arguments, 3 provided
/usr/local/include/tesseract/renderer.h:185:16: note: candidate: tesseract::TessPDFRenderer::TessPDFRenderer(const tesseract::TessPDFRenderer&)
class TESS_API TessPDFRenderer : public TessResultRenderer {
^
/usr/local/include/tesseract/renderer.h:185:16: note: candidate expects 1 argument, 3 provided
/usr/local/include/tesseract/renderer.h:185:16: note: candidate: tesseract::TessPDFRenderer::TessPDFRenderer(tesseract::TessPDFRenderer&&)
/usr/local/include/tesseract/renderer.h:185:16: note: candidate expects 1 argument, 3 provided
tesserocr.cpp: In function ‘PyObject* __pyx_pf_9tesserocr_13PyTessBaseAPI_96DetectOrientationScript(__pyx_obj_9tesserocr_PyTessBaseAPI*)’:
tesserocr.cpp:23345:39: error: ‘class tesseract::TessBaseAPI’ has no member named ‘DetectOrientationScript’
__pyx_t_1 = (__pyx_v_self->_baseapi.DetectOrientationScript((&__pyx_v_orient_deg), (&__pyx_v_orient_conf), (&__pyx_v_script_name), (&__pyx_v_script_conf)) != 0);
^
tesserocr.cpp: In function ‘void inittesserocr()’:
tesserocr.cpp:34528:69: error: ‘OEM_TESSERACT_LSTM_COMBINED’ is not a member of ‘tesseract’
__pyx_t_1 = __Pyx_PyInt_From_enum__tesseract_3a__3a_OcrEngineMode(tesseract::OEM_TESSERACT_LSTM_COMBINED); if (unlikely(!__pyx_t_1)) __PYX_ERR(0, 83, __pyx_L1_error)
^
error: command 'x86_64-linux-gnu-gcc' failed with exit status 1
4.00.00alpha and 4.00.00dev releases aren't supported any more, you'll need to compile a newer version. If the issue reported here hasn't been fixed already in master then I suggest installing tesseract-ocr/tesseract@f5c18f78c09ab028791c28c638d5cc2f96c6d6fb instead (last commit before the problem was introduced).
The correct directive is -DUSE_STD_NAMESPACE (see tesseract-ocr/tesseract#1045). You should be able to install the current tesserocr version by adding a CPPFLAGS=-DUSE_STD_NAMESPACE prefix to the installation command.
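For scripted installs, the same prefix can be applied from Python. This is only a minimal sketch: the helper names `env_with_cppflag` and `install_tesserocr` are made up for illustration and are not part of tesserocr or pip. It appends the flag to any CPPFLAGS already set (for example the -I/usr/local/include used earlier in this thread) instead of overwriting it.

```python
import os
import subprocess
import sys


def env_with_cppflag(flag="-DUSE_STD_NAMESPACE", base_env=None):
    """Return a copy of the environment with `flag` appended to CPPFLAGS.

    Hypothetical helper for illustration; `base_env` defaults to os.environ.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Keep any existing CPPFLAGS and add the new flag after them.
    env["CPPFLAGS"] = " ".join(filter(None, [env.get("CPPFLAGS", ""), flag]))
    return env


def install_tesserocr():
    """Run `pip install tesserocr` with the extra preprocessor define."""
    subprocess.check_call(
        [sys.executable, "-m", "pip", "install", "tesserocr"],
        env=env_with_cppflag(),
    )
```

Calling `install_tesserocr()` should be equivalent to typing `CPPFLAGS=-DUSE_STD_NAMESPACE pip install tesserocr` in a shell that has no CPPFLAGS set.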
I fixed it this way:
- git clone this source code
- pip install .
Thanks @sirfz. At last I was able to build tesserocr correctly by replacing string with std::string in /usr/local/include/tesseract/unicharset.h and /usr/local/include/tesseract/unichar.h.
Tesseract is now at beta 1 stage.
This version uses std::string everywhere. USE_STD_NAMESPACE is not needed anymore.
@sirfz, you can revert c8464d13a.
Thanks @amitdo, reopened until this is fixed
So I got this working with the 4.00-beta. All I ended up doing was changing 0x040000 to 0x04 in setup.py, tesseract.pxd and tesserocr.pyx, and removing the -DUSE_STD_NAMESPACE flag from setup.py. It then compiled correctly, giving only the -Wstrict-prototypes warning. pip install . was successful, and tesserocr imports and works.
The only difference seems to be that I need to pass path='/mypathtotessdata/' to the PyTessBaseAPI context manager.
edit: on second thought, this probably breaks installation for previous versions...
edit2: the version_to_int function in setup.py finds a version string of 4.0.0, which would get turned into 1024, i.e. 0x0400. So I'm guessing 0x0400 would be the correct value for all those expressions that check the version?
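The `path` argument mentioned in the comment above can be passed as in the sketch below. The wrapper name `ocr_with_explicit_tessdata`, the `lang="eng"` default and the image path are assumptions for illustration; `/mypathtotessdata/` is the placeholder from the comment and has to point at a real tessdata directory.

```python
def ocr_with_explicit_tessdata(image_path, tessdata_dir="/mypathtotessdata/", lang="eng"):
    """Run OCR on one image, telling Tesseract where its tessdata lives.

    Hypothetical wrapper for illustration only.
    """
    # Imported lazily so this file can be loaded before tesserocr is built.
    from tesserocr import PyTessBaseAPI

    with PyTessBaseAPI(path=tessdata_dir, lang=lang) as api:
        api.SetImageFile(image_path)
        return api.GetUTF8Text()
```

Usage would look like `text = ocr_with_explicit_tessdata("sample.png")`, where `sample.png` is any image on disk.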
@amitdo tesserocr compiled correctly for me without any changes against beta.1 (checked out release tesseract-ocr/tesseract@40f43111e05b3dd2f2f8aeae3aba33016523c881). Is this a change after this release?
@IntegralTriad changing the version number to 0x04 is not a solution and actually breaks compilation against 3.x. Also, version_to_int returns 262144 (for 4 beta.1 as well as alpha), which equates to 0x40000.
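The arithmetic can be checked with a small sketch. `version_to_int_sketch` below is a hypothetical stand-in, not the actual version_to_int from setup.py: it assumes the major version is shifted left by 16 bits and the minor by 8, which reproduces the 262144 seen in the build logs. The comparison at the end also assumes the version checks have the form `TESSERACT_VERSION >= 0x040000`.

```python
import re


def version_to_int_sketch(version):
    """Pack 'major.minor.patch' into one integer (assumed layout, see above)."""
    # Keep at most the first three numeric fields; suffixes such as
    # "alpha" or "-beta.1" are ignored.
    fields = [int(n) for n in re.findall(r"\d+", version)[:3]]
    fields += [0] * (3 - len(fields))
    major, minor, patch = fields
    return (major << 16) | (minor << 8) | patch


v4 = version_to_int_sketch("4.00.00alpha")
print(v4, hex(v4))        # 262144 0x40000

v3 = version_to_int_sketch("3.05.01")
print(v3 >= 0x040000)     # False: 3.x stays below the v4 threshold
print(v3 >= 0x04)         # True: with 0x04, 3.x would wrongly pass a v4 check
```

Under these assumptions, lowering the constant to 0x04 or 0x0400 makes every 3.x build look like v4 to such checks, which matches the breakage described above.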
https://github.com/ropensci/tesseract/issues/24#issuecomment-373133986
Is this a change after this release?
Yes.
I didn't say your code will fail to compile with newer tesseract... Still, c8464d1 is not necessary anymore.