Character encoding problem

Open mcavdar opened this issue 7 years ago • 1 comment

Hello,

I'm trying to process a French corpus and keep getting results with an encoding problem: words such as 'américaine' come out garbled, even after I changed the LC_ALL variable in the 'shell/deepdive' file appropriately for French.

  • In the articles table there is no encoding problem. It appears after the CoreNLP process, in the sentences table.
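For readers unfamiliar with this kind of garbling, here is a minimal sketch of one common cause: UTF-8 bytes decoded with a single-byte codec such as Latin-1. That this is the exact mechanism in this issue is an assumption, and the helper names `garble` and `repair` are hypothetical, not part of DeepDive.

```python
# Sketch only: shows how UTF-8 text is corrupted when its bytes are
# decoded with the wrong codec. Latin-1 is an assumed culprit here.

def garble(text, wrong_codec="latin-1"):
    """Encode text as UTF-8, then decode the bytes with the wrong codec."""
    return text.encode("utf-8").decode(wrong_codec)

def repair(garbled, wrong_codec="latin-1"):
    """Undo garble() by reversing the wrong decode step."""
    return garbled.encode(wrong_codec).decode("utf-8")

word = "américaine"
broken = garble(word)

# ascii() escapes non-ASCII characters, so the output is readable on any
# terminal: the single character é shows up as two characters.
print(ascii(broken))
print(repair(broken) == word)
```

If the text in the sentences table round-trips through `repair` like this, the data was probably decoded once with the wrong codec rather than lost.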

Any kind of help would be greatly appreciated. Thanks

Edit: I added some debug lines to the tsj2corenlp-http-reqs file and found that every sentence already has the encoding problem before the CoreNLP request is made.

Edit 2: I've traced it back to database/db-driver/postgresql/db-query-tsj. I think the problem is in the psycopg2 module.

Edit 3: I think the problem is Python 2. When I ran the query with psycopg2 under Python 3, the result had no encoding problem. But after modifying database/db-driver/postgresql/db-query-tsj for Python 3 (#!/usr/bin/env python -> #!/usr/bin/env python3), I get another error:

...
2017-05-25 15:00:20.582789 Loading dd_tmp_sentences from /home/mc/quaer-encode/run/process/ext_sentences_by_nlp_markup/deepdive-compute-execute.la1vGzv/output_computed-1 (tsj format)
2017-05-25 15:00:20.669839 Traceback (most recent call last):
2017-05-25 15:00:20.669914 File "/home/mc/local/util/db-driver/postgresql/db-query-tsj", line 6, in <module>
2017-05-25 15:00:20.669939 import psycopg2, psycopg2.extras, ujson
2017-05-25 15:00:20.669961 File "/home/mc/local/lib/bundled/python-lib/prefix/lib/python2.7/site-packages/psycopg2/__init__.py", line 50, in <module>
2017-05-25 15:00:20.669981 from psycopg2._psycopg import ( # noqa
2017-05-25 15:00:20.670002 ImportError: /home/mc/local/lib/bundled/python-lib/prefix/lib/python2.7/site-packages/psycopg2/_psycopg.so: undefined symbol: PyUnicodeUCS4_DecodeUTF8
2017-05-25 15:00:20.690613 /home/mc/local/util/compute-driver/local/compute-execute: ligne 129 : kill: (10546) - No such process
2017-05-25 15:00:20.693066 [ERROR] deepdive-unload: PID 10546: finished with non-zero exit status (1)
...

I don't know why it tries to use the bundled Python 2.7 packages. Any idea?
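A possible way to investigate, sketched under assumptions: `PyUnicodeUCS4_DecodeUTF8` is a symbol exported by "wide" (UCS4) builds of Python 2, so the error suggests that the python3 interpreter is still picking up the psycopg2 extension compiled for the bundled Python 2.7, perhaps because that site-packages directory is on its module search path (for example through PYTHONPATH). Whether DeepDive sets the path this way is not confirmed in this thread. The helper name `describe_interpreter` is hypothetical.

```python
import sys

def describe_interpreter():
    """Collect facts that help explain a mismatched-extension ImportError."""
    return {
        # Which interpreter binary is actually running the script.
        "executable": sys.executable,
        "version": "%d.%d" % sys.version_info[:2],
        # 0x10ffff on Python 3.3+ and on wide (UCS4) Python 2 builds,
        # 0xffff on narrow (UCS2) Python 2 builds.
        "max_unicode": hex(sys.maxunicode),
        # Search-path entries that appear to belong to a Python 2 install;
        # under python3 this list would ideally be empty.
        "python2_paths": [p for p in sys.path if "python2" in p],
    }

if __name__ == "__main__":
    for key, value in describe_interpreter().items():
        print(key, "=", value)
```

Running this with the same interpreter and environment that db-query-tsj runs under would show whether the python2.7 site-packages directory from the traceback is on the search path.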

mcavdar, May 24 '17 13:05

I'd retitle this "Deepdive character encoding problem". As you've already determined, the problem isn't with CoreNLP, which handles French and character encodings just fine….

manning, Jun 11 '17 15:06