
issue when handling UTF-8

sasinda opened this issue · 3 comments

Though the StanfordCoreNLPServer output is UTF-8, the response is interpreted as ISO-8859-1 (corenlp.py, line 13).

Server was started using: java -Xmx6g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer 9000 -encoding utf-8
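
For reference, a minimal client-side sketch of how the text can be sent to that server with py-corenlp (the annotator list and property names here are my own choices, not taken from this report):

```python
# Minimal repro sketch: assumes the server started above is listening on port 9000.
from pycorenlp import StanfordCoreNLP

nlp = StanfordCoreNLP('http://localhost:9000')
text = u"apatheism — which describes individuals no question there’s a connection"
output = nlp.annotate(text, properties={
    'annotators': 'tokenize,ssplit,pos',
    'outputFormat': 'json',
})
# Because the response is decoded as ISO-8859-1, tokens such as "’s" and "—"
# come back garbled in the JSON shown below.
```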

input

apatheism — which describes individuals no question there’s a connection

output snippets

Notice how '’s' comes back garbled:

```json
{
  "index": 9, "word": "'s", "originalText": "’s", "characterOffsetBegin": 57, "characterOffsetEnd": 59, "pos": "VBZ", "before": "", "after": " "
}
```

'—' is also jumbled:

```json
{
  "index": 2, "word": "--", "originalText": "—", "characterOffsetBegin": 10, "characterOffsetEnd": 11, "pos": ":", "before": " ", "after": " "
}
```

The server also makes the mistake of setting the header Content-Type: text/json instead of Content-Type: application/json; charset=utf-8.
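
For context, a sketch of why that header matters: when a text/* response carries no charset, requests falls back to ISO-8859-1, so the server's UTF-8 bytes are mis-decoded unless you override r.encoding yourself. The URL and properties string below are placeholders, roughly mimicking what corenlp.py does internally:

```python
import requests

# A raw request against the server, similar to what corenlp.py sends.
r = requests.post(
    'http://localhost:9000',
    params={'properties': "{'annotators': 'tokenize', 'outputFormat': 'json'}"},
    data=u"there’s".encode('utf-8'))

print(r.encoding)     # likely 'ISO-8859-1': the header says text/json with no charset
r.encoding = 'utf-8'  # override before touching r.text / parsing the JSON
print(r.text)
```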

sasinda · May 19 '16

Just released an update after merging in #6. Does it work for you now?

smilli · May 19 '16

Hi, thank you very much for the quick update. I didn't check it, since I changed the annotate method internally so that I can override the encoding; by default it falls back to the server's response content type (which is ASCII at the moment). I think this might be a better solution, as people may need other encoding types.

```python
def annotate(self, text, properties=None, encoding=None):
    '''Set encoding to override how the response is interpreted.'''
    if not properties:
        properties = {}
    r = requests.get(
        self.server_url,
        params={'properties': str(properties)},
        data=text)
    if encoding:
        r.encoding = encoding
    # ... rest unchanged (json.loads on r.text, etc.)
```
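
A call against that patched method would then look something like this (the property values are just an example):

```python
output = nlp.annotate(text, properties={
    'annotators': 'tokenize,ssplit,pos',
    'outputFormat': 'json',
}, encoding='utf-8')  # force the response to be read as UTF-8
```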

sasinda · May 21 '16

Hi, I think the problem is not the output. The problem is that non-ASCII characters are escaped in data = text.encode() (line 25 in corenlp.py) and the CoreNLP server does not handle this properly. For example, if I post the sentence "Mr. Miller went to Bremen.", Miller is correctly annotated as PERSON and Bremen as LOCATION. But if I send the sentence "Mr. Miller went to Köln.", Köln is not recognized as a LOCATION but as a PERSON(!), no matter which encoding I use. So in fact, even after figuring out how to use a different language model and add an encoding:

```python
output = nlp.annotate(text, properties={
    'annotators': 'ner',
    'outputFormat': 'json',
    'ner.model': 'edu/stanford/nlp/models/ner/german.hgc_175m_600.crf.ser.gz',
}, encoding='utf-8')
```

I cannot correctly parse or recognize named entities that contain non-ASCII characters. 'Herr Miller fuhr nach Köln.' results in Miller being tagged as PERSON and Köln as no entity, and the output is a string, not a dict (since json.loads in corenlp.py, line 34, fails), whereas "Herr Miller fuhr nach Bremen." works!
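
For reference, this is roughly what an explicit UTF-8 request by hand with requests would look like (an untested sketch, not a confirmed fix; the model path is copied from the call above):

```python
import requests

text = u'Herr Miller fuhr nach Köln.'
props = ("{'annotators': 'ner', 'outputFormat': 'json', "
         "'ner.model': 'edu/stanford/nlp/models/ner/german.hgc_175m_600.crf.ser.gz'}")

r = requests.post(
    'http://localhost:9000',
    params={'properties': props},
    data=text.encode('utf-8'),                               # send raw UTF-8 bytes
    headers={'Content-Type': 'text/plain; charset=utf-8'})   # and declare the charset
r.encoding = 'utf-8'
print(r.json())
```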

Any ideas??

Thanks, Johanna

Edit: I changed the language model to german.dewac_175m_600.crf.ser.gz and changed the output to text, and now I get some good results!
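
In py-corenlp terms, the working call would be something like this (the model path is assumed to sit next to the hgc model mentioned earlier):

```python
output = nlp.annotate(text, properties={
    'annotators': 'ner',
    'outputFormat': 'text',  # plain text output instead of json
    'ner.model': 'edu/stanford/nlp/models/ner/german.dewac_175m_600.crf.ser.gz',
})
print(output)
```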

jogeiss · Jun 29 '16