
UnicodeDecodeError while parsing PDF files

Open adityardesai opened this issue 9 years ago • 4 comments

Hi

I am using the NLTKRest server to parse a few PDF files from Polar Trec Data and extract the required NER quantities. For most of the PDF files, however, I see the following error from the REST server:

"UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 8: ordinal not in range(128) // Werkzeug Debugger "

The command used is: curl -X POST -d "PDF TEXT in STRING" http://localhost:8888/nltk

The error file is attached as well: nltkrest.txt
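For context, the message above is what Python raises when bytes containing UTF-8 encoded non-ASCII characters are decoded with the 'ascii' codec. The following is a minimal sketch of that failure, not code from NLTKRest: the sample bytes and the try_decode helper are made up for illustration, and the sample is built so that byte 0xc3 sits at position 8, as in the report.

```python
# Hypothetical sample: eight ASCII bytes followed by UTF-8 encoded "é" (0xc3 0xa9).
raw = b"Some caf\xc3\xa9"


def try_decode(data, encoding):
    """Return (text, None) on success or (None, error) on failure."""
    try:
        return data.decode(encoding), None
    except UnicodeDecodeError as err:
        return None, err


# The 'ascii' codec rejects any byte >= 0x80, so this attempt fails at 0xc3.
ascii_text, ascii_err = try_decode(raw, "ascii")

# Decoding the same bytes as UTF-8 succeeds.
utf8_text, utf8_err = try_decode(raw, "utf-8")

print(ascii_err)
print(utf8_text)
```

In this sketch the failing byte is the first byte of a two-byte UTF-8 sequence, which suggests (but does not prove) that the text extracted from the PDFs is UTF-8 and is being decoded as ASCII somewhere in the server.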

adityardesai avatar Apr 23 '16 22:04 adityardesai

Yes, that's true, @adityardesai. You might want to use this patch until it's merged: https://github.com/chrismattmann/NLTKRest/pull/7. Alternatively, you could simply build the 'encoding-issue' branch from source.

manalishah avatar Apr 23 '16 22:04 manalishah

Thanks for letting us know, @manalishah. I tried the patch, but I still see the same error. Am I missing any steps apart from adding tokenized = nltk.word_tokenize(content.decode("utf-8")) to server.py? Are there any specific build commands to run?
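As a rough illustration of the decode-before-tokenize idea in that one-line change, here is a sketch under stated assumptions. The to_text helper is hypothetical and not part of NLTKRest, and str.split stands in for nltk.word_tokenize so the example runs without NLTK installed. It decodes only when the input is still bytes and replaces malformed sequences instead of raising.

```python
def to_text(content, encoding="utf-8"):
    """Return content as str (hypothetical helper, not from NLTKRest).

    Bytes are decoded with the given encoding; malformed sequences become
    U+FFFD instead of raising. Values that are already str pass through.
    """
    if isinstance(content, bytes):
        return content.decode(encoding, errors="replace")
    return content


# Stand-in for nltk.word_tokenize: plain whitespace splitting.
tokens = to_text(b"Some caf\xc3\xa9").split()

# A byte that is not valid UTF-8 is replaced rather than causing an error.
damaged = to_text(b"bad \xff byte")

print(tokens)
print(damaged)
```

Whether replacing malformed bytes is acceptable depends on the data; errors="strict" would surface them instead.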

adityardesai avatar Apr 24 '16 00:04 adityardesai

Can you upload one of the PDF files that gives you this error, @adityardesai? Then I can replicate the issue and try to resolve it.

manalishah avatar Apr 24 '16 00:04 manalishah

Sure, @manalishah. The sample file is attached. I just added tokenized = nltk.word_tokenize(content.decode("utf-8")) to server.py and re-ran the REST server, but I still get the same error. Sample.pdf

adityardesai avatar Apr 24 '16 00:04 adityardesai