
UnicodeDecodeError when crawling pages

Open miso-belica opened this issue 8 years ago • 0 comments

I've installed JusText on a Windows 2012 Server machine and it seems to be running fine overall. However, about 30-40% of the HTML files crash because of encoding issues. The error message I get is:

  File "c:\python32\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 20410: character maps to <undefined>

and then the TXT output for the HTML file I'm running JusText on is empty.

An example of a page that causes the crash: http://www.democracynow.org/2012/7/6/peru_declares_state_of_emergency_as (byte position 20410, in the word GONZÁLEZ). I've saved a copy of the file that I'm running JusText on at:

I've tried every possible combination of

--encoding=... --enc-force
--enc-errors=...

as well as every possible encoding on the files, and it's still crashing on these files. Any suggestions?
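For anyone trying to reproduce this, here is a minimal sketch of what the traceback seems to point at. It assumes the page is UTF-8 encoded (not confirmed above), in which case the "Á" in GONZÁLEZ is stored as the two bytes 0xC3 0x81, and 0x81 has no mapping in Python's cp1252 table. The helper name `try_decode` is made up for illustration and is not part of JusText.

```python
# Assumption: the page is UTF-8, so "Á" (U+00C1) is stored as bytes 0xC3 0x81.
raw = "GONZ\u00c1LEZ".encode("utf-8")


def try_decode(data, encoding):
    """Hypothetical helper: return the decoded text, or None if decoding fails."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# Byte 0x81 is undefined in Python's cp1252 codec, so strict decoding fails.
as_cp1252 = try_decode(raw, "cp1252")

# Decoding the same bytes as UTF-8 succeeds.
as_utf8 = try_decode(raw, "utf-8")

# With errors="replace" the cp1252 codec does not raise; it substitutes
# U+FFFD for the unmappable byte, so the text is damaged but not lost.
lenient = raw.decode("cp1252", errors="replace")

print(as_cp1252, as_utf8, repr(lenient))
```

If that assumption holds, the failure is probably happening where the file is opened in text mode with the Windows default encoding (cp1252), before any JusText encoding option is applied. Reading the file as bytes, or opening it with an explicit `encoding="utf-8"`, would be the thing to try.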

Thanks so much for your help.

Mark Davies, mark_davies (at) byu.edu Professor of Linguistics / Brigham Young University http://davies-linguistics.byu.edu/

** Corpus design and use // Linguistic databases ** ** Historical linguistics // Language variation ** ** English, Spanish, and Portuguese **

miso-belica, Jun 10 '16 13:06