jusText
UnicodeDecodeError when crawling pages
I've installed jusText on a Windows Server 2012 machine, and it seems to be running fine overall. However, about 30-40% of the HTML files crash because of encoding issues. The error message I get is:
  File "c:\python32\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 20410: character maps to <undefined>
and the output TXT file is then empty for the HTML file that I'm trying to run jusText on.
An example of a page that causes the crash: http://www.democracynow.org/2012/7/6/peru_declares_state_of_emergency_as (byte position 20410, the word GONZÁLEZ). I've saved a copy of the file that I'm trying to run jusText on at:
I've tried every possible combination of
--encoding=...
--enc-force
--enc-errors=...
as well as every possible encoding on the files, and it's still crashing on these files. Any suggestions?
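For reference, here is a minimal sketch that reproduces the same error outside jusText. It assumes the saved page is UTF-8 encoded, which has not been confirmed; under that assumption the Á in GONZÁLEZ is stored as the two bytes 0xC3 0x81, and 0x81 has no mapping in Python's cp1252 codec. The helper name try_decode is made up for illustration and is not part of jusText.

```python
# Assumption: the page is UTF-8, so "Á" (U+00C1) is stored as bytes 0xC3 0x81.
raw = "GONZ\u00c1LEZ".encode("utf-8")


def try_decode(data, encoding):
    """Hypothetical helper: return the decoded text, or the error message."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return "error: {}".format(exc)


# ascii() keeps the output printable on consoles that cannot show "Á".
print(ascii(raw))                        # the raw bytes, including \x81
print(ascii(try_decode(raw, "cp1252")))  # fails on byte 0x81
print(ascii(try_decode(raw, "utf-8")))   # decodes cleanly
```

If this matches what is happening, the traceback through cp1252.py would mean the file is being decoded with the Windows default code page at some point rather than with the encoding given on the command line.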
Thanks so much for your help.
Mark Davies, mark_davies (at) byu.edu
Professor of Linguistics / Brigham Young University
http://davies-linguistics.byu.edu/
** Corpus design and use // Linguistic databases **
** Historical linguistics // Language variation **
** English, Spanish, and Portuguese **