python-spark-tutorial
dashes in word_count.txt cause errors with WordCount.py
Issue:
The ndash characters in word_count.txt cause an error when following the "Run your first Spark Job" tutorial. There are only two occurrences of this character, here: "from 1913–74." and here: "near–bankruptcy".
To Recreate:
Using spark-2.3.2-bin-hadoop2.7 on Ubuntu 18 with pyspark/Python 2.7 (installed following the instructions from lecture 5), go to the directory where you cloned python-spark-tutorial and run the following command from lecture 6:
spark-submit ./rdd/WordCount.py
The execution halts about halfway through the frequency counter with the following error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 4: ordinal not in range(128)
Spoiler: it's the dash. I'm not sure whether the Unicode en-dash (U+2013) in the text file was intentional, so I'm posting.
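For anyone debugging the same thing: the crash happens when Python 2's print implicitly encodes the Unicode string with the ASCII codec. The root cause can be reproduced directly (a minimal sketch using one of the offending tokens from word_count.txt; the en-dash sits at index 4, matching the reported error):

```python
# The en-dash (U+2013) has no ASCII representation, so the implicit
# ASCII encode that Python 2's print performs raises UnicodeEncodeError.
text = u"1913\u201374."

try:
    text.encode("ascii")
except UnicodeEncodeError as exc:
    # Message resembles: 'ascii' codec can't encode character ... in position 4
    print(exc)

# Encoding the same string as UTF-8 succeeds, which is why the
# work-arounds below all amount to "stop using the ASCII codec".
utf8_bytes = text.encode("utf-8")
print(utf8_bytes)
```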
Work-Around:
I changed the two ndash characters to "from 1913-74." and "near-bankruptcy", which solved the issue for me. There is a related Stack Overflow thread where someone else ran into a similar problem with Python 2.7 and used the same solution.
Alternatively, just adding the following at the top of WordCount.py resolves the issue (Python 2 only, since sys.setdefaultencoding was removed in Python 3):
import sys
reload(sys)
sys.setdefaultencoding("utf-8")
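A narrower alternative to the global setdefaultencoding hack is to encode each word explicitly before printing, so only the output line changes. This is a sketch, not the tutorial's exact code (the variable names are hypothetical); in Python 2 you would print the encoded byte string directly, e.g. print("%s : %i" % (word.encode("utf-8"), count)), while the snippet below shows the same encode step in a form that also runs on Python 3:

```python
# Encode the word to UTF-8 bytes ourselves instead of relying on the
# default ASCII codec. UTF-8 can represent the en-dash, so this never raises.
word = u"near\u2013bankruptcy"  # hypothetical value from the word-count output
count = 1

encoded = word.encode("utf-8")  # bytes; safe for any Unicode text
print(u"{} : {}".format(encoded.decode("utf-8"), count))
```

Since only the print statement is touched, the rest of the Spark job is unchanged and no interpreter-wide default is modified.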