
dashes in word_count.txt cause errors with WordCount.py

Open HarryCaveMan opened this issue 6 years ago • 1 comments

Issue:

The en dash characters in word_count.txt cause an error when following the "Run your first Spark Job" tutorial. There are only two occurrences of this character: "from 1913–74." and "near–bankruptcy".

To Recreate:

Using spark-2.3.2-bin-hadoop2.7 on Ubuntu 18 with pyspark/Python 2.7, installed following the instructions from lecture 5, go to the directory where you cloned python-spark-tutorial and run the following command from lecture 6:

spark-submit ./rdd/WordCount.py

The execution halts about halfway through the frequency counter with the following error:

UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 4: ordinal not in range(128)

Spoiler: it's the dash. I'm not sure whether the Unicode en dash (U+2013) was intentional, so I'm posting.
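The failure can be reproduced without Spark. The following is a minimal sketch, assuming the error comes from Python 2 implicitly encoding a unicode word to ASCII (for example while formatting or printing it). The variable names are illustrative and are not taken from WordCount.py.

```python
# One of the two affected words, written with an escape so the
# snippet itself stays pure ASCII.
word = u"near\u2013bankruptcy"

position = None
offending = None
try:
    # Explicitly request the ASCII encoding that Python 2 falls back to.
    word.encode("ascii")
except UnicodeEncodeError as exc:
    position = exc.start            # index of the first unencodable character
    offending = word[exc.start]     # the character itself

print(position)          # 4
print(repr(offending))   # the en dash, U+2013
```

The en dash sits at index 4 in both "near–bankruptcy" and "1913–74.", which matches the position reported in the traceback.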

Work-Around:

I changed the two en dash characters to plain hyphens, giving "from 1913-74." and "near-bankruptcy", which solved the issue for me. There is a related Stack Overflow thread where someone else ran into a similar problem with Python 2.7 and used the same solution.
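The same substitution could also be done in code rather than by editing the file by hand. This is a rough sketch only; `replace_en_dashes` is a hypothetical helper, not part of the tutorial repository, and the sample strings are the two fragments quoted above.

```python
EN_DASH = u"\u2013"

def replace_en_dashes(text):
    """Return text with every en dash swapped for an ASCII hyphen."""
    return text.replace(EN_DASH, u"-")

# The two fragments from word_count.txt that contain an en dash.
samples = [u"from 1913\u201374.", u"near\u2013bankruptcy"]
cleaned = [replace_en_dashes(s) for s in samples]

for s in cleaned:
    # After the replacement, ASCII encoding no longer raises.
    s.encode("ascii")
    print(s)
```

Note that this changes the words being counted, just as the manual edit does, so it is only appropriate when the dash itself is not significant.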

HarryCaveMan, Nov 04 '18 05:11

Adding the following at the top of the script resolves the issue (Python 2 only):

import sys
reload(sys)
sys.setdefaultencoding("utf-8")
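`sys.setdefaultencoding` does not exist in Python 3 and changes behaviour process-wide, so an explicit encode at the point of output is another option. A minimal sketch, assuming the script formats each word and its count into one output line; `format_count` is a hypothetical helper and the " : " separator is an assumption.

```python
def format_count(word, count):
    """Build one output line as UTF-8 bytes without relying on the default codec."""
    # A unicode format string keeps the result unicode in Python 2,
    # so no implicit ASCII encoding takes place.
    line = u"{} : {}".format(word, count)
    return line.encode("utf-8")

encoded = format_count(u"near\u2013bankruptcy", 1)
print(repr(encoded))
```

In Python 2 the returned byte string can be passed to `print` directly; in Python 3 it would be decoded again or written to a binary stream.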

kashikhan1, Nov 24 '18 22:11