cnn-dailymail
Fixing utf-8 encoding bug
Hi, while running the original make_datafiles.py on my system, I get the following error:
Making bin file for URLs listed in url_lists/all_test.txt...
Writing story 0 of 11490; 0.00 percent done
Traceback (most recent call last):
  File "make_datafiles.py", line 256, in <module>
    write_to_tar(all_test_urls, os.path.join(finished_files_dir, "test.tar"))
  File "make_datafiles.py", line 185, in write_to_tar
    article_sents, abstract_sents = get_art_abs(story_file)
  File "make_datafiles.py", line 109, in get_art_abs
    lines = read_story_file(story_file)
  File "make_datafiles.py", line 78, in read_story_file
    contents = f.read()
  File "/home/sampanna/.conda/envs/fast_abs_rl/lib/python3.6/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 858: ordinal not in range(128)
This PR fixes that.
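
For context, the traceback shows `f.read()` decoding through the `ascii` codec. That typically happens when a file is opened in text mode without an explicit `encoding` and the system locale defaults to ASCII. Byte `0xe2` is the lead byte of many three-byte UTF-8 sequences, such as the right single quotation mark (U+2019). The snippet below is a minimal sketch of the general remedy, assuming the story files are UTF-8 encoded. The helper name `read_text_file` and the sample file are hypothetical, and the snippet is not the actual diff in this PR.

```python
import os
import tempfile


def read_text_file(path):
    """Return the lines of a text file, decoding it explicitly as UTF-8.

    Hypothetical helper: passing encoding="utf-8" makes the result
    independent of the system locale, which is what triggers the
    'ascii' codec error when the locale default is ASCII.
    """
    with open(path, "r", encoding="utf-8") as f:
        return f.read().splitlines()


# Build a small sample file containing a non-ASCII character.
# U+2019 encodes to the bytes e2 80 99 in UTF-8, so its first byte
# is the same 0xe2 reported in the traceback.
sample_text = "It\u2019s a test\nsecond line\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    story_path = os.path.join(tmp_dir, "sample.story")
    with open(story_path, "wb") as f:
        f.write(sample_text.encode("utf-8"))

    # Decoding as ASCII reproduces the failure from the traceback.
    try:
        with open(story_path, "r", encoding="ascii") as f:
            f.read()
        ascii_failed = False
    except UnicodeDecodeError:
        ascii_failed = True

    # Decoding as UTF-8 succeeds.
    lines = read_text_file(story_path)
```

Setting the encoding in the `open()` call keeps the fix inside the script, so it does not depend on each user's locale or environment variables.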