markovify
Test fails straight out of the box
I've cloned the repository and tried running the unittest test.test_itertext. This test doesn't require setting up the sherlock model. It reads the text files that come with the package and builds the models inside the test, so I didn't supply any input of my own. The error I keep getting is this:
(base) C:\Users\JGC\Desktop\Trabalhos\Python\markovify>python -m unittest test.test_itertext
EE.E
======================================================================
ERROR: test_from_json_without_retaining (test.test_itertext.MarkovifyTest)
----------------------------------------------------------------------
Traceback (most recent call last):
File "C:\Users\JGC\Desktop\Trabalhos\Python\markovify\test\test_itertext.py", line 25, in test_from_json_without_retaining
original_model = markovify.Text(f, retain_original=False)
File "C:\Users\JGC\Desktop\Trabalhos\Python\markovify\markovify\text.py", line 53, in __init__
parsed = parsed_sentences or self.generate_corpus(input_text)
File "C:\Users\JGC\Desktop\Trabalhos\Python\markovify\markovify\text.py", line 152, in generate_corpus
for line in text:
File "C:\Users\JGC\anaconda3\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 3552: character maps to <undefined>
======================================================================
ERROR: test_from_mult_files_without_retaining (test.test_itertext.MarkovifyTest)
----------------------------------------------------------------------
Traceback (most recent call last):
File "C:\Users\JGC\Desktop\Trabalhos\Python\markovify\test\test_itertext.py", line 37, in test_from_mult_files_without_retaining
models.append(markovify.Text(f, retain_original=False))
File "C:\Users\JGC\Desktop\Trabalhos\Python\markovify\markovify\text.py", line 53, in __init__
parsed = parsed_sentences or self.generate_corpus(input_text)
File "C:\Users\JGC\Desktop\Trabalhos\Python\markovify\markovify\text.py", line 152, in generate_corpus
for line in text:
File "C:\Users\JGC\anaconda3\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 3552: character maps to <undefined>
======================================================================
ERROR: test_without_retaining (test.test_itertext.MarkovifyTest)
----------------------------------------------------------------------
Traceback (most recent call last):
File "C:\Users\JGC\Desktop\Trabalhos\Python\markovify\test\test_itertext.py", line 18, in test_without_retaining
senate_model = markovify.Text(f, retain_original=False)
File "C:\Users\JGC\Desktop\Trabalhos\Python\markovify\markovify\text.py", line 53, in __init__
parsed = parsed_sentences or self.generate_corpus(input_text)
File "C:\Users\JGC\Desktop\Trabalhos\Python\markovify\markovify\text.py", line 152, in generate_corpus
for line in text:
File "C:\Users\JGC\anaconda3\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 3552: character maps to <undefined>
----------------------------------------------------------------------
Ran 4 tests in 0.725s
FAILED (errors=3)
Running a conda 3.7.6 environment on Windows 10.
Thanks for flagging @JGCoelho. Judging by the error messages, this seems to be an issue with character encoding — possibly tied to Windows and/or Anaconda, but it's hard to tell. If you run the tests with a standard Python installation, instead of Anaconda, do you get the same problem? And can anyone else replicate these errors?
I tried cloning it again and running the unittest with a standard (non-Anaconda) Python 3.8.2. Same errors:
C:\Users\JGC\Desktop>git clone https://github.com/jsvine/markovify.git
Cloning into 'markovify'...
remote: Enumerating objects: 32, done.
remote: Counting objects: 100% (32/32), done.
remote: Compressing objects: 100% (30/30), done.
remote: Total 834 (delta 16), reused 10 (delta 2), pack-reused 802
Receiving objects: 100% (834/834), 461.29 KiB | 1.43 MiB/s, done.
Resolving deltas: 100% (495/495), done.
C:\Users\JGC\Desktop>cd markovify
C:\Users\JGC\Desktop\markovify>py --version
Python 3.8.2
C:\Users\JGC\Desktop\markovify>py -m unittest test.test_itertext
EE.E
======================================================================
ERROR: test_from_json_without_retaining (test.test_itertext.MarkovifyTest)
----------------------------------------------------------------------
Traceback (most recent call last):
File "C:\Users\JGC\Desktop\markovify\test\test_itertext.py", line 24, in test_from_json_without_retaining
original_model = markovify.Text(f, retain_original=False)
File "C:\Users\JGC\Desktop\markovify\markovify\text.py", line 53, in __init__
parsed = parsed_sentences or self.generate_corpus(input_text)
File "C:\Users\JGC\Desktop\markovify\markovify\text.py", line 152, in generate_corpus
for line in text:
File "C:\Users\JGC\AppData\Local\Programs\Python\Python38\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 3552: character maps to <undefined>
======================================================================
ERROR: test_from_mult_files_without_retaining (test.test_itertext.MarkovifyTest)
----------------------------------------------------------------------
Traceback (most recent call last):
File "C:\Users\JGC\Desktop\markovify\test\test_itertext.py", line 36, in test_from_mult_files_without_retaining
models.append(markovify.Text(f, retain_original=False))
File "C:\Users\JGC\Desktop\markovify\markovify\text.py", line 53, in __init__
parsed = parsed_sentences or self.generate_corpus(input_text)
File "C:\Users\JGC\Desktop\markovify\markovify\text.py", line 152, in generate_corpus
for line in text:
File "C:\Users\JGC\AppData\Local\Programs\Python\Python38\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 3552: character maps to <undefined>
======================================================================
ERROR: test_without_retaining (test.test_itertext.MarkovifyTest)
----------------------------------------------------------------------
Traceback (most recent call last):
File "C:\Users\JGC\Desktop\markovify\test\test_itertext.py", line 17, in test_without_retaining
senate_model = markovify.Text(f, retain_original=False)
File "C:\Users\JGC\Desktop\markovify\markovify\text.py", line 53, in __init__
parsed = parsed_sentences or self.generate_corpus(input_text)
File "C:\Users\JGC\Desktop\markovify\markovify\text.py", line 152, in generate_corpus
for line in text:
File "C:\Users\JGC\AppData\Local\Programs\Python\Python38\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 3552: character maps to <undefined>
----------------------------------------------------------------------
Ran 4 tests in 0.515s
FAILED (errors=3)
Maybe a problem with codecs? Opening the files sherlock.txt and senate-bills.txt, I could see that they are encoded as UTF-8 without BOM. I converted them to UTF-8 with BOM and got the same error. I also converted them to ANSI and UCS-2, to no avail.
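In case it helps narrow this down: both tracebacks end in cp1252.py, which suggests the files are being opened in text mode without an explicit encoding, so Python falls back to the platform default. Here is a minimal sketch for checking what that default is on a given machine. The helper name `default_open_encoding` is made up for illustration and is not part of markovify.

```python
import os
import tempfile


def default_open_encoding():
    """Return the encoding open() picks in text mode when none is passed.

    Hypothetical helper for diagnosis only; not part of markovify.
    """
    # Create an empty scratch file so there is something to open.
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        with open(path) as f:  # deliberately no encoding= argument
            return f.encoding
    finally:
        os.remove(path)


if __name__ == "__main__":
    # The result depends on the platform and its locale settings.
    print(default_open_encoding())
```

If this reports cp1252 on the affected machine, that would match the tracebacks: the test files are UTF-8, but they are being decoded with a different codec.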
Also, the byte 0x9d is most likely the last byte of the UTF-8 encoding (0xE2 0x80 0x9D) of 'RIGHT DOUBLE QUOTATION MARK' (U+201D) ”.
0x9d is unmapped in Windows-1252, according to Wikipedia.
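To illustrate the point about 0x9d, here is a small sketch that reproduces the decode failure without touching the repository, plus one possible workaround: passing `encoding="utf-8"` when opening the file. The names `decodes_as` and `read_lines_utf8` are hypothetical helpers written for this comment. I am assuming the test files are valid UTF-8, as reported above, and I have not checked how the tests in the repo actually open them.

```python
import os
import tempfile

# UTF-8 bytes for U+201D RIGHT DOUBLE QUOTATION MARK: E2 80 9D.
QUOTE_UTF8 = "\u201d".encode("utf-8")


def decodes_as(data, encoding):
    """Return True if the bytes decode cleanly with the given codec."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


def read_lines_utf8(path):
    """Read a text file line by line, forcing UTF-8 instead of the platform default."""
    with open(path, encoding="utf-8") as f:
        return [line for line in f]


if __name__ == "__main__":
    print(decodes_as(QUOTE_UTF8, "utf-8"))   # decodes cleanly
    print(decodes_as(QUOTE_UTF8, "cp1252"))  # fails on the 0x9d byte

    # Round-trip through a scratch file to show the explicit-encoding read.
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        with open(path, "wb") as f:
            f.write("He said \u201chello\u201d.\n".encode("utf-8"))
        print(read_lines_utf8(path))
    finally:
        os.remove(path)
```

The idea is that a file object opened with an explicit encoding can be passed to `markovify.Text(f, retain_original=False)` exactly as in the failing tests, so the result should no longer depend on the Windows default code page.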