theanets
Handle unicode text
Attempting to use theanets.recurrent.Text on a UTF-8 encoded corpus used to give an error:
---------------------------------------------------------------------------
UnicodeEncodeError Traceback (most recent call last)
/home/fl350/bachbot/scripts/theanet/theanet.py in <module>()
24 with codecs.open(path, 'r', 'utf-8') as handle:
25 file_data = handle.read().lower()
---> 26 text = theanets.recurrent.Text(file_data[:int(VAL_FRACTION*len(file_data))])
27 text_val = theanets.recurrent.Text(file_data[int(VAL_FRACTION*len(file_data)):])
28
/home/fl350/theanets/theanets/recurrent.py in __init__(self, text, alpha, min_count, unknown)
89 collections.Counter(text).items()
90 if char != unknown and count >= min_count)))
---> 91 print type(r'[^{}]'.format(re.escape(self.alpha)).encode('utf8'))
92 self.text = re.sub(r'[^{}]'.format(re.escape(self.alpha)).encode('utf8'), unknown, text)
93 assert unknown not in self.alpha
UnicodeEncodeError: 'ascii' codec can't encode character u'\x83' in position 85: ordinal not in range(128)
This is fixed by this PR.
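For reference, the step that fails in the traceback builds a regex character class from the corpus alphabet and replaces everything outside it with the unknown marker. Below is a minimal standalone sketch of that step which keeps the text, the alphabet, and the pattern as unicode throughout. It is not the patch in this PR: the function name filter_alphabet, its defaults, and the sample string are hypothetical, chosen only for illustration.

```python
import collections
import re


def filter_alphabet(text, min_count=2, unknown=u"?"):
    """Replace characters seen fewer than min_count times with `unknown`.

    Hypothetical helper, not part of theanets. Everything stays unicode:
    there is no .encode() call on the pattern.
    """
    counts = collections.Counter(text)
    # Alphabet: every sufficiently frequent character except the marker itself.
    alpha = u"".join(sorted(
        char for char, count in counts.items()
        if char != unknown and count >= min_count))
    if not alpha:
        # An empty character class is not a valid regex, so handle it directly.
        return alpha, unknown * len(text)
    # re.escape protects characters such as ']' or '-' inside the class.
    pattern = u"[^{}]".format(re.escape(alpha))
    return alpha, re.sub(pattern, unknown, text)


# Made-up sample: u"\u00e9" is a non-ASCII letter that occurs twice, "x" once.
alpha, cleaned = filter_alphabet(u"h\u00e9llo h\u00e9llo x")
```

Because the pattern and the input are both unicode, no implicit ASCII encoding step is involved, which is where the UnicodeEncodeError above comes from.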
Coverage decreased (-0.1%) to 94.768% when pulling eaca4337d972edfe1d44a93e2d93701dbab98766 on feynmanliang:text-handle-utf into b637b01bc4f1ef69fda9a23f5637462a1188ebdb on lmjohns3:master.
This can get pretty tricky with text encodings. My preference is to always operate with unicode, because then iterating over a string is guaranteed to iterate over a "letter" instead of iterating over parts of multi-byte characters. That said, I haven't been very careful about enforcing this!
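To make the difference concrete, here is a small sketch comparing the length of a unicode string with the length of its UTF-8 encoding. The sample string is made up for illustration, and "letter" here means a unicode code point.

```python
# Illustrative only: the sample string is made up, not taken from any corpus.
text = u"na\u00efve caf\u00e9"   # unicode text containing two accented letters
encoded = text.encode("utf-8")   # the same text as a UTF-8 byte sequence

num_chars = len(text)      # counts code points
num_bytes = len(encoded)   # counts bytes; each accented letter takes two

print(num_chars, num_bytes)
```

Iterating over encoded would visit the two halves of each accented letter separately, which is why a character-level model is better fed the unicode form.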
This is additionally complicated by the fact that Py2 and Py3 have different defaults for handling strings. I personally use Py3 but I try to test everything with Py2 as well (see the Travis config).
Which version of Python are you using? Can you try using a "unicode" object instead of a UTF-8 encoded byte sequence to see if this problem persists? Can you add a test to run a unicode object through the recurrent infrastructure and add it to this PR? Also, this PR breaks an existing test; please fix that as well.
Thanks for taking a look. I will push some changes soon to address the issues.
- I'm using 2.7.3
- I can repro with the following code (assuming path points to a file with UTF-8 encoded strings):
with codecs.open(path, 'r', 'utf-8') as handle:
    file_data = handle.read().lower()
text = theanets.recurrent.Text(file_data[:int(TRAIN_FRACTION*len(file_data))])
text_val = theanets.recurrent.Text(file_data[int(TRAIN_FRACTION*len(file_data)):])
or using a unicode object:
with open(path, 'r') as handle:
    file_data = unicode(handle.read(), 'utf-8').lower()
text = theanets.recurrent.Text(file_data[:int(TRAIN_FRACTION*len(file_data))])
text_val = theanets.recurrent.Text(file_data[int(TRAIN_FRACTION*len(file_data)):])