seqgen
UnicodeDecodeError during the "Prepare Data and Vocab" step
I get a UnicodeDecodeError when I run segment.py.
As far as I can tell, the error is caused by the data and the code is OK: some lines in the dataset are not valid UTF-8. See line 87, for example.
Decoding the file line by line, I get valid lines: 5495620, error lines: 12109, total lines: 5507729.
Please check the dataset.
  File "preprocess_zh/segment.py", line 119, in <module>
    lines = [ x.decode('utf8') for x in open(where).readlines() ]
  File "/home/user/xxxx/anaconda3/envs/tf2sks/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 3696: invalid continuation byte
# Count the lines that fail to decode as UTF-8 (`where` is the path opened in segment.py).
lines = []
err_line_ids = []
with open(where, "rb") as fp:
    for ii, x in enumerate(fp, 1):
        try:
            lines.append(x.decode('utf8'))
        except UnicodeDecodeError:
            err_line_ids.append(ii)
            # pdb.set_trace()
print("valid lines:{}, error lines:{}, total lines:{}".format(len(lines), len(err_line_ids), len(lines)+len(err_line_ids)))
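In case it helps to track down the bad lines, here is a minimal sketch that also records where in each line the decoding fails and why. The helper name `scan_utf8` and the tiny sample file are made up for illustration and are not part of segment.py. Since the traceback goes through codecs.py, it looks like the decoding already happens inside `readlines()` on Python 3, so the last part shows one possible workaround: opening the file with `errors="replace"`. Whether replacing the bad bytes is acceptable for this dataset is an assumption on my side.

```python
import os
import tempfile


def scan_utf8(path):
    """Return (good_lines, bad), where bad holds (line_no, byte_offset, reason).

    Hypothetical helper, not part of segment.py.
    """
    good_lines, bad = [], []
    with open(path, "rb") as fp:  # binary mode, so nothing is decoded implicitly
        for line_no, raw in enumerate(fp, 1):
            try:
                good_lines.append(raw.decode("utf-8"))
            except UnicodeDecodeError as exc:
                # exc.start is the offset of the offending byte within this line
                bad.append((line_no, exc.start, exc.reason))
    return good_lines, bad


# Made-up sample: the second line has 0xed followed by a non-continuation byte.
sample = "好的\n".encode("utf-8") + b"abc\xedA\n" + "再见\n".encode("utf-8")
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(sample)

good, bad = scan_utf8(tmp.name)

# Possible workaround: keep every line and substitute U+FFFD for undecodable bytes.
with open(tmp.name, encoding="utf-8", errors="replace") as fp:
    replaced = fp.readlines()

os.remove(tmp.name)
print(len(good), bad, len(replaced))
```

With `errors="ignore"` the bad bytes would be dropped instead of replaced. Either way the line count stays the same, which may matter if the file is aligned line by line with another file.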