
UnicodeDecodeError when Prepare Data and Vocab

Open youngornever opened this issue 5 years ago • 0 comments

I get a UnicodeDecodeError when I run segment.py. I found that the error is caused by the data, not the code. For example, see line 87. Counting the lines that decode as UTF-8 gives valid lines: 5495620, error lines: 12109, total lines: 5507729. [image] Please check the dataset.

File "preprocess_zh/segment.py", line 119, in
    lines = [ x.decode('utf8') for x in open(where).readlines() ]
File "/home/user/xxxx/anaconda3/envs/tf2sks/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 3696: invalid continuation byte

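# Count the lines that fail to decode as UTF-8.
# `where` is the input file path, as in segment.py.
lines = []
err_line_ids = []
with open(where, "rb") as fp:
    for ii, x in enumerate(fp, 1):  # ii is the 1-based line number
        try:
            lines.append(x.decode('utf8'))
        except UnicodeDecodeError:
            # Record the line number of every line that is not valid UTF-8.
            err_line_ids.append(ii)
print("valid lines:{}, error lines:{}, total lines:{}".format(len(lines), len(err_line_ids), len(lines)+len(err_line_ids)))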
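
Until the dataset is fixed, one possible workaround is to read the file in binary mode and either skip the bad lines or keep them with the invalid bytes replaced. The sketch below is only an illustration of that idea: the function name `read_utf8_lines` and the `keep_bad` flag are hypothetical and are not part of the repository.

```python
def read_utf8_lines(path, keep_bad=False):
    """Read a file line by line, tolerating lines that are not valid UTF-8.

    Hypothetical helper, not part of segment.py.
    Returns (lines, bad_line_ids), where bad_line_ids holds 1-based line numbers.
    """
    lines = []
    bad_line_ids = []
    with open(path, "rb") as fp:
        for line_no, raw in enumerate(fp, 1):
            try:
                lines.append(raw.decode("utf8"))
            except UnicodeDecodeError:
                bad_line_ids.append(line_no)
                if keep_bad:
                    # Keep the line, substituting U+FFFD for each invalid byte sequence.
                    lines.append(raw.decode("utf8", errors="replace"))
    return lines, bad_line_ids
```

With `keep_bad=False` the bad lines are dropped, which matches the counting code above; with `keep_bad=True` the total line count is preserved, which may matter if other files are aligned with this one line by line.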

youngornever avatar Aug 14 '20 17:08 youngornever