epitran
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 970: character maps to <undefined>
After installing the library with pip and trying to initialize it, I'm getting the following error:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\elect\AppData\Local\Programs\Python\Python312\Lib\site-packages\epitran\_epitran.py", line 39, in __init__
    self.epi = SimpleEpitran(code, preproc, postproc, ligatures, rev, rev_preproc, rev_postproc, tones=tones)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\elect\AppData\Local\Programs\Python\Python312\Lib\site-packages\epitran\simple.py", line 46, in __init__
    self.ft = panphon.FeatureTable()
              ^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\elect\AppData\Local\Programs\Python\Python312\Lib\site-packages\panphon\featuretable.py", line 62, in __init__
    self.segments, self.seg_dict, self.names = self._read_bases(bases_fn, self.weights)
                                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\elect\AppData\Local\Programs\Python\Python312\Lib\site-packages\panphon\featuretable.py", line 81, in _read_bases
    header = next(reader)
             ^^^^^^^^^^^^
  File "C:\Users\elect\AppData\Local\Programs\Python\Python312\Lib\encodings\cp1250.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 970: character maps to <undefined>
epi = epitran.Epitran("eng-Latn")
The error occurs regardless of which language code I pass.
Python 3.12.5, epitran 1.25.1
Just to confirm: I have exactly the same issue with "deu-Latn".
Has this problem been solved?
My error is: UnicodeDecodeError: 'gbk' codec can't decode byte 0xa3 in position 7832: illegal multibyte sequence.
So I modified panphon's code to force the file to be read using utf-8 encoding:
- Open the panphon/featuretable.py file.
- Find the _read_bases function; it should be around line 76.
- Modify the open() call to specify utf-8 as the encoding, like this:
with open(fn, encoding='utf-8') as f:
    reader = csv.reader(f)
    header = next(reader)
    ...
This workaround works for me, but it is only temporary: reinstalling or upgrading panphon will overwrite the edited file.
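For anyone who wants to see why the explicit encoding matters, here is a minimal sketch that reproduces the failure without panphon installed. The temporary CSV and the read_rows helper are made up for illustration and are not part of panphon; the only assumption is that the real data file is UTF-8 and contains IPA symbols such as "ɐ", whose UTF-8 bytes are C9 90, which would match the 0x90 byte in the cp1250 traceback above.

```python
import csv
import os
import tempfile

# Hypothetical stand-in for panphon's feature table: a UTF-8 CSV with an IPA symbol.
rows = [["ipa", "syl"], ["ɐ", "+"]]

# Write the file as UTF-8, the way the packaged data file is assumed to be stored.
with tempfile.NamedTemporaryFile(
    "w", encoding="utf-8", newline="", suffix=".csv", delete=False
) as tmp:
    csv.writer(tmp).writerows(rows)
    path = tmp.name


def read_rows(fn, encoding):
    """Read every CSV row from fn, decoding the file with the given encoding."""
    with open(fn, encoding=encoding, newline="") as f:
        return list(csv.reader(f))


try:
    # An explicit UTF-8 read works no matter what the system locale is.
    utf8_rows = read_rows(path, "utf-8")

    # Decoding the same bytes as cp1250 fails, because 0x90 is unmapped there.
    try:
        read_rows(path, "cp1250")
        legacy_failed = False
    except UnicodeDecodeError:
        legacy_failed = True
finally:
    os.remove(path)
```

If you would rather not edit files under site-packages, enabling Python's UTF-8 mode (running with python -X utf8, or setting the PYTHONUTF8=1 environment variable) makes open() default to UTF-8 and may also avoid the error. I have not verified this against every epitran and panphon version.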
I'm having the same issue on Windows 11 with the most recent Python and Epitran versions.