UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 970: character maps to <undefined>

Open Looki2000 opened this issue 10 months ago • 3 comments

After installing the library with pip and trying to initialize it, I'm getting the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\elect\AppData\Local\Programs\Python\Python312\Lib\site-packages\epitran\_epitran.py", line 39, in __init__
    self.epi = SimpleEpitran(code, preproc, postproc, ligatures, rev, rev_preproc, rev_postproc, tones=tones)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\elect\AppData\Local\Programs\Python\Python312\Lib\site-packages\epitran\simple.py", line 46, in __init__
    self.ft = panphon.FeatureTable()
              ^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\elect\AppData\Local\Programs\Python\Python312\Lib\site-packages\panphon\featuretable.py", line 62, in __init__
    self.segments, self.seg_dict, self.names = self._read_bases(bases_fn, self.weights)
                                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\elect\AppData\Local\Programs\Python\Python312\Lib\site-packages\panphon\featuretable.py", line 81, in _read_bases
    header = next(reader)
             ^^^^^^^^^^^^
  File "C:\Users\elect\AppData\Local\Programs\Python\Python312\Lib\encodings\cp1250.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 970: character maps to <undefined>

epi = epitran.Epitran("eng-Latn")

It does not matter which language code is used.

Python 3.12.5, epitran 1.25.1
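
The last frame of the traceback is in cp1250.py, which suggests the data file is being decoded with the Windows locale code page rather than UTF-8. Below is a minimal sketch of that failure mode, not panphon's actual code: the CSV contents and file name are made up for illustration, and the only assumption is that the real file holds UTF-8 encoded IPA symbols such as ɐ.

```python
import csv
import os
import tempfile

# Hypothetical sample data: a tiny CSV with one IPA symbol.
# "ɐ" (U+0250) is encoded in UTF-8 as the two bytes C9 90.
rows = "ipa,syl\nɐ,+\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.csv")  # illustrative file name
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write(rows)

    # Decoding with a legacy Windows code page fails, because byte 0x90
    # has no character assigned in cp1250.
    try:
        with open(path, encoding="cp1250", newline="") as f:
            list(csv.reader(f))
        cp1250_error = None
    except UnicodeDecodeError as exc:
        cp1250_error = exc

    # Decoding with an explicit UTF-8 encoding succeeds.
    with open(path, encoding="utf-8", newline="") as f:
        utf8_rows = list(csv.reader(f))

print("cp1250 read failed:", cp1250_error is not None)
print("utf-8 rows:", utf8_rows)
```

If this reproduces on your machine, the problem is likely the default encoding used when the file is opened, not the language code passed to Epitran.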

Looki2000 avatar Jan 05 '25 17:01 Looki2000

Just to confirm that I have exactly the same issue with "deu-Latn"

nicloay avatar Jan 07 '25 22:01 nicloay

Has this problem been solved?

tlemangen avatar Jan 26 '25 17:01 tlemangen

My error is: UnicodeDecodeError: 'gbk' codec can't decode byte 0xa3 in position 7832: illegal multibyte sequence. So I modified panphon's code to force the file to be read using utf-8 encoding:

  1. Open the panphon/featuretable.py file.
  2. Find the _read_bases function; it should be at line 76.
  3. Modify the open() call to specify utf-8 as the encoding, like this:
with open(fn, encoding='utf-8') as f:
    reader = csv.reader(f)
    header = next(reader)
    ...

This solution works for me, but I consider it a temporary workaround.
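
A possible alternative that avoids editing the installed package is Python's UTF-8 Mode, enabled by setting the environment variable PYTHONUTF8=1 or by starting the interpreter with `python -X utf8`. In that mode, open() defaults to UTF-8 when no encoding is given. This has not been confirmed against epitran in this thread, so treat it as an assumption to test. The sketch below only reports what the current interpreter would use; the helper names are hypothetical.

```python
import locale
import sys


def default_text_encoding() -> str:
    """Return the encoding open() falls back to when none is specified."""
    return locale.getpreferredencoding(False)


def utf8_mode_enabled() -> bool:
    """Return True if the interpreter was started in UTF-8 Mode."""
    return bool(sys.flags.utf8_mode)


print("UTF-8 Mode enabled:", utf8_mode_enabled())
print("Default text encoding:", default_text_encoding())
```

If the reported default is something like cp1250 or gbk, a file opened without an explicit encoding will be decoded with that code page, which matches the errors reported here.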

tlemangen avatar Jan 26 '25 17:01 tlemangen

I'm having the same issue on Windows 11 with the most recent Python and Epitran versions.

SpectralPixel avatar May 03 '25 12:05 SpectralPixel