pympi
TextGrids cannot be read if they contain special/IPA characters
Expected behaviour
Read in a textgrid (long format) using: tg = pympi.Praat.TextGrid(path_to_textgrid)
Actual behaviour
Throws an AttributeError (included below) and halts if the contents of any interval tier contain non-ASCII characters such as ɪ or ŋ or ɛ. All other TextGrids are imported without issues as expected.
System information
- python version: 3.x (Jupyter Notebook kernel)
- os: Mac OS 13.4.1 (Ventura)
- are you up to date with the latest master?: Yes
Offending notebook cell (which imports any TGs not containing ɛ or ɪ just fine):
for subj in os.listdir(corpus):
    for file in os.listdir(os.path.join(corpus, subj)):
        if not file.endswith(".TextGrid"):
            continue
        print(file)
        tg = pympi.Praat.TextGrid(os.path.join(corpus, subj, file))
Full traceback of the issue I am encountering is included below.
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Cell In[40], line 11
9 continue
10 print(file)
---> 11 tg = pympi.Praat.TextGrid(os.path.join(corpus,subj,file))
12 for tier in tg.get_tiers():
13 print(tier.name)
File ~/miniconda3/envs/cameroon/lib/python3.11/site-packages/pympi/Praat.py:44, in TextGrid.__init__(self, file_path, xmin, xmax, codec)
42 else:
43 with open(file_path, 'rb') as f:
---> 44 self.from_file(f, codec)
File ~/miniconda3/envs/cameroon/lib/python3.11/site-packages/pympi/Praat.py:101, in TextGrid.from_file(self, ifile, codec)
99 # Skip the Headers and empty line
100 next(ifile), next(ifile), next(ifile)
--> 101 self.xmin = float(nn(ifile, regfloat))
102 self.xmax = float(nn(ifile, regfloat))
103 # Skip <exists>
File ~/miniconda3/envs/cameroon/lib/python3.11/site-packages/pympi/Praat.py:94, in TextGrid.from_file.<locals>.nn(ifile, pat)
92 def nn(ifile, pat):
93 line = next(ifile).decode(codec)
---> 94 return pat.search(line).group(1)
AttributeError: 'NoneType' object has no attribute 'group'
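The traceback shows the file being opened in binary mode ('rb') and each line being decoded on its own before the regex is applied. A possible explanation, which is an assumption and not something confirmed by the maintainers, is that this breaks for any file stored in a two-byte encoding such as UTF-16: iterating a binary file splits on the single byte 0x0A, which is only half of a two-byte newline, so every following line starts one byte out of alignment. The sketch below reproduces that effect with the standard library only. The pattern `regfloat` is a stand-in modelled on the name in the traceback, and the sample header text is made up for illustration.

```python
import codecs
import io
import re

# Stand-in for the pattern named in the traceback (the exact regex is assumed).
regfloat = re.compile(r'([\d.]+)\s*$')

# Illustrative TextGrid-like header, not taken from a real file.
text = 'File type = "ooTextFile"\nObject class = "TextGrid"\n\nxmin = 0\nxmax = 2.5\n'
raw = codecs.BOM_UTF16_LE + text.encode('utf-16-le')

# Binary iteration splits on the byte 0x0A, leaving the second newline byte
# (0x00) stuck to the front of the next line.
lines = list(io.BytesIO(raw))
fourth = lines[3]  # where "xmin = 0" should be

# Decoding the misaligned bytes gives text the regex cannot match,
# whichever codec is passed in.
as_ascii = regfloat.search(fourth.decode('ascii'))
as_utf16 = regfloat.search(fourth.decode('utf-16-le'))

# Decoding the whole file first and splitting afterwards works as expected.
good = regfloat.search(raw.decode('utf-16').splitlines()[3])

print(as_ascii, as_utf16, good.group(1))
```

If this is indeed the cause, it would also account for the error being identical with and without the correct codec, since the misalignment happens before any decoding takes place.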
As a small update, this occurs regardless of whether the file's encoding is correctly specified in the codec parameter of pympi.Praat.TextGrid(). The files with IPA characters turn out to be in UTF-16 for some reason, whereas all others are in ASCII. But specifying the correct codec doesn't actually solve the issue, whatever it is.
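Until the underlying cause is fixed, one possible workaround is to convert the UTF-16 files to UTF-8 before handing them to pympi. The sketch below uses only the standard library; `sniff_encoding` and `transcode_to_utf8` are hypothetical helper names, not part of pympi, and the approach assumes the affected files start with a byte-order mark. It has not been verified against pympi itself.

```python
import codecs
import os
import tempfile


def sniff_encoding(path, default='utf-8'):
    """Guess a text encoding from the byte-order mark (hypothetical helper)."""
    with open(path, 'rb') as f:
        head = f.read(4)
    # Check UTF-32 first: its little-endian BOM begins with the UTF-16 LE BOM.
    if head.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return 'utf-32'
    if head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return 'utf-16'
    if head.startswith(codecs.BOM_UTF8):
        return 'utf-8-sig'
    return default  # no BOM: plain ASCII files decode fine as UTF-8


def transcode_to_utf8(path):
    """Write a UTF-8 copy of the file to a temp file and return its path."""
    encoding = sniff_encoding(path)
    # newline='' keeps the original line endings untouched.
    with open(path, 'r', encoding=encoding, newline='') as src:
        text = src.read()
    fd, out_path = tempfile.mkstemp(suffix='.TextGrid')
    with os.fdopen(fd, 'w', encoding='utf-8', newline='') as dst:
        dst.write(text)
    return out_path
```

The converted copy could then be loaded with something like `pympi.Praat.TextGrid(transcode_to_utf8(path), codec='utf-8')`, using the codec parameter visible in the traceback. UTF-8 never produces a stray 0x0A byte inside a multi-byte character, so line-by-line decoding should stay aligned.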