
TextGrids cannot be read if they contain special/IPA characters

Open mfaytak opened this issue 10 months ago • 1 comments

Expected behaviour

Read in a TextGrid (long format) using: tg = pympi.Praat.TextGrid(path_to_textgrid)

Actual behaviour

Throws an AttributeError (full traceback below) and halts if any interval tier contains non-ASCII characters such as ɪ, ŋ, or ɛ. All other TextGrids are imported as expected.

System information

  • python version: 3.11 (Jupyter Notebook kernel)
  • os: macOS 13.4.1 (Ventura)
  • are you up to date with the latest master?: Yes

Offending notebook cell (which imports any TGs not containing ɛ or ɪ just fine):

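import os
import pympi

# corpus: path to the corpus root directory, defined earlier in the notebook
for subj in os.listdir(corpus):
    for file in os.listdir(os.path.join(corpus, subj)):
        if not file.endswith(".TextGrid"):
            continue
        print(file)
        tg = pympi.Praat.TextGrid(os.path.join(corpus, subj, file))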

Full traceback of the issue I am encountering is included below.

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[40], line 11
      9     continue
     10 print(file)
---> 11 tg = pympi.Praat.TextGrid(os.path.join(corpus,subj,file))
     12 for tier in tg.get_tiers():
     13     print(tier.name)

File ~/miniconda3/envs/cameroon/lib/python3.11/site-packages/pympi/Praat.py:44, in TextGrid.__init__(self, file_path, xmin, xmax, codec)
     42 else:
     43     with open(file_path, 'rb') as f:
---> 44         self.from_file(f, codec)

File ~/miniconda3/envs/cameroon/lib/python3.11/site-packages/pympi/Praat.py:101, in TextGrid.from_file(self, ifile, codec)
     99 # Skip the Headers and empty line
    100 next(ifile), next(ifile), next(ifile)
--> 101 self.xmin = float(nn(ifile, regfloat))
    102 self.xmax = float(nn(ifile, regfloat))
    103 # Skip <exists>

File ~/miniconda3/envs/cameroon/lib/python3.11/site-packages/pympi/Praat.py:94, in TextGrid.from_file.<locals>.nn(ifile, pat)
     92 def nn(ifile, pat):
     93     line = next(ifile).decode(codec)
---> 94     return pat.search(line).group(1)

AttributeError: 'NoneType' object has no attribute 'group'
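
The traceback shows the file being opened in binary mode ('rb') and each line being decoded on its own before a regex search. The following is a minimal sketch of how that pattern can end in this exact AttributeError when the bytes are UTF-16. It is not pympi's code: FLOAT_PAT is a stand-in for whatever pattern the parser uses, HEADER is a made-up TextGrid-like header, and the little-endian byte order is an assumption about the input.

```python
import io
import re

# Stand-in for the parser's float pattern (an assumption, not pympi's actual regex).
FLOAT_PAT = re.compile(r'([\d.]+)\s*$')

# Made-up TextGrid-like header, used only for illustration.
HEADER = 'File type = "ooTextFile"\nObject class = "TextGrid"\n\nxmin = 0\nxmax = 2.5\n'


def search_fourth_line(data, codec):
    """Mimic the traceback: read binary lines, skip three, decode one, search."""
    ifile = io.BytesIO(data)
    next(ifile), next(ifile), next(ifile)
    line = next(ifile).decode(codec)
    return FLOAT_PAT.search(line)  # None when no trailing number is found


utf8_bytes = HEADER.encode('utf-8')
utf16_bytes = b'\xff\xfe' + HEADER.encode('utf-16-le')  # BOM + little-endian UTF-16

ok = search_fourth_line(utf8_bytes, 'utf-8')
wrong_codec = search_fourth_line(utf16_bytes, 'utf-8')
right_codec = search_fourth_line(utf16_bytes, 'utf-16-le')

print(ok.group(1))   # prints 0
print(wrong_codec)   # None: NUL bytes sit between the characters
print(right_codec)   # None: splitting on the 0x0A byte shifts each line by one byte
```

In this sketch, calling .group(1) on either None result would raise the same "'NoneType' object has no attribute 'group'" error. Decoding the whole byte string first (utf16_bytes.decode('utf-16')) and splitting into lines afterwards recovers the expected text.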

mfaytak, Sep 19 '23 23:09

As a small update: this occurs regardless of whether the file's encoding is correctly specified in the codec parameter of pympi.Praat.TextGrid(). The files with IPA characters turn out to be in UTF-16, whereas all the others are in ASCII. Specifying the correct codec does not resolve the error, so the cause appears to lie elsewhere.
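
One possible workaround, sketched here under assumptions and not verified against pympi, is to transcode the UTF-16 files to UTF-8 before parsing. The helper names detect_codec and transcode_to_utf8 are hypothetical; they use only the standard library and assume the UTF-16 files start with a byte-order mark.

```python
import codecs

# Byte-order marks checked in order; files without one fall back to `default`.
_BOMS = (
    (codecs.BOM_UTF8, 'utf-8-sig'),
    (codecs.BOM_UTF16_LE, 'utf-16'),
    (codecs.BOM_UTF16_BE, 'utf-16'),
)


def detect_codec(path, default='utf-8'):
    """Guess a text codec from the file's byte-order mark, if any."""
    with open(path, 'rb') as f:
        head = f.read(3)
    for bom, codec in _BOMS:
        if head.startswith(bom):
            return codec
    return default


def transcode_to_utf8(src, dst):
    """Decode the whole file at once, then rewrite it as BOM-less UTF-8."""
    codec = detect_codec(src)
    # newline='' keeps the original line endings untouched in both directions.
    with open(src, 'r', encoding=codec, newline='') as f:
        text = f.read()
    with open(dst, 'w', encoding='utf-8', newline='') as f:
        f.write(text)
    return codec
```

The converted copy could then be passed on as pympi.Praat.TextGrid(dst, codec='utf-8'). Whether that avoids the error for these particular files is untested here.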

mfaytak, Sep 20 '23 16:09