Praat-textgrids
Creating TextGrid object fails for large TextGrid files
I'm creating a TextGrid object from a file like so
try: grid = textgrids.TextGrid(arg)
This works well for smaller files (~300 KB), but fails with the following error message for larger files (~1.7 to 3 MB):
Traceback (most recent call last):
  File "***", line 9, in <module>
    grid = textgrids.TextGrid(arg)
  File "***/.virtualenvs/test/lib/python3.9/site-packages/textgrids/__init__.py", line 151, in __init__
    self.read(self.filename)
  File "***/.virtualenvs/test/lib/python3.9/site-packages/textgrids/__init__.py", line 402, in read
    self.parse(data)
  File "***/.virtualenvs/test/lib/python3.9/site-packages/textgrids/__init__.py", line 288, in parse
    self._parse_long(buff)
  File "***/.virtualenvs/test/lib/python3.9/site-packages/textgrids/__init__.py", line 359, in _parse_long
    x0, x1 = [float(grab(s)) for s in data[p:p + 2]]
  File "***/.virtualenvs/test/lib/python3.9/site-packages/textgrids/__init__.py", line 359, in <listcomp>
    x0, x1 = [float(grab(s)) for s in data[p:p + 2]]
  File "***/.virtualenvs/test/lib/python3.9/site-packages/textgrids/__init__.py", line 339, in <lambda>
    grab = lambda s: s.split(' = ')[1]
IndexError: list index out of range
Might that be a bug in the library, or some Python limitation?
Hi, I'm also a user. I have had the same issue with a 500 KB textgrid. I suspect the issue occurs in the parse function. The line that causes trouble is:
grab = lambda s: s.split(' = ')[1]
But there are limits to what I can do to work around the issue. First, it seems to me that defining grab may not be the cause of the issue; rather, its use in the very next line is. My lack of knowledge of these functions and of how to track down errors in them leaves me helpless in such situations. Second, there might be a quick fix if we knew what file size triggers the issue. @Legisign: this may be a distant memory, but it would be of great help to us if that information were available.
That’s curious. That line only defines a function; I could have used the longer form with def but since this is only to take the value of a key = value pair it seemed good enough to use a lambda.
However, I could understand if calling the function caused a crash unless the string supplied as a parameter is of the form key = value. And that could mean one of two things:
- the file is detected as a long-form textgrid file when it’s actually a short-form one, or
- the parser tries to read a `key = value` pair when one is not coming up in the input.
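To make the failure mode concrete, here is a minimal sketch. The `grab` line is copied from the traceback; `grab_checked` is a hypothetical stricter variant, not something the library provides, that reports the offending line instead of failing on the subscript.

```python
# Copied from the traceback: take the second field after splitting on ' = '.
grab = lambda s: s.split(' = ')[1]


def grab_checked(line):
    """Hypothetical variant: raise a readable error for malformed lines."""
    key, sep, value = line.partition(' = ')
    if not sep:
        # No ' = ' separator found, so this is not a key = value line.
        raise ValueError(f'expected "key = value", got {line!r}')
    return value


print(grab('xmin = 0.5'))            # well-formed long-form line

try:
    grab('0.5')                      # a bare value, as a short-form file has
except IndexError as exc:
    print('IndexError:', exc)

try:
    grab_checked('0.5')
except ValueError as exc:
    print('ValueError:', exc)
```

With the checked variant, the error message would show which line of the input was not a `key = value` pair, which is the information missing from the original traceback.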
Part of the difficulty is that the Praat developers didn’t think it worth while to have distinct headers for the long and short-form textgrid files; both have file type "ooTextFile" and object class "TextGrid".
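Since the header is the same, the two forms can only be told apart by what follows it. The sketch below is one possible heuristic, not necessarily what the library does: in the long form the first value after the header is labelled (`xmin = 0`), in the short form it is a bare number. The function name `looks_long_form` and the sample strings are made up for illustration.

```python
def looks_long_form(text):
    """Heuristic: long-form files label their values, short-form ones don't."""
    # Drop blank lines so the header is always lines[0] and lines[1].
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    # The first line after the shared header is 'xmin = ...' in long form
    # and a bare number in short form.
    return len(lines) > 2 and lines[2].startswith('xmin')


# Tiny illustrative fragments, not complete textgrid files.
LONG = 'File type = "ooTextFile"\nObject class = "TextGrid"\n\nxmin = 0\nxmax = 2.5\n'
SHORT = 'File type = "ooTextFile"\nObject class = "TextGrid"\n\n0\n2.5\n'

print(looks_long_form(LONG), looks_long_form(SHORT))
```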
Dear Tommi, I have made an MWE with a textgrid attached. Should I e-mail them to you? Also, see my responses below. Hopefully this is nothing and the package will work for every type :) (By the way, I use it and try to test it; so far there is only one issue, which we can discuss later.)
That’s curious. That line only defines a function; I could have used the longer form with `def` but since this is only to take the value of a `key = value` pair it seemed good enough to use a `lambda`.
However, I could understand if calling the function caused a crash unless the string supplied as a parameter is of the form `key = value`. And that could mean one of two things:
1. the file is detected as a long-form textgrid file when it’s actually a short-form one, or
This is very unlikely because the package works ceteris paribus for small-size textgrids
2. the parser tries to read a `key = value` pair when one is not coming up in the input.
I've tried to see how `grab` works, and it seems simple enough, but I don't understand why the error is reported at line 372, where `grab` is defined, and not where it is used. But again, I am not experienced at debugging packages or functions.
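A small experiment may help with that question. Python reports, for each frame, the line of code that was running in it; the body of a one-line lambda sits on the same line as its definition, so the innermost frame points there even though the error only happens when the lambda is called. The `parse` function below is a made-up stand-in for the library's parser.

```python
import traceback

grab = lambda s: s.split(' = ')[1]      # definition: nothing is executed yet


def parse(lines):
    # The call happens here, but the failing expression lives in the lambda.
    return [float(grab(s)) for s in lines]


try:
    parse(['xmin = 0', '1.5'])          # the second string has no ' = '
except IndexError as exc:
    frames = traceback.extract_tb(exc.__traceback__)

# Frames are listed outermost first; the last one is the lambda itself.
for frame in frames:
    print(frame.name, frame.lineno)

innermost = frames[-1]
print(innermost.name == '<lambda>')
print(innermost.lineno == grab.__code__.co_firstlineno)
```

So the line to look at is the one just above the `<lambda>` entry in the traceback, which shows where `grab` was called and with what data.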
Part of the difficulty is that the Praat developers didn’t think it worth while to have distinct headers for the long and short-form textgrid files; both have file type "ooTextFile" and object class "TextGrid".
I myself don't really understand why there should be a need for two types of files, but anyway. I don't have the history.
Cheers anyway :)
I have made a MWE with textgrid attached. Should I e-mail them to you ?
That might be helpful…
I think there might be a third option too. As the value is converted to float as soon as it is read, it might trigger an error if it tried to read a non-numeric value. However, that seems unlikely too.
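One way to tell these cases apart is by exception type: a failed `float()` conversion raises `ValueError`, whereas the traceback in the report ends in `IndexError`, which comes from the `[1]` subscript. The sketch below uses a hypothetical helper, `classify`, and example lines of the kind found in long-form files.

```python
def classify(line):
    """Report which step of 'split, index, convert' fails for a line."""
    try:
        float(line.split(' = ')[1])
    except IndexError:
        return 'IndexError'      # no ' = ' in the line at all
    except ValueError:
        return 'ValueError'      # there is a value, but it is not numeric
    return 'ok'


for sample in ('xmin = 0', 'text = "hello"', 'intervals [1]:'):
    print(repr(sample), '->', classify(sample))
```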
I didn’t actually expect anyone to come up with a huge textgrid file as they tend to be significantly smaller than the sound files. The script basically just reads them in full instead of carefully reading line-by-line or chunk-by-chunk. In my setting (using a physical computer instead of, say, a virtual environment) it could gobble whatever I tried to feed it.
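For readers unfamiliar with the distinction, here is a generic sketch of the two reading styles using only the standard library. It is not the library's code, and the file content is a made-up stand-in rather than a real textgrid.

```python
import os
import tempfile

# Write a small stand-in file (hypothetical content).
with tempfile.NamedTemporaryFile('w', suffix='.TextGrid', delete=False) as tmp:
    tmp.write('xmin = 0\nxmax = 2.5\nsize = 1\n')
    path = tmp.name

# Style 1: read everything, then split. The whole text is in memory at once.
with open(path) as f:
    all_lines = f.read().splitlines()

# Style 2: iterate over the file object. Only one line is held at a time.
labelled = 0
with open(path) as f:
    for line in f:
        if ' = ' in line:
            labelled += 1

os.remove(path)
print(len(all_lines), labelled)
```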
I myself don't really understand why there should be a need for two types of files
I quite agree. In fact, both text file formats are poorly designed. The short form was probably meant for conserving some space but retaining legibility for humans: it’s actually quite close to the binary format in how it’s structured. The long form, on the other hand, is very readable for humans but a pain in the a*se to parse programmatically. No doubt the creators of Praat cannot change the format to JSON or XML or whatever any longer because everyone already has so many text-form textgrids lying around.
I'm running my script on bare metal too. My TextGrid files are transcriptions of sociolinguistic interviews of about an hour in length. With four transcription tiers I end up with files between 1 and 3+MB.
For the time being, I'll try to split the files into smaller parts.
I myself don't really understand why there should be a need for two types of files
I guess only the praat developers know that.
I looked at it over the weekend and it’s a tough call.
- There’s very little sense in iterating over the file (instead of reading it in one chunk), because the textgrid is organized so that the outer loop consists of tiers which are (usually) as long as the whole textgrid. The intervals and points you might reasonably expect to read item-by-item are in the inner loop.
- And in any case, iterating over the file would mean the basic structure would need to be changed. Now the whole textgrid is one `OrderedDict` which can be searched and manipulated as any `dict` can. If it were only a window into the textgrid, one would need different containers.
- Also, I began to suspect that the real issue isn’t so much the memory requirement of the data structures themselves but the fact that memory is simultaneously needed for (a) the unparsed file, (b) the `OrderedDict` as it is being built, and (c) the temporary variables and structures that are used in the parsing.
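The memory suspicion in the last point could be checked rather than guessed at. The sketch below uses the standard `tracemalloc` module on synthetic data; `parse_all_at_once` is a made-up toy parser, not the library's, so the numbers only illustrate the measurement technique and say nothing about the real parser.

```python
import tracemalloc
from collections import OrderedDict


def parse_all_at_once(text):
    """Toy parser: keeps the raw text, a line list and the result alive."""
    lines = text.splitlines()               # (c) temporary list of lines
    result = OrderedDict()                  # (b) structure being built
    for i, line in enumerate(lines):
        key, _, value = line.partition(' = ')
        result[i] = (key, value)
    return result


tracemalloc.start()
raw = ''.join(f'xmin = {i}\n' for i in range(50_000))   # (a) unparsed text
parsed = parse_all_at_once(raw)
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

# 'peak' includes the temporaries that were freed when the parser returned.
print(f'current: {current / 1e6:.1f} MB, peak: {peak / 1e6:.1f} MB')
```

If the peak for a real 3 MB file turned out to be modest, that would point away from memory and back towards the content of a specific line.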
So yeah, it’s not optimized in any way, but then again, I’m not a real programmer myself :) Huge files might need an altogether different kind of implementation.
I agree. It probably isn't a matter of overall size, but of a temporary variable.
So yeah, it’s not optimized in any way, but then again, I’m not a real programmer myself :) Huge files might need an altogether different kind of implementation.
Totally cool. With a warning for big TextGrids and some good-practice guidance on TextGrid file size, it'll work just fine. I wish I could help, but I'd need to learn a bit more about programming. Maybe a pointer (if such a thing exists in Python, I don't even know that) could solve the issue, dunno...