pylangacq
ValueError: cannot align the utterance and %mor tiers (v2)
Similar error to https://github.com/jacksonllee/pylangacq/issues/23, but this one still occurs even after 0.19.1.
Could we have a "best-effort" mode that records an error field on each failing utterance instead of crashing on the whole file, so that we can at least use the partial data? (This is in addition to fixing the bug itself.)
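Until such a mode exists, a coarser file-level workaround can be sketched as follows. This is only an illustration, not part of pylangacq: `read_best_effort` and its `read_fn` parameter are hypothetical names, and it skips a whole failing file rather than a single failing utterance.

```python
from typing import Any, Callable, Dict, Iterable, Tuple


def read_best_effort(
    paths: Iterable[str],
    read_fn: Callable[[str], Any],
) -> Tuple[Dict[str, Any], Dict[str, str]]:
    """Read each path separately and collect failures instead of raising.

    Hypothetical helper: `read_fn` is whatever callable parses one file.
    """
    readers: Dict[str, Any] = {}
    errors: Dict[str, str] = {}
    for path in paths:
        try:
            readers[path] = read_fn(path)
        except ValueError as exc:
            # The alignment failure in this report surfaces as ValueError,
            # so only that exception type is caught here.
            errors[path] = str(exc)
    return readers, errors
```

With pylangacq installed, `read_fn` would be `pylangacq.read_chat`, the call used in the report. The files listed in `errors` can then be inspected by hand while the rest of the corpus is processed.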
Describe the bug
import pylangacq
file_cha = f"{paths.dir_talkbank_media}/fluency/UMD-CMU/Control/205DM_parent_y1.cha"
reader = pylangacq.read_chat(file_cha)  # raises ValueError (traceback below)
Relevant CHILDES or TalkBank data
fluency/UMD-CMU/Control/205DM_parent_y1.cha
Additional context
concurrent.futures.process._RemoteTraceback:
"""
Traceback (most recent call last):
File "/home/timothee/.conda/envs/speakerid_cuda/lib/python3.10/concurrent/futures/process.py", line 246, in _process_worker
r = call_item.fn(*call_item.args, **call_item.kwargs)
File "/home/timothee/.conda/envs/speakerid_cuda/lib/python3.10/concurrent/futures/process.py", line 205, in _process_chunk
return [fn(*args) for args in chunk]
File "/home/timothee/.conda/envs/speakerid_cuda/lib/python3.10/concurrent/futures/process.py", line 205, in <listcomp>
return [fn(*args) for args in chunk]
File "/home/timothee/.conda/envs/speakerid_cuda/lib/python3.10/site-packages/pylangacq/chat.py", line 1455, in _parse_chat_str
utterances = self._get_utterances(all_tiers)
File "/home/timothee/.conda/envs/speakerid_cuda/lib/python3.10/site-packages/pylangacq/chat.py", line 1510, in _get_utterances
raise ValueError(
ValueError: cannot align the utterance and %mor tiers:
Tiers --
{'ADU': 'okay I think it;s time for our next game . \x151741207_1746675\x15', '%mor': 'adj|okay pro:sub|I v|think pro:per|it n:let|s n|time prep|for det:poss|our adj|next n|game .', '%gra': '1|3|LINK 2|3|SUBJ 3|0|ROOT 4|3|OBJ 5|6|MOD 6|4|POBJ 7|6|NJCT 8|10|DET 9|10|MOD 10|7|POBJ 11|3|PUNCT'}
Cleaned-up utterance --
okay I think it;s time for our next game .
Parsed %mor tier --
['adj|okay', 'pro:sub|I', 'v|think', 'pro:per|it', 'n:let|s', 'n|time', 'prep|for', 'det:poss|our', 'adj|next', 'n|game', '.']
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/pathto/talkbank_utils.py", line 363, in bug_D20240328T013624
reader = pylangacq.read_chat(file_cha)
File "/home/timothee/.conda/envs/speakerid_cuda/lib/python3.10/site-packages/pylangacq/chat.py", line 197, in wrapper
return func(*args, **kwargs)
File "/home/timothee/.conda/envs/speakerid_cuda/lib/python3.10/site-packages/pylangacq/chat.py", line 1887, in read_chat
return cls.from_files([path], match=match, exclude=exclude, encoding=encoding)
File "/home/timothee/.conda/envs/speakerid_cuda/lib/python3.10/site-packages/pylangacq/chat.py", line 197, in wrapper
return func(*args, **kwargs)
File "/home/timothee/.conda/envs/speakerid_cuda/lib/python3.10/site-packages/pylangacq/chat.py", line 1034, in from_files
return cls.from_strs(strs, paths, parallel=parallel)
File "/home/timothee/.conda/envs/speakerid_cuda/lib/python3.10/site-packages/pylangacq/chat.py", line 197, in wrapper
return func(*args, **kwargs)
File "/home/timothee/.conda/envs/speakerid_cuda/lib/python3.10/site-packages/pylangacq/chat.py", line 995, in from_strs
reader._parse_chat_strs(strs, ids, parallel)
File "/home/timothee/.conda/envs/speakerid_cuda/lib/python3.10/site-packages/pylangacq/chat.py", line 264, in _parse_chat_strs
self._files = collections.deque(
File "/home/timothee/.conda/envs/speakerid_cuda/lib/python3.10/concurrent/futures/process.py", line 575, in _chain_from_iterable_of_lists
for element in iterable:
File "/home/timothee/.conda/envs/speakerid_cuda/lib/python3.10/concurrent/futures/_base.py", line 621, in result_iterator
yield _result_or_cancel(fs.pop())
File "/home/timothee/.conda/envs/speakerid_cuda/lib/python3.10/concurrent/futures/_base.py", line 319, in _result_or_cancel
return fut.result(timeout)
File "/home/timothee/.conda/envs/speakerid_cuda/lib/python3.10/concurrent/futures/_base.py", line 458, in result
return self.__get_result()
File "/home/timothee/.conda/envs/speakerid_cuda/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
raise self._exception
ValueError: cannot align the utterance and %mor tiers:
Tiers --
{'ADU': 'okay I think it;s time for our next game . \x151741207_1746675\x15', '%mor': 'adj|okay pro:sub|I v|think pro:per|it n:let|s n|time prep|for det:poss|our adj|next n|game .', '%gra': '1|3|LINK 2|3|SUBJ 3|0|ROOT 4|3|OBJ 5|6|MOD 6|4|POBJ 7|6|NJCT 8|10|DET 9|10|MOD 10|7|POBJ 11|3|PUNCT'}
Cleaned-up utterance --
okay I think it;s time for our next game .
Parsed %mor tier --
['adj|okay', 'pro:sub|I', 'v|think', 'pro:per|it', 'n:let|s', 'n|time', 'prep|for', 'det:poss|our', 'adj|next', 'n|game', '.']
In this particular instance, the problem is a misalignment between it;s in the utterance (probably a typo for it's) and pro:per|it n:let|s in the %mor tier. The n:let|s tag is most likely simply wrong, possibly because the typo in the utterance threw off the automatic morphosyntactic tagger that CHILDES uses to generate the %mor tier. Since pro:per|it n:let|s in %mor carries no indication whatsoever of a clitic or contraction, pylangacq cannot tell that the two %mor items belong to a single word, so there is no way for it to parse this utterance correctly.
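The mismatch can be checked directly from the tiers printed in the traceback. The snippet below is a simplified illustration that only splits on whitespace; it is not pylangacq's actual alignment logic. It assumes that CHAT marks a clitic inside a %mor item with "~".

```python
# Tiers copied from the traceback in this report.
utterance = "okay I think it;s time for our next game ."
mor = (
    "adj|okay pro:sub|I v|think pro:per|it n:let|s n|time "
    "prep|for det:poss|our adj|next n|game ."
)

words = utterance.split()
mor_items = mor.split()

# Assumption: a clitic would appear as "~" inside a single %mor item,
# which would let two morphological analyses map onto one word.
has_clitic_marker = any("~" in item for item in mor_items)

print(len(words), len(mor_items), has_clitic_marker)  # 10 11 False
```

The utterance has one token fewer than the %mor tier, and nothing in the %mor tier signals which two items should be merged.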
I had been refraining from implementing what you call the "best-effort" strategy, but because the data and its annotations are far from perfect, perhaps it's time for me to do so. :-) Stay tuned!
Thanks, that would be very helpful. Also, any idea why a standard format (e.g. JSON or YAML) isn't being adopted instead? It would be a one-time job to convert everything in TalkBank to JSON, and it could be done without loss of information.
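As a small illustration of the idea, the tiers of the failing utterance already round-trip through JSON unchanged, including the \x15 media-timestamp delimiters. This sketch covers only two tiers of one utterance copied from the traceback; whether the full CHAT format (headers, all dependent tiers) converts without loss is not shown here.

```python
import json

# Two tiers of the failing utterance, copied from the traceback.
tiers = {
    "ADU": "okay I think it;s time for our next game . \x151741207_1746675\x15",
    "%mor": (
        "adj|okay pro:sub|I v|think pro:per|it n:let|s n|time "
        "prep|for det:poss|our adj|next n|game ."
    ),
}

encoded = json.dumps(tiers, indent=2)  # control characters become \u0015
decoded = json.loads(encoded)
round_trip_ok = decoded == tiers

print(encoded)
```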