pylangacq ValueError: cannot align the utterance and %mor tiers (v2)

Similar error to https://github.com/jacksonllee/pylangacq/issues/23, but this issue persists even after 0.19.1.

Could we have a **"best-effort"** mode that, instead of crashing on the whole file, returns an error field for each failing utterance, so that we can at least use the partial data? (This is in addition to fixing this bug.)

Describe the bug

import pylangacq
file_cha = f"{paths.dir_talkbank_media}/fluency/UMD-CMU/Control/205DM_parent_y1.cha"
reader = pylangacq.read_chat(file_cha)  # raises ValueError (traceback below)

Relevant CHILDES or TalkBank data: fluency/UMD-CMU/Control/205DM_parent_y1.cha

Additional context

concurrent.futures.process._RemoteTraceback:
"""
Traceback (most recent call last):
  File "/home/timothee/.conda/envs/speakerid_cuda/lib/python3.10/concurrent/futures/process.py", line 246, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/home/timothee/.conda/envs/speakerid_cuda/lib/python3.10/concurrent/futures/process.py", line 205, in _process_chunk
    return [fn(*args) for args in chunk]
  File "/home/timothee/.conda/envs/speakerid_cuda/lib/python3.10/concurrent/futures/process.py", line 205, in <listcomp>
    return [fn(*args) for args in chunk]
  File "/home/timothee/.conda/envs/speakerid_cuda/lib/python3.10/site-packages/pylangacq/chat.py", line 1455, in _parse_chat_str
    utterances = self._get_utterances(all_tiers)
  File "/home/timothee/.conda/envs/speakerid_cuda/lib/python3.10/site-packages/pylangacq/chat.py", line 1510, in _get_utterances
    raise ValueError(
ValueError: cannot align the utterance and %mor tiers:
Tiers --
{'ADU': 'okay I think it;s time for our next game . \x151741207_1746675\x15', '%mor': 'adj|okay pro:sub|I v|think pro:per|it n:let|s n|time prep|for det:poss|our adj|next n|game .', '%gra': '1|3|LINK 2|3|SUBJ 3|0|ROOT 4|3|OBJ 5|6|MOD 6|4|POBJ 7|6|NJCT 8|10|DET 9|10|MOD 10|7|POBJ 11|3|PUNCT'}
Cleaned-up utterance --
okay I think it;s time for our next game .
Parsed %mor tier --
['adj|okay', 'pro:sub|I', 'v|think', 'pro:per|it', 'n:let|s', 'n|time', 'prep|for', 'det:poss|our', 'adj|next', 'n|game', '.']
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/pathto/talkbank_utils.py", line 363, in bug_D20240328T013624
    reader = pylangacq.read_chat(file_cha)
  File "/home/timothee/.conda/envs/speakerid_cuda/lib/python3.10/site-packages/pylangacq/chat.py", line 197, in wrapper
    return func(*args, **kwargs)
  File "/home/timothee/.conda/envs/speakerid_cuda/lib/python3.10/site-packages/pylangacq/chat.py", line 1887, in read_chat
    return cls.from_files([path], match=match, exclude=exclude, encoding=encoding)
  File "/home/timothee/.conda/envs/speakerid_cuda/lib/python3.10/site-packages/pylangacq/chat.py", line 197, in wrapper
    return func(*args, **kwargs)
  File "/home/timothee/.conda/envs/speakerid_cuda/lib/python3.10/site-packages/pylangacq/chat.py", line 1034, in from_files
    return cls.from_strs(strs, paths, parallel=parallel)
  File "/home/timothee/.conda/envs/speakerid_cuda/lib/python3.10/site-packages/pylangacq/chat.py", line 197, in wrapper
    return func(*args, **kwargs)
  File "/home/timothee/.conda/envs/speakerid_cuda/lib/python3.10/site-packages/pylangacq/chat.py", line 995, in from_strs
    reader._parse_chat_strs(strs, ids, parallel)
  File "/home/timothee/.conda/envs/speakerid_cuda/lib/python3.10/site-packages/pylangacq/chat.py", line 264, in _parse_chat_strs
    self._files = collections.deque(
  File "/home/timothee/.conda/envs/speakerid_cuda/lib/python3.10/concurrent/futures/process.py", line 575, in _chain_from_iterable_of_lists
    for element in iterable:
  File "/home/timothee/.conda/envs/speakerid_cuda/lib/python3.10/concurrent/futures/_base.py", line 621, in result_iterator
    yield _result_or_cancel(fs.pop())
  File "/home/timothee/.conda/envs/speakerid_cuda/lib/python3.10/concurrent/futures/_base.py", line 319, in _result_or_cancel
    return fut.result(timeout)
  File "/home/timothee/.conda/envs/speakerid_cuda/lib/python3.10/concurrent/futures/_base.py", line 458, in result
    return self.__get_result()
  File "/home/timothee/.conda/envs/speakerid_cuda/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
ValueError: cannot align the utterance and %mor tiers:
Tiers --
{'ADU': 'okay I think it;s time for our next game . \x151741207_1746675\x15', '%mor': 'adj|okay pro:sub|I v|think pro:per|it n:let|s n|time prep|for det:poss|our adj|next n|game .', '%gra': '1|3|LINK 2|3|SUBJ 3|0|ROOT 4|3|OBJ 5|6|MOD 6|4|POBJ 7|6|NJCT 8|10|DET 9|10|MOD 10|7|POBJ 11|3|PUNCT'}
Cleaned-up utterance --
okay I think it;s time for our next game .
Parsed %mor tier --
['adj|okay', 'pro:sub|I', 'v|think', 'pro:per|it', 'n:let|s', 'n|time', 'prep|for', 'det:poss|our', 'adj|next', 'n|game', '.']

Apr 09 '24 00:04 timotheecour

In this particular instance, the issue is a misalignment between it;s in the utterance (a typo for it's?) and pro:per|it n:let|s in the %mor tier. n:let|s is most likely simply wrong, possibly because the typo in the utterance threw off the automatic CHILDES morphosyntactic tagger that generates the %mor tier. Since pro:per|it n:let|s in %mor gives no indication whatsoever of a clitic or contraction, pylangacq has no way to parse this utterance correctly.
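To make the mismatch concrete, here is a minimal sketch that uses only the two strings from the error message. It splits on whitespace and compares each word with the part of the %mor item after the `|`. The helper `first_mismatch` is hypothetical and deliberately naive (a lemma often differs from its surface form), so this is not how pylangacq aligns tiers internally.

```python
# Strings copied from the "Cleaned-up utterance" and '%mor' parts of the error message.
utterance = "okay I think it;s time for our next game ."
mor = (
    "adj|okay pro:sub|I v|think pro:per|it n:let|s n|time "
    "prep|for det:poss|our adj|next n|game ."
)

words = utterance.split()
mor_items = mor.split()


def first_mismatch(words, mor_items):
    """Return the index of the first word that differs from its %mor lemma.

    Hypothetical, naive heuristic for illustration only: it assumes the text
    after "|" equals the surface word, which is not true in general.
    """
    for i, (word, item) in enumerate(zip(words, mor_items)):
        lemma = item.split("|", 1)[-1]
        if word != lemma:
            return i
    return None


print(len(words), len(mor_items))  # the two tiers differ in length
idx = first_mismatch(words, mor_items)
print(idx, words[idx], mor_items[idx:idx + 2])
```

The utterance has one token fewer than the %mor tier, and the first disagreement is at it;s, which the tagger split into two unrelated items.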

I had been refraining from implementing what you call the "best-effort" strategy, but because data and its annotations are far from perfect, perhaps it's time for me to do so. :-) Stay tuned!
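Until such a mode exists in the library, a user-side workaround is possible at file granularity (not per utterance, which is what the request asks for). The sketch below assumes you read files one at a time; `read_best_effort` and `fake_read` are hypothetical names introduced here, and `fake_read` only stands in for a real reader so the example runs without any data.

```python
from typing import Callable, Dict, Iterable, Tuple, TypeVar

T = TypeVar("T")


def read_best_effort(
    paths: Iterable[str], read_one: Callable[[str], T]
) -> Tuple[Dict[str, T], Dict[str, str]]:
    """Read each path separately; collect ValueError messages instead of stopping."""
    parsed: Dict[str, T] = {}
    errors: Dict[str, str] = {}
    for path in paths:
        try:
            parsed[path] = read_one(path)
        except ValueError as exc:
            # Keep the message so the failing file can be inspected later.
            errors[path] = str(exc)
    return parsed, errors


def fake_read(path: str) -> str:
    """Stand-in reader for demonstration: fails on one file, succeeds otherwise."""
    if "205DM" in path:
        raise ValueError("cannot align the utterance and %mor tiers")
    return f"parsed:{path}"


parsed, errors = read_best_effort(["ok.cha", "205DM_parent_y1.cha"], fake_read)
print(sorted(parsed))
print(errors)
```

With real data, passing `pylangacq.read_chat` as `read_one` should behave the same way, since the traceback above shows the `ValueError` being re-raised in the calling process. The whole failing file is still lost, so this is only a stopgap.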

Apr 09 '24 07:04 jacksonllee

Thanks, that would be very helpful. Also, any idea why a standard format (e.g. JSON or YAML) isn't being adopted instead? It would be a one-time job to convert everything from TalkBank to JSON, and it could be done without loss of information.
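As an illustration of what that could look like, here is a sketch that stores the failing utterance as one JSON record and checks that it round-trips. The schema (the keys `speaker`, `utterance`, `time_marks`, `tiers`) is invented for this example and is not an existing TalkBank or pylangacq format.

```python
import json

# Hypothetical schema; values are copied from the error message above.
record = {
    "speaker": "ADU",
    "utterance": "okay I think it;s time for our next game .",
    "time_marks": [1741207, 1746675],
    "tiers": {
        "%mor": (
            "adj|okay pro:sub|I v|think pro:per|it n:let|s n|time "
            "prep|for det:poss|our adj|next n|game ."
        ),
        "%gra": (
            "1|3|LINK 2|3|SUBJ 3|0|ROOT 4|3|OBJ 5|6|MOD 6|4|POBJ "
            "7|6|NJCT 8|10|DET 9|10|MOD 10|7|POBJ 11|3|PUNCT"
        ),
    },
}

encoded = json.dumps(record, indent=2)
decoded = json.loads(encoded)
print(decoded == record)  # the record survives encoding and decoding unchanged
```

Note that a change of container format would not fix this bug: the mismatch between it;s and its two %mor items would be carried over as-is.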

Apr 09 '24 17:04 timotheecour
