pylangacq

Error on reading Colaje corpus

Open gpirrotta opened this issue 2 years ago • 2 comments

Hi there! I was trying to load the colaje corpus, but got the following exception:

  File "/Users/giovanni/.pyenv/versions/3.9.7/lib/python3.9/concurrent/futures/process.py", line 243, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/Users/giovanni/.pyenv/versions/3.9.7/lib/python3.9/concurrent/futures/process.py", line 202, in _process_chunk
    return [fn(*args) for args in chunk]
  File "/Users/giovanni/.pyenv/versions/3.9.7/lib/python3.9/concurrent/futures/process.py", line 202, in <listcomp>
    return [fn(*args) for args in chunk]
  File "/Users/giovanni/.pyenv/versions/vs-code-python-env/lib/python3.9/site-packages/pylangacq/chat.py", line 1430, in _parse_chat_str
    utterances = self._get_utterances(all_tiers)
  File "/Users/giovanni/.pyenv/versions/vs-code-python-env/lib/python3.9/site-packages/pylangacq/chat.py", line 1477, in _get_utterances
    raise ValueError(
ValueError: cannot align the utterance and %mor tiers:
Tiers --
{'CHI': '+< la momie [/] la momie [/] la momie xx .\x152711210_2716463\x15', '%mor': 'det|la&FEM&SING det|la&FEM&SING det|la&FEM&SING n|momie .', '%pho': 'la momi la momi la momi X', '%act': 'CHI tourne sur elle même avec son jouet .'}
Cleaned-up utterance --
la la la momie xx .
"""

The above exception was the direct cause of the following exception:

ValueError                                Traceback (most recent call last)
/var/folders/n_/3wzv2cmd5bl17g580ys1rhlh0000gp/T/ipykernel_76765/3973281702.py in <module>
      1 url = "/Users/giovanni/projects/french-linguistics/data-colaje/colaje.zip"
----> 2 chat = pylangacq.read_chat(url,'ANAE')
.
.
.
.
ValueError: cannot align the utterance and %mor tiers:
Tiers --
{'CHI': '+< la momie [/] la momie [/] la momie xx .\x152711210_2716463\x15', '%mor': 'det|la&FEM&SING det|la&FEM&SING det|la&FEM&SING n|momie .', '%pho': 'la momi la momi la momi X', '%act': 'CHI tourne sur elle même avec son jouet .'}
Cleaned-up utterance --
la la la momie xx . 

It seems like the utterance and %mor tiers indeed don't align, but is there a way of reading the file anyways? Thanks for your help!

gpirrotta, Mar 17 '22 06:03

Hello, I've started looking into this. With the Colaje corpus data, I was able to reproduce the reported alignment issue. Part of the misalignment comes from xx and yy, which the current pylangacq v0.16.2 doesn't ignore but should. That would be a relatively easy fix, but I also encountered trickier issues, such as the definite articles le/la with the vowel elided. For instance, l'instant in an utterance is represented on the %mor tier as det|le/la&SING n|instant, which is linguistically correct but gives two %mor items to align against a single item from the utterance. In this case, I'd think pylangacq should ideally align l' to det|le/la&SING and then instant to n|instant to resolve the alignment issue. I'll need to dig into this more for implementation and its broader implications -- more updates to come.
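
To make the xx / yy part of the mismatch concrete, here is a minimal sketch in plain Python. It is not pylangacq's actual alignment code; the helper name `alignable_words` and the `ignore` set are hypothetical and only illustrate how the item counts from the traceback above compare.

```python
# Illustration only: NOT pylangacq's implementation.
# `alignable_words` and UNINTELLIGIBLE are hypothetical names.
UNINTELLIGIBLE = frozenset({"xx", "yy"})


def alignable_words(cleaned_utterance, ignore=frozenset()):
    """Split a cleaned-up utterance on whitespace, dropping ignored items."""
    return [word for word in cleaned_utterance.split() if word not in ignore]


# Values copied from the error message in this issue.
utterance = "la la la momie xx ."
mor = "det|la&FEM&SING det|la&FEM&SING det|la&FEM&SING n|momie ."

mor_items = mor.split()
strict = alignable_words(utterance)
lenient = alignable_words(utterance, ignore=UNINTELLIGIBLE)

print(len(strict), len(mor_items))   # 6 5 -> cannot align
print(len(lenient), len(mor_items))  # 5 5 -> one item per %mor entry
```

Ignoring xx is enough for this particular utterance, but it does not help with cases like l'instant, where the extra item is on the %mor side.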

jacksonllee, Mar 18 '22 04:03

OK, I'll stand by for updates. Thanks for your support!

gpirrotta, Mar 22 '22 13:03

Hello, apologies for the long silence. TL;DR -- I've looked into the Colaje corpus data. I've also just made a new release of pylangacq (v0.19.0) for some coincidentally related reasons (about clitics). Unfortunately, the Colaje corpus isn't compatible with how clitics are handled in the most recent CHAT data format, so pylangacq can't read it as-is.
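
If only the utterances are needed and not the morphology, one possible workaround is to strip the %mor tiers from the CHAT text before parsing. This is an untested sketch, not something confirmed in this thread: the helper `drop_tiers` is hypothetical, and it assumes that the reader accepts transcripts with no %mor tier and that dependent-tier continuation lines start with a tab.

```python
# Hypothetical pre-processing helper; not part of pylangacq.
def drop_tiers(chat_text, tiers=("%mor", "%gra")):
    """Return CHAT text with the given dependent tiers removed.

    Assumes each tier starts a line as "%name:" and that its
    continuation lines begin with a tab character.
    """
    prefixes = tuple(f"{tier}:" for tier in tiers)
    kept = []
    skipping = False
    for line in chat_text.splitlines():
        if line.startswith(prefixes):
            skipping = True   # start of a tier we want to drop
            continue
        if skipping and line.startswith("\t"):
            continue          # continuation of the dropped tier
        skipping = False
        kept.append(line)
    return "\n".join(kept) + "\n"


sample = (
    "@Begin\n"
    "*CHI:\tla momie xx .\n"
    "%mor:\tdet|la&FEM&SING n|momie .\n"
    "%pho:\tla momi X\n"
    "@End\n"
)
print(drop_tiers(sample))
```

The cleaned strings could then be handed to something like `pylangacq.Reader.from_strs`, at the cost of losing all %mor information for the corpus.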

For the long story -- I downloaded the "ANAE" portion of the Colaje corpus from this link, which also shows that its CHAT data files were last updated in 2016. Related to the treatment of clitics such as French le/la becoming l' before a vowel, which I brought up in my previous comment, the CHAT data format from TalkBank / CHILDES appears to have been updated recently: the %mor tiers now distinguish pre-clitics (delimited by $) from post-clitics (delimited by ~). For example, l'aiguille in an utterance corresponds to det|le-Def-Art$noun|aiguille&Fem on %mor (note the $ linking det|le-Def-Art for l' and noun|aiguille&Fem for aiguille; example from CHILDES). This CHAT update is what led to the latest release, pylangacq v0.19.0, for compatibility with the official TalkBank / CHILDES datasets. Unfortunately, it also means that the Colaje data prepared several years ago isn't compatible with pylangacq. Specifically, something like det|le/la&SING n|instant from Colaje's %mor for l'instant doesn't use the $ delimiter, so there's no way pylangacq could tell from %mor that there's a clitic requiring special treatment (and I didn't want to start adding language-specific code to pylangacq, which would open a can of worms). For pylangacq to read the Colaje data, the maintainers of the Colaje corpus would have to update the corpus to conform to the latest CHAT format.
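
The difference between the two annotation styles can be sketched in a few lines of plain Python. This is only an illustration of the delimiters described above, not how pylangacq parses %mor; `split_clitics` is a hypothetical helper.

```python
import re


# Hypothetical helper for illustration; not pylangacq's parser.
def split_clitics(mor_item):
    """Split one whitespace-delimited %mor item at $ (pre-clitic) or ~ (post-clitic)."""
    return re.split(r"[$~]", mor_item)


# Recent CHAT style: l'aiguille is ONE %mor item with an internal $ delimiter,
# so it still lines up with one word in the utterance.
new_style = "det|le-Def-Art$noun|aiguille&Fem"
print(new_style.split())         # ['det|le-Def-Art$noun|aiguille&Fem']
print(split_clitics(new_style))  # ['det|le-Def-Art', 'noun|aiguille&Fem']

# Colaje style: l'instant is annotated as TWO space-separated %mor items,
# with nothing marking the first one as a clitic.
colaje_style = "det|le/la&SING n|instant"
print(len(colaje_style.split()), len("l'instant".split()))  # 2 1
```

With the $ delimiter, a reader can keep one %mor item per utterance word and still recover the clitic's analysis; without it, the only signal is the count mismatch itself.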

jacksonllee, Dec 13 '23 06:12