
Removing words starting with 0

Open milamarcheva opened this issue 5 months ago • 3 comments

This pull request removes omitted words, annotated as 0word, because they were added by the annotator and are not part of the child's authentic produced speech.

  • [done] Add a concise title to this pull request on the GitHub web interface.

  • [done] Add a description in this box to describe what this pull request is about.

  • [ ] If code behavior is being updated (e.g., a bug fix), relevant tests should be added.

  • [ ] The CircleCI builds should pass, including both the code styling checks by black and flake8 as well as the test suite.

  • [ ] Add an entry to CHANGELOG.md at the repository's root level.

milamarcheva avatar Jan 10 '24 17:01 milamarcheva

Hello @milamarcheva, thank you for making this pull request! I haven't thought about how or whether to handle words annotated with a preceding 0, so this is a good opportunity to reflect on this.

Thank you for indicating the source data (CHILDES -> Biling -> Perez -> Shelia) for an instance of a 0-word. I've taken a look at the CHAT data file to see if there are clues to help decide what to do with these 0-words. I found the occurrence of "*CHI: I 0am done ." (utterance #136 in the data file), but this file has the transcribed utterances only and doesn't have dependent tiers such as %mor and %gra. I spot-checked other CHILDES datasets, and found a 0-word instance with %mor and %gra: https://sla.talkbank.org/TBB/childes/Eng-NA/Brown/Eve/010600a.cha, in utterance #2:

*MOT: you 0v more cookies ?
%mor: pro:per|you 0v|v qn|more n|cookie-PL ?
%gra: 1|2|SUBJ 2|0|ROOT 3|4|QUANT 4|2|OBJ 5|2|PUNCT

In this example, the "0v" in the utterance corresponds to "0v|v" in the %mor tier and to "2|0|ROOT" in %gra. This example suggests that although the 0-words aren't part of the produced speech or are inaudible somehow (as you've also pointed out), they still have a role in other annotation tiers. For pylangacq, an important goal is to correctly align the pieces across an utterance and its associated %mor and %gra tiers (if available) to create the parsed tokens. I see you're proposing to remove these 0-words in your commit, but given this example of "0v" from the American English Brown dataset, it would seem like pylangacq should not drop these 0-words, or else it wouldn't be able to align the utterance with the %mor and %gra tiers.
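To make the alignment concern concrete, here is a minimal sketch using the Eve example above. It is illustrative only and is not pylangacq's actual parsing code: the `align` helper is hypothetical, and it assumes that the utterance, %mor, and %gra items can simply be split on whitespace and paired by position.

```python
# Illustrative only: a simplified, position-based alignment of one utterance
# with its %mor and %gra tiers. This is NOT pylangacq's real implementation.

utterance = "you 0v more cookies ?"
mor = "pro:per|you 0v|v qn|more n|cookie-PL ?"
gra = "1|2|SUBJ 2|0|ROOT 3|4|QUANT 4|2|OBJ 5|2|PUNCT"


def align(utterance: str, mor: str, gra: str) -> list[tuple[str, str, str]]:
    """Pair each word with its %mor and %gra items, position by position."""
    words, mor_items, gra_items = utterance.split(), mor.split(), gra.split()
    if not (len(words) == len(mor_items) == len(gra_items)):
        raise ValueError(
            f"cannot align: {len(words)} words, "
            f"{len(mor_items)} %mor items, {len(gra_items)} %gra items"
        )
    return list(zip(words, mor_items, gra_items))


# With the 0-word kept, all three tiers have five items and line up.
aligned = align(utterance, mor, gra)

# Dropping the 0-word from the utterance alone leaves four words
# against five %mor and five %gra items, so the pairing fails.
without_zero = " ".join(w for w in utterance.split() if not w.startswith("0"))
try:
    align(without_zero, mor, gra)
    alignment_broke = False
except ValueError:
    alignment_broke = True
```

In this sketch, "0v" pairs with "0v|v" and "2|0|ROOT" as long as it stays in the utterance; removing it only from the main tier is what breaks the pairing.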

To think out loud a bit more -- If a code change is needed within pylangacq, what are the options? I see the following:

  1. Drop the 0-words as you've proposed, but the problem is that pylangacq wouldn't correctly align the utterances with %mor and %gra tiers, as explained above.
  2. Keep the 0-words, but just remove the "0"? Not good, since there would be no indication that these words either aren't in the actual speech or are inaudible.
  3. Keep the 0-words, but remove the "0" and find another way to indicate the non-existence of these words. But what way? Is this the purpose of the "0" in the first place?
  4. Do nothing. If these untreated 0-words affect a pylangacq user, then the user has to handle these 0-words on their own. For instance, if a user is interested in word count in general, then the 0-words slightly inflate the word count numbers, in which case the user could detect and subtract these 0-words.

Option 1 is a deal breaker for pylangacq. Options 2 and 3 don't make sense. So I'm leaning towards option 4 for no code change needed.
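As a rough sketch of what option 4 could look like on the user's side, the snippet below counts words with and without 0-words. It assumes the user already has a plain list of word strings (however obtained) and that a 0-word is a token starting with "0" followed by a letter; `is_zero_word` and `count_words` are hypothetical helper names, not part of pylangacq.

```python
def is_zero_word(word: str) -> bool:
    """Return True for tokens such as "0am" or "0v".

    Assumption: a 0-word is a leading "0" followed by a letter. A bare "0"
    or a number such as "007" is not treated as a 0-word here.
    """
    return len(word) > 1 and word[0] == "0" and word[1].isalpha()


def count_words(words: list[str], include_omitted: bool = True) -> int:
    """Count words, optionally excluding annotator-added 0-words."""
    if include_omitted:
        return len(words)
    return sum(1 for word in words if not is_zero_word(word))


# Words from the utterance "*CHI: I 0am done ." with punctuation left out.
words = ["I", "0am", "done"]

total = count_words(words)                            # counts "0am" too
produced = count_words(words, include_omitted=False)  # skips "0am"
omitted = [word for word in words if is_zero_word(word)]
```

Keeping the filter outside the library means the parsed tokens stay aligned with %mor and %gra, while users who care about produced speech only can subtract the 0-words themselves.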

Am I missing something? Let me know what you think, and thank you again for raising the issue!

jacksonllee avatar Jan 11 '24 02:01 jacksonllee

Dear Jackson,

Thank you very much for your response. I am a first-year PhD student and only recently discovered pylangacq, even though I have been working with CHILDES data for two years. First, I want to thank you for writing the library; it is very useful to my work.

It's fine if you think that the change I proposed does not fit with the goals of pylangacq. I found several other issues with how the annotations are cleaned:

  • +/ -- interruption, the current library leaves the slash in the processed string. I could attempt to fix that
  • [/] -- repetition; [//] -- retracing: the library removes any repeated words or phrases, but I think an option to leave them behind might be useful in some cases, when focusing on production

This is my first time attempting to contribute to an open source library. I was compelled to do it because the library is very useful to me, so I wanted to fix some cases of what I thought were issues. If my ideas do not fit the original concept of the library, that's fine.

Best, Mila

milamarcheva avatar Jan 12 '24 19:01 milamarcheva

Thank you, Mila, for using pylangacq and for your interest in contributing to it -- really appreciate it!

  • +/ -- interruption, the current library leaves the slash in the processed string. I could attempt to fix that

It looks like pylangacq doesn't handle +/ currently, but because CHILDES / TalkBank datasets are updated from time to time, it's possible that +/ is a new thing that pylangacq might need to deal with. May I know which CHILDES / TalkBank data files have occurrences of +/? Just wondering if I should take a quick look first before you start putting together another pull request.

  • [/] -- repetition; [//] -- retracing: the library removes any repeated words or phrases, but I think an option to leave them behind might be useful in some cases, when focusing on production

If you're interested in the original, unparsed utterance (with the repeated words retained, among other things), the utterance objects preserve the original tiers from CHAT data. Please let me know if it's not clear how to access the unparsed utterance line.

In case email (rather than the public GitHub platform here) is a preferred way to discuss these or any other questions/ideas you may have, I'm reachable at [email protected]

jacksonllee avatar Jan 12 '24 21:01 jacksonllee