ParlaMint-SI: additional metadata files for sentiment?

Open katjameden opened this issue 9 months ago • 2 comments

The planned new version of the ParlaMint-SI corpus will, in addition to sentence-level sentiment, also include sentiment annotations for whole utterances (i.e. speech-level sentiment).

This could be included in our metadata files (*-meta.tsv). However, since SI will be the only corpus containing this additional information, the other corpora would be missing this information in their metadata files (resulting in columns that would be empty in the other 28 corpora).

Would it be possible to add new metadata files focussing on the sentiment (e.g. ID, annotated element (u or s), sentiment class and numeric value for the sentiment)? This would in turn allow easier (pre-)processing of the corpus for further analyses/research, as the sentiment would be included as metadata and would eliminate the need to extract it from TEI.ana.

katjameden avatar Mar 24 '25 14:03 katjameden

This could be included in our metadata files (*-meta.tsv). However, since SI will be the only corpus containing this additional information, the other corpora would be missing this information in their metadata files (resulting in columns that would be empty in the other 28 corpora).

I think it is possible to add more columns to *-meta.tsv and leave the value empty for all the corpora for which this will only be done in the future. E.g., we have agenda, which is not present in other corpora.

But the question is whether it belongs in meta, as it is an annotation of the data: it does not describe the setting of the speech, but rather the content.

So, do we want to add another format?

  • We have *.txt, which is formatted as TSV (without column names), but I do not think the values belong there either.
  • Does it make sense to introduce another (two) tsv formats, one for u-level and one for s-level? sketch:
    • utterance id
    • element id (s or u)
    • orig id (reference to source sentence for english translation)
    • language
    • text
    • sentiment class
    • sentiment value
    • ?? some other possible numbers/stats
      • number of tokens
      • number of named entities
      • ...
    • ?? and in future when we will have audio alignment in TEI files
      • start time
      • end time
      • audio file ref
  • But if we introduce another format, will the rest of the corpora be without this format?

I have no strong opinion on that (yet).

matyaskopp avatar Mar 27 '25 07:03 matyaskopp

Yes, this was exactly my thinking, i.e. we introduce one more set of tsv files, called e.g. component-name.ana-meta.tsv and we add them to the ParlaMint-XX.conllu/ directories. Note that sentiment annotation is only added to TEI.ana files, and that all ParlaMint corpora will get s-level senti annotation in Parla-CAP.

The files should have a header row, and I'd suggest these are the columns:

  1. id
  2. element (s or u)
  3. language (can be several with u!)
  4. sentiment value (for u empty except for SI)
  5. sentiment 6 class (ditto)
  6. sentiment 3 class (ditto)
  7. number of sentences (always 1 for s element of course)
  8. number of words
  9. number of tokens
  10. number of named entities
  11. (maybe other numbers if it makes sense, e.g. number of UD PoSes or syntactic relations, all in one column like "NOUN:5 ADJ:3 ...")
  12. audio file ref (in future when we will have audio alignment in TEI files)
  13. audio start time (ditto)
  14. audio end time (ditto)
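
For illustration only, here is a minimal Python sketch of what writing such a file with a header row could look like. The header labels in `COLUMNS`, the `write_ana_meta` helper and the `ParlaMint-XX_example` ID are assumptions made for this sketch; the thread lists the fields but does not fix their names, and the actual files are meant to be produced by an XSLT script.

```python
import csv
import io

# Hypothetical header labels: the proposal names the fields, not the column headers.
COLUMNS = [
    "id", "element", "language",
    "senti_n", "senti_6", "senti_3",
    "sentences", "words", "tokens", "named_entities",
]

def write_ana_meta(rows, out):
    """Write rows (dicts keyed by COLUMNS) as TSV with a header row.

    Keys missing from a row are written as empty cells, e.g. u-level
    sentiment for corpora other than SI.
    """
    writer = csv.DictWriter(out, fieldnames=COLUMNS, delimiter="\t",
                            restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)

rows = [
    # Hypothetical utterance row from a non-SI corpus: sentiment cells stay empty.
    {"id": "ParlaMint-XX_example.ana.u1", "element": "u", "language": "xx"},
    # Hypothetical sentence row; the number of sentences is always 1 for s.
    {"id": "ParlaMint-XX_example.ana.seg1.1", "element": "s", "language": "xx",
     "senti_n": "4.26", "senti_6": "mixed positive", "senti_3": "Positive",
     "sentences": "1"},
]

buf = io.StringIO()
write_ana_meta(rows, buf)
tsv_text = buf.getvalue()
```

Leaving unavailable values as empty cells keeps the header identical across all corpora, so the files stay comparable even when only SI fills in the u-level sentiment.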

TomazErjavec avatar Mar 30 '25 14:03 TomazErjavec

Although most of the scripts for adding sentiment and topic to the corpora have already been made, this issue has not been addressed yet. I guess either @matyaskopp or I should write the script once we have the definitive list of columns for the files.

What I did, for now, is add the s-level sentiment score directly to the CoNLL-U files. It doesn't hurt and, in fact, with this we don't, strictly speaking, even need the envisaged extra TSVs, as the info is in CoNLL-U. Right now only s-level sentiment is encoded, but I guess (for SI) u-level could be added in the same way. The format is like this:

# newdoc id = ParlaMint-SI_2000-10-27-SDZ3-Redna-01.ana.u1
# newpar id = ParlaMint-SI_2000-10-27-SDZ3-Redna-01.ana.seg1
# lang = sl
# sent_id = ParlaMint-SI_2000-10-27-SDZ3-Redna-01.ana.seg1.1
# senti_3 = Positive
# senti_6 = mixed positive
# senti_n = 4.26
# text = Spoštovani gospod predsednik Republike Slovenije, gospod Milan Kučan!
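
As a rough sketch of how these sentence-level comments could be read back without going through TEI.ana, the following Python function collects `sent_id` and the `senti_*` values per sentence. The function name `read_sentence_sentiment` is hypothetical, and the sketch assumes that sentences are separated by blank lines, as is usual in CoNLL-U; the sample contains only the comment lines shown above, without token lines.

```python
def read_sentence_sentiment(conllu_text):
    """Collect sent_id and senti_* comment values for each sentence block."""
    records = []
    current = {}
    for line in conllu_text.splitlines():
        line = line.strip()
        if not line:
            # A blank line ends the current sentence block.
            if "sent_id" in current:
                records.append(current)
            current = {}
            continue
        if line.startswith("#") and "=" in line:
            key, value = line[1:].split("=", 1)
            key = key.strip()
            # Keep only the sentence ID and the sentiment comments.
            if key == "sent_id" or key.startswith("senti_"):
                current[key] = value.strip()
    # Flush the last block if the text does not end with a blank line.
    if "sent_id" in current:
        records.append(current)
    return records

sample = """\
# newdoc id = ParlaMint-SI_2000-10-27-SDZ3-Redna-01.ana.u1
# newpar id = ParlaMint-SI_2000-10-27-SDZ3-Redna-01.ana.seg1
# lang = sl
# sent_id = ParlaMint-SI_2000-10-27-SDZ3-Redna-01.ana.seg1.1
# senti_3 = Positive
# senti_6 = mixed positive
# senti_n = 4.26
# text = Spoštovani gospod predsednik Republike Slovenije, gospod Milan Kučan!
"""

records = read_sentence_sentiment(sample)
```

Token lines and other comments (`newdoc id`, `newpar id`, `lang`, `text`) are simply skipped here, so the utterance ID would have to be tracked separately if it is needed.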

TomazErjavec avatar May 16 '25 07:05 TomazErjavec

I guess either me or @matyaskopp should make the script

I had a look, and I guess I should do it, given that I made the parlamint2meta.xsl script, and this will be similar. It would still be nice if somebody commented on the list of fields that the file should have - except if it is perfect as it is!

TomazErjavec avatar May 16 '25 10:05 TomazErjavec

While adding the .ana TSV files has been done, a merge broke the call to the new parlamint2meta.ana.xsl script in the build process; this is to be corrected shortly (#903).

The current script has the first columns as above, i.e.

  1. id
  2. element (s or u)

Following a complaint by @nljubesi that it is difficult to determine the ID of the utterance that contains a particular sentence, I'd fix the script to also include the ID of the relevant superordinate element, which would be the ID of u for sentences and the ID of the TEI element (so, the filename) for utterances. The first one is a no-brainer, while the second one would seem redundant, as the TSV filename already gives the TEI ID, but it will allow all the TSV files to be concatenated while still retaining the information about which file each row belongs to.
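
To illustrate why the superordinate ID helps after concatenation, here is a small Python sketch. The column name `parent_id`, the helper `concat_tsv` and all IDs in the sample data are hypothetical; the sketch also assumes that every file has the same header row.

```python
import csv
import io

def concat_tsv(texts):
    """Concatenate TSV texts that share one header, writing the header once."""
    out = io.StringIO()
    writer = None
    for text in texts:
        reader = csv.DictReader(io.StringIO(text), delimiter="\t")
        if writer is None:
            # Take the column order from the first file.
            writer = csv.DictWriter(out, fieldnames=reader.fieldnames,
                                    delimiter="\t", lineterminator="\n")
            writer.writeheader()
        for row in reader:
            writer.writerow(row)
    return out.getvalue()

# Hypothetical content of two per-file TSVs: u rows point to the TEI
# document, s rows point to the utterance that contains them.
file_a = "id\telement\tparent_id\nDocA.u1\tu\tDocA\nDocA.seg1.1\ts\tDocA.u1\n"
file_b = "id\telement\tparent_id\nDocB.u1\tu\tDocB\nDocB.seg1.1\ts\tDocB.u1\n"

merged = concat_tsv([file_a, file_b])
merged_rows = list(csv.DictReader(io.StringIO(merged), delimiter="\t"))

# Sentence -> utterance and utterance -> document lookups survive the merge.
utterance_of = {r["id"]: r["parent_id"] for r in merged_rows if r["element"] == "s"}
document_of = {r["id"]: r["parent_id"] for r in merged_rows if r["element"] == "u"}
```

With both mappings available, the source file of any sentence can be recovered in two lookups, without relying on the TSV filename.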

TomazErjavec avatar Jun 04 '25 08:06 TomazErjavec

If the additional information makes it easier to distinguish between the IDs, I would support this addition, even if one of the IDs were somewhat redundant. I also think that the list of proposed fields looks good as-is.

katjameden avatar Jun 04 '25 09:06 katjameden