
Adding sentiment to corpora

TomazErjavec opened this issue · 13 comments

In ParlaCAP, sentiment scores and labels will be added to sentences and utterances, and this issue serves to document the needed changes:

  • how to encode sentiment
  • changes to the schema and documentation
  • changes to the conversion programs
  • changes to the registry files

Note that I have made a new milestone, ParlaCAP (and assigned this issue to it), which should be used for issues pertaining to the project. The discussion here should be relevant to @matyaskopp, @katjameden, @nljubesi.

TomazErjavec avatar Jan 05 '25 11:01 TomazErjavec

For SI we have already added sentiment to <u> and <s>, modified the schema and the conversion to the vertical file, and added a new sentiment taxonomy, currently local to SI (this is a draft and might well change, esp. the description part).

In short:

  • the sentiment label is encoded in u/@ana and s/@ana; it points to a category in the sentiment taxonomy via the senti prefix.
  • the sentiment score is encoded in u/@n and s/@n; this is not a very good solution, as @n is a very general attribute, but I currently do not have a better idea of how to preserve the score in the encoding.

E.g.
https://github.com/clarin-eric/ParlaMint/blob/6dd236d0d46c419d578371e634af53e5df1b8784/Samples/ParlaMint-SI/2007/ParlaMint-SI_2007-11-28-SDZ4-Izredna-30.ana.xml#L144
https://github.com/clarin-eric/ParlaMint/blob/6dd236d0d46c419d578371e634af53e5df1b8784/Samples/ParlaMint-SI/2007/ParlaMint-SI_2007-11-28-SDZ4-Izredna-30.ana.xml#L146
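In other words, the encoding looks roughly like this (the IDs and values below are illustrative only, not copied from the linked files):

<!-- illustrative sketch of the current @ana + @n encoding -->
<u xml:id="ParlaMint-SI_example.u1" ana="#chair senti:neuneg" n="2.47">
   <seg xml:id="ParlaMint-SI_example.seg1">
      <s xml:id="ParlaMint-SI_example.seg1.1" ana="senti:mixpos" n="4.08">...</s>
   </seg>
</u>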

The new ParlaMint-SI is also available via the concordancer for testing.

The documentation and other conversions (in particular, to TSV) still need to be implemented.

The problem right now is that I've added the new taxonomy to all the relevant programs that deal with taxonomies, but now the CI validation complains that this taxonomy is missing from all the corpora (except SI). @matyaskopp, how best to solve this? Make the taxonomy (somehow) optional, or (manually) insert the taxonomy, its XInclude and its prefixDef into all the samples? Or something else?
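For concreteness, inserting it manually would mean adding something like the following to each corpus root (the file name is just a placeholder):

<!-- placeholder file name -->
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude"
            href="ParlaMint-taxonomy-senti.ana.xml"/>

plus the prefix definition in listPrefixDef, so that senti: pointers resolve:

<prefixDef ident="senti" matchPattern="(.+)"
           replacementPattern="ParlaMint-taxonomy-senti.ana.xml#$1">
   <p>Private URIs with the senti prefix point to categories in the sentiment taxonomy.</p>
</prefixDef>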

TomazErjavec avatar Jan 05 '25 11:01 TomazErjavec

I was unhappy with using @n for the sentiment score, as @n is a very general attribute. I am now considering an alternative, namely using <measure/> for encoding everything to do with sentiment, so we would have e.g.

<u who="#BernikJožef"
   xml:id="ParlaMint-SI_2000-10-27-SDZ3-Redna-01.ana.u31" ana="#chair">
   <measure type="sentiment" quantity="2.47" ana="senti:neuneg"/>
   <seg xml:id="ParlaMint-SI_2000-10-27-SDZ3-Redna-01.ana.seg110">
      <s xml:id="ParlaMint-SI_2000-10-27-SDZ3-Redna-01.ana.seg110.1">
         <measure type="sentiment" quantity="4.08" ana="senti:mixpos"/>
         <w xml:id="ParlaMint-SI_2000-10-27-SDZ3-Redna-01.ana.seg110.1.1"
            msd="UPosTag=NOUN|Case=Nom|Gender=Fem|Number=Sing"
            ana="mte:Ncfsn"
            lemma="hvala">Hvala</w>

If I don't get any flames for this suggestion I will redo the SI corpus, schema and scripts to take into account this new encoding.

TomazErjavec avatar Mar 24 '25 15:03 TomazErjavec

Besides discussing how sentiment is to be encoded, I would like to touch upon which sentiment should be encoded.

Are you still of the position that we should encode

  • predicted sentiment on sentence level?
  • length-averaged sentiment on utterance level?

You know my position that the second one is of questionable usability (it tends to group around neutral sentiment due to the nature of the speeches, most parts being neutral, some positive, some negative, with the last two cancelling each other out).

I am kind-of ok with encoding both, but am still of the nasty opinion that it might not be smart to add utterance-level sentiment to concordancers. It might seem easier to use, but it does not encode most of the sentiment in the corpus and would therefore be much less informative for downstream analyses.

nljubesi avatar Mar 24 '25 15:03 nljubesi

Yes, we definitely want utterance-level sentiment for ParlaMint-SI. However, @katjameden did not use heuristics to determine u-level sentiment from the s-level one; rather, she used the manually annotated u-level sentiment for SI (with better agreement scores than s-level had!) to train a random forest model (your idea, actually, I think), and the prediction results are, I think, also better than the s-level ones. So.... (The downside of this approach is, of course, that we have u-level sentiment only for SI.)

Also, s-level sentiment will still be encoded in the corpus and concordancers, and people are free to use it if they prefer.

TomazErjavec avatar Mar 25 '25 09:03 TomazErjavec

> I was unhappy with using @n for the sentiment score, as @n is a very general attribute. I am now considering an alternative, namely using <measure/> for encoding everything to do with sentiment [...]
>
> If I don't get any flames for this suggestion I will redo the SI corpus, schema and scripts to take into account this new encoding.

Sorry, I missed this conversation; I do not like @n either.

Using measure is better, but I am still not sure whether it was originally meant to be used this way for annotations. I was not able to find a better solution for annotating various elements with a category and a number describing their affiliation with that category. I have two notes:

  • maybe add the attribute `corresp` to bind the measure explicitly to the annotated element
  • the scale of the numbers is not obvious: `quantity="2.47"` means nothing to me; if it is a percentage, then maybe also consider adding `unit`

matyaskopp avatar Mar 25 '25 11:03 matyaskopp

> Using measure is better, but I am still not sure whether it was originally meant to be used this way for annotations. I was not able to find a better solution for annotating various elements with a category and a number describing their affiliation with that category.

I agree that it was not really meant for this; on the other hand, I was not able to find any other solution either, so I guess we go with measure.

> I have two notes:
>
> * maybe add the attribute `corresp` to bind the measure explicitly to the annotated element

You mean like

<measure type="sentiment" quantity="2.47" ana="senti:neuneg" corresp="#ParlaMint-SI_2000-10-27-SDZ3-Redna-01.ana.u31"/>

We can of course do that, although I don't really see the point, given that the measure is contained inside the annotated element.

> * the scale of the numbers is not obvious: `quantity="2.47"` means nothing to me; if it is a percentage, then maybe also consider adding `unit`

I did consider it, but there is no obvious unit here; it's simply a scale from 0 to 5. I think this should be documented somewhere in the corpus TEI header.
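E.g. just a prose note in the encodingDesc (a sketch only; exact wording and placement still to be decided):

<encodingDesc>
   <!-- ... existing declarations ... -->
   <p>Sentiment scores given in measure/@quantity are on a continuous scale from 0 to 5.</p>
</encodingDesc>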

TomazErjavec avatar Mar 25 '25 11:03 TomazErjavec

> We can of course do that, although I don't really see the point, given that the measure is contained inside the annotated element.

I was thinking about a possible situation where we want to annotate w with a measure element. w does not allow a measure child, so it would be necessary to place the measure outside it, and then the corresp attribute would be a must.

I am saying this from my experience with anchor in the audio alignment (I did not use corresp in ParCzech, which I now regret).
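Something like this (the IDs are made up), with the measure placed next to the token it annotates:

<!-- made-up IDs, token-level annotation via @corresp -->
<s xml:id="example.s1">
   <w xml:id="example.s1.w1" lemma="hvala">Hvala</w>
   <measure type="sentiment" quantity="4.90" ana="senti:mixpos"
            corresp="#example.s1.w1"/>
</s>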

> I did consider it, but there is no obvious unit here; it's simply a scale from 0 to 5. I think this should be documented somewhere in the corpus TEI header.

An option is to use unitRef, but it would require a unit declaration (unitDecl). However, I am not sure I have the courage to introduce this, as I don't know what else it might require us to manage.
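I.e., roughly this (a sketch only; the xml:id and wording are invented), in the encodingDesc:

<unitDecl>
   <unitDef xml:id="sentiment-scale">
      <label>sentiment score</label>
      <desc>Continuous sentiment score on a scale from 0 to 5.</desc>
   </unitDef>
</unitDecl>

and then on the annotation itself:

<measure type="sentiment" quantity="2.47" unitRef="#sentiment-scale" ana="senti:neuneg"/>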

matyaskopp avatar Mar 25 '25 16:03 matyaskopp

OK, I changed the RNG + TEI schema to support measure and will also use @corresp; indeed, better safe than sorry. I will now also change the conversion to vertical so that this change is reflected there as well. Note that we have another issue about how to get the sentiment annotation into the metadata files, cf. #897.
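For the record, the schema change boils down to allowing an optional sentiment measure in u and s; in RELAX NG terms, roughly the following (illustrative only; the actual schema is generated from the ODD and organised differently):

<!-- illustrative pattern, not the actual ParlaMint schema -->
<define name="parlamint.sentiment" xmlns="http://relaxng.org/ns/structure/1.0">
   <optional>
      <element name="measure" ns="http://www.tei-c.org/ns/1.0">
         <attribute name="type"><value>sentiment</value></attribute>
         <attribute name="quantity"/>
         <attribute name="ana"/>
         <optional><attribute name="corresp"/></optional>
      </element>
   </optional>
</define>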

As for unitRef, wow, I wasn't even aware that this exists. It is tempting to use it, but, indeed, it might be more complicated than it seems; also, introducing a whole new section into the teiHeader just for one annotation might be excessive.

I'd prefer to ponder this a bit, since sentiment still needs to be introduced into the Guidelines anyway, as well as into the metadata, as mentioned above.

TomazErjavec avatar Mar 26 '25 10:03 TomazErjavec

> Yes, we definitely want utterance-level sentiment for ParlaMint-SI. However, @katjameden did not use heuristics to determine u-level sentiment from the s-level one; rather, she used the manually annotated u-level sentiment for SI (with better agreement scores than s-level had!) to train a random forest model (your idea, actually, I think), and the prediction results are, I think, also better than the s-level ones. So.... (The downside of this approach is, of course, that we have u-level sentiment only for SI.)
>
> Also, s-level sentiment will still be encoded in the corpus and concordancers, and people are free to use it if they prefer.

s-level sentiment is the only thing that I would trust. For SI, 1 of the 29 datasets, u-level sentiment is probably OK, but I still have reservations. For u-level sentiment in the other 28 datasets, I personally do not know how I would use it, except as a much weaker data-selection method than s-level sentiment. I simply do not see a viable use case in concordancers where u-level sentiment should be preferred to s-level sentiment.

I most strongly suggest that s-level sentiment be considered primary, and that u-level sentiment might "still be encoded" if you are sure enough that it will not be used except by people who know what they are doing with it (me still not seeing a viable use case).

@katjameden, is this maybe the perfect time for you to offer a corpus-based analysis of sentiment via different criteria? It would be very interesting to see how conclusions from s-level, u-level-machine-learning and u-level-heuristic sentiment compare. I think this would be a very decent chapter/section in your PhD. Maybe you have planned such a thing already?

nljubesi avatar Mar 26 '25 10:03 nljubesi

> s-level sentiment is the only thing that I would trust.

You repeat this a lot, but never actually say why, and I still don't understand what the problem with u-level is. Of course, if you derive u-level from s-level with heuristics, the results will be dubious, but @katjameden did ML over a hand-annotated u-level dataset, which is exactly what was done with s-level, so both should be about equally trustworthy.

> I simply do not see a viable use case in concordancers where u-level sentiment should be preferred to s-level sentiment.

noSkE is focused on text-level metadata, and a speech is one "text" in ParlaMint. It is not impossible to mix attributes from two structures in a query, but it is much easier to work with speeches only. The other reason is that a speech is a "natural" category that we work with all the time, rather than individual sentences, where it is a lot less intuitive what one is actually investigating.

@katjameden has now actually done a brief analysis comparing various u-level sentiment-dependent analyses with s-level ones. She might want to comment further, but the main insight is that both compare rather well in terms of differences and trends, but with s-level you get a lot more neutral sentiment. Which is, thinking about it, what one would expect: even in rather negative (or positive) speeches you will still have a lot of neutral sentences. From this I would conclude that it is actually better to use u-level, as extreme sentiment does not get buried in the neutral one.

TomazErjavec avatar Apr 03 '25 17:04 TomazErjavec

> @katjameden has now actually done a brief analysis comparing various u-level sentiment-dependent analyses with s-level ones. She might want to comment further, but the main insight is that both compare rather well in terms of differences and trends, but with s-level you get a lot more neutral sentiment. Which is, thinking about it, what one would expect: even in rather negative (or positive) speeches you will still have a lot of neutral sentences. From this I would conclude that it is actually better to use u-level, as extreme sentiment does not get buried in the neutral one.

Yes, this is essentially the main insight I was able to draw from this brief analysis, which was more or less adapted for sentence-level sentiment from my initial investigation of the utterance-level sentiment. However, I would like to pursue this a bit further, and am planning to compare the s- and u-level sentiment more directly in an extended analysis. Also, @nljubesi, thank you for your suggestion; I do agree this will be a good addition to the sentiment section of my PhD.

katjameden avatar Apr 04 '25 11:04 katjameden

We might want to discuss this in person after the next CLARIN.SI meeting as it seems to me that we do not understand each other. I am coming to this conclusion from your claim that s-level sentiment has more neutral sentiment. This cannot hold due to the averaging that is happening on the u-level, and averaging naturally pushes towards the centre (neutral sentiment). The numbers clearly show this (I calculated them to make sure that I am not losing it):

Bulgarian s-level: Counter({'Neutral': 858535, 'Negative': 459765, 'Positive': 308324})
Bulgarian u-level: Counter({'Neutral': 142251, 'Negative': 50720, 'Positive': 17046})

Slovenian s-level: Counter({'Neutral': 2098488, 'Negative': 1301786, 'Positive': 475909})
Slovenian u-level: Counter({'Neutral': 236414, 'Negative': 71828, 'Positive': 3105})

Croatian s-level: Counter({'Neutral': 1710557, 'Negative': 1609018, 'Positive': 892898})
Croatian u-level: Counter({'Neutral': 335491, 'Negative': 142359, 'Positive': 26484})

So neutral is prevalent in both views, but on u-level this tendency is twice as strong. But I must assume that we are talking about different things, so I think that a chat in person will be the most constructive approach here.

nljubesi avatar Apr 04 '25 12:04 nljubesi

> We might want to discuss this in person after the next CLARIN.SI meeting as it seems to me that we do not understand each other.

Indeed, let's do that.

> This cannot hold due to the averaging that is happening on the u-level

There is no averaging going on. For the u-level, the ML model was trained on manually annotated data.

TomazErjavec avatar Apr 04 '25 14:04 TomazErjavec

I think this has all been settled now, i.e. we know how to encode the sentiment, only SI has u-level sentiment while all the others have only s-level, and the prototype SI corpus has been mounted on the dev concordancer with no problems found there. The files with added sentiment should soon appear in the Samples directory. So, closing this, thank you all for your help!

TomazErjavec avatar May 16 '25 07:05 TomazErjavec