Questions About Statistical Significance Testing in TCAV

Open baselmousi opened this issue 3 years ago • 3 comments

❓ Questions and Help

Hello,

I was running statistical significance testing using the code defined in the TCAV for NLP sentiment analysis tutorial and I have two questions about the extract_scores and the get_pval functions. Here are my questions:

  1. Why do the sign_count and magnitude score tensors obtained for each layer contain two numeric elements? What does each number represent?
  2. When we run the get_pval function, we calculate P1 and P2 by extracting the scores with the idx variable set to either 0 or 1. So the only difference between P1 and P2 is which element of the sign_count tensors we select. What do P1 and P2 represent, and how is the p-value calculated from them?

Thanks for the help, it is really appreciated!

baselmousi avatar Jun 28 '22 11:06 baselmousi

Hi @basselmawzi thank you for the questions!

  1. In the example, each experimental set contains 2 concepts. That is why both tensors contain 2 elements: there is one sign_count value and one magnitude value for each of the 2 concepts.
  2. P1 is a list of scores (of the specified score_type) whose length is the number of experimental sets (5 in this case). Each experimental set consists of 2 concepts, where the 1st represents "positive" and the 2nd represents "neutral". P1 holds the scores for the "positive" concept across the experimental sets, and P2 holds the scores for the "neutral" concept across the same sets. You could think of the 5 experimental sets as separate runs of TCAV, with the same 2 concepts, on 5 different sets of data. The p-value is then calculated by testing whether the scores for positive and the scores for neutral come from the same distribution over those different sets of data.
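
As a rough illustration of the comparison described above, the sketch below runs a two-sample t-test on two lists of scores. The score values are made up for illustration and are not output from the tutorial; `p1`, `p2`, and the 0.05 threshold are assumed names and settings, and only `scipy.stats.ttest_ind` is a real library call.

```python
from scipy.stats import ttest_ind

# Hypothetical sign_count scores, one per experimental set (5 sets).
# These numbers are invented for illustration only.
p1 = [0.90, 0.85, 0.92, 0.88, 0.91]  # "positive" concept (idx = 0)
p2 = [0.10, 0.15, 0.08, 0.12, 0.09]  # "neutral" concept (idx = 1)

# Two-sided t-test: are the two sets of scores plausibly drawn
# from distributions with the same mean?
t_stat, p_val = ttest_ind(p1, p2)

# Assumed significance threshold; pick whatever level suits the analysis.
alpha = 0.05
is_significant = p_val < alpha
print(f"t = {t_stat:.2f}, p = {p_val:.2e}, significant = {is_significant}")
```

With scores this well separated the p-value is far below the threshold, so the two concepts would be judged to behave differently at that layer.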

99warriors avatar Jul 02 '22 13:07 99warriors

Thanks for the detailed reply @99warriors !

I understand what the code is doing, but I couldn't interpret it. I thought we run statistical significance testing to ensure that the concept has high TCAV scores across all experimental sets. If it has consistently high scores across all experimental sets, we'd accept it as a meaningful concept. Couldn't we just inspect the values in P1 and check that the scores are high and consistent? Why are we considering P2?

Also, are P1 and P2 always complementary?

Thanks for the help in advance.

baselmousi avatar Jul 18 '22 20:07 baselmousi

  1. @basselmawzi according to the official paper, the TCAV score can be summarized across multiple examples based on either the signs or the magnitudes of the TCAV scores for each individual example. We offer both sign-based and magnitude-based summarization of TCAV scores across all examples in the test batch. The official equation in the paper is sign-based, but we offer a magnitude-based option as well. You can check it out in the paper.
  2. Here we perform a two-sided t-test, as also described in the paper, to ensure that the distribution of TCAV scores for the positive concept is significantly different from that of the neutral concept. To perform the statistical significance test we need the distributions of both concepts, because the neutral concept could also get high TCAV scores, and we want to make sure that the neutral concept has a consistently low score across all examples in the experimental set. This is described in section 3.5, "Statistical significance testing", of the paper.
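
To make the two summarization styles from point 1 concrete, here is a minimal sketch in plain Python. It assumes each example yields one directional-derivative value per concept; the values, the helper names `sign_count_score` and `magnitude_score`, and the use of a plain mean for the magnitude summary are all assumptions for illustration, and Captum's own magnitude formula may differ between versions.

```python
# Hypothetical per-example directional derivatives for one layer,
# one list per concept in the experimental set (values are invented).
derivs = {
    "positive": [0.5, 0.25, -0.25, 1.0],
    "neutral": [-0.5, 0.25, -0.25, -1.0],
}

def sign_count_score(values):
    """Fraction of examples whose derivative is positive (sign-based)."""
    return sum(1 for v in values if v > 0) / len(values)

def magnitude_score(values):
    """Mean derivative over examples (an assumed magnitude-based summary)."""
    return sum(values) / len(values)

# One score per concept, which is why each result has two elements.
sign_count = [sign_count_score(v) for v in derivs.values()]
magnitude = [magnitude_score(v) for v in derivs.values()]

print("sign_count:", sign_count)
print("magnitude:", magnitude)
```

Repeating this for each experimental set and collecting the first element gives a list like P1, and collecting the second element gives a list like P2.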

Let us know if you have any questions.

NarineK avatar Jul 21 '22 01:07 NarineK

Okay, it's clear now.

baselmousi avatar Sep 01 '22 07:09 baselmousi