ontology-access-kit
Jaccard similarity differences when using information content
While running a semantic similarity experiment, I noticed that the Jaccard similarity scores of some term pairs change when I pass the --information-content-file option. I don't know why this happens, so I have documented the experiment below in case anyone would like to reproduce it. If anyone can explain these differences, I would appreciate it.
The first run did not use an information content file:
runoak -i semsimian:sqlite:phenio.db similarity -p i \
--set1-file hp_terms.txt \
--set2-file mp_terms.txt \
--min-jaccard-similarity 0.4 \
-O csv \
-o semsim_without_ic_file.tsv
The second run used the same parameters, adding only the --information-content-file option:
runoak -i semsimian:sqlite:phenio.db similarity -p i \
--set1-file hp_terms.txt \
--set2-file mp_terms.txt \
--min-jaccard-similarity 0.4 \
--information-content-file phenio_monarch_hp_mp_ic.tsv \
-O csv \
-o semsim_with_ic_file.tsv
The information content files for the HP and MP terms were generated separately and then merged into the single file passed above (phenio_monarch_hp_mp_ic.tsv):
runoak -i phenio.db -g gene_phenotype.9606.tsv -G hpoa_g2p information-content -p i i^HP: -o phenio_monarch_hp_ic.tsv
runoak -i phenio.db -g gene_phenotype.10090.tsv -G hpoa_g2p information-content -p i i^MP: -o phenio_mp_ic.tsv
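The merge step itself is not shown above, so here is a minimal sketch of one way it could be done. It assumes each IC file is a two-column TSV (term ID, IC value), possibly with a header row; `merge_ic_rows` and `merge_ic_files` are hypothetical helper names, not part of OAK.

```python
import csv
from typing import Iterable


def merge_ic_rows(*sources: Iterable[list[str]]) -> dict[str, float]:
    """Merge (term_id, ic) rows, refusing conflicting values for one term."""
    merged: dict[str, float] = {}
    for rows in sources:
        for row in rows:
            if len(row) < 2:
                continue  # skip blank or malformed lines
            term_id, value = row[0], row[1]
            try:
                ic = float(value)
            except ValueError:
                continue  # assumed to be a header row
            if term_id in merged and merged[term_id] != ic:
                raise ValueError(f"conflicting IC values for {term_id}")
            merged[term_id] = ic
    return merged


def merge_ic_files(paths: list[str], out_path: str) -> None:
    """Read each TSV, merge the rows, and write one two-column TSV."""
    sources = []
    for path in paths:
        with open(path, newline="") as handle:
            sources.append(list(csv.reader(handle, delimiter="\t")))
    merged = merge_ic_rows(*sources)
    with open(out_path, "w", newline="") as handle:
        csv.writer(handle, delimiter="\t").writerows(merged.items())
```

Calling `merge_ic_files(["phenio_monarch_hp_ic.tsv", "phenio_mp_ic.tsv"], "phenio_monarch_hp_mp_ic.tsv")` would produce the merged file. The conflict check is a design choice: if a term ever appeared in both inputs with different values, the merge would stop rather than silently keep one of them.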
Here is some exploratory analysis comparing the Jaccard similarity values from the two runs:
| property | semsim_without_ic | semsim_with_ic |
|---|---|---|
| count | 1,485,387 | 1,522,836 |
| mean | 0.44 | 0.44 |
| std | 0.03 | 0.03 |
| min | 0.40 | 0.40 |
| 25% | 0.41 | 0.41 |
| 50% | 0.43 | 0.43 |
| 75% | 0.46 | 0.46 |
| max | 0.70 | 0.70 |
Although the summary statistics other than the count match to two decimal places, 38,798 records differ in their Jaccard similarity values. To identify the most extreme differences, I selected the top 10 records. Of these 10, five show a higher Jaccard score when an external IC file is passed during the calculation, and five show a lower score.
| subject_id | object_id | jaccard_similarity_without_ic | jaccard_similarity_with_ic | difference |
|---|---|---|---|---|
| HP:0025477 | MP:0013304 | 0.416667 | 0.481481 | 15.56% |
| HP:0025477 | MP:0012070 | 0.416667 | 0.481481 | 15.56% |
| HP:0025477 | MP:0030485 | 0.416667 | 0.481481 | 15.56% |
| HP:0025477 | MP:0031348 | 0.416667 | 0.481481 | 15.56% |
| HP:0025477 | MP:0005422 | 0.416667 | 0.481481 | 15.56% |
| HP:0002514 | MP:0000783 | 0.465116 | 0.425532 | -9.30% |
| HP:0005671 | MP:0000783 | 0.454545 | 0.416667 | -9.09% |
| HP:0007045 | MP:0000783 | 0.454545 | 0.416667 | -9.09% |
| HP:0002514 | MP:0000787 | 0.5 | 0.458333 | -9.09% |
| HP:0005849 | MP:0000783 | 0.454545 | 0.416667 | -9.09% |
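For anyone reproducing the comparison, the following is a rough sketch of how the two output files could be joined on (subject_id, object_id). It assumes the outputs are comma-delimited (because of `-O csv`, despite the `.tsv` extension) and contain `subject_id`, `object_id`, and `jaccard_similarity` columns. `load_scores` and `compare_scores` are hypothetical helpers. The percentages in the table above appear consistent with dividing the change by the smaller of the two scores, so the sketch does the same.

```python
import csv

Pair = tuple[str, str]


def load_scores(path: str, delimiter: str = ",") -> dict[Pair, float]:
    """Map (subject_id, object_id) to jaccard_similarity for one output file."""
    with open(path, newline="") as handle:
        return {
            (row["subject_id"], row["object_id"]): float(row["jaccard_similarity"])
            for row in csv.DictReader(handle, delimiter=delimiter)
        }


def compare_scores(without_ic: dict[Pair, float],
                   with_ic: dict[Pair, float],
                   tolerance: float = 1e-9) -> dict[str, list]:
    """Report pairs whose score changed and pairs present in only one run."""
    changed = []
    for pair in without_ic.keys() & with_ic.keys():
        before, after = without_ic[pair], with_ic[pair]
        if abs(after - before) <= tolerance:
            continue
        smaller = min(before, after)
        # Percent change relative to the smaller score (assumed convention).
        percent = 100 * (after - before) / smaller if smaller > 0 else float("inf")
        changed.append((pair, before, after, percent))
    changed.sort(key=lambda item: abs(item[3]), reverse=True)
    return {
        "changed": changed,
        "only_without_ic": sorted(without_ic.keys() - with_ic.keys()),
        "only_with_ic": sorted(with_ic.keys() - without_ic.keys()),
    }


# Two rows taken from the table above.
without_ic = {("HP:0025477", "MP:0013304"): 0.416667,
              ("HP:0002514", "MP:0000783"): 0.465116}
with_ic = {("HP:0025477", "MP:0013304"): 0.481481,
           ("HP:0002514", "MP:0000783"): 0.425532}
for pair, before, after, percent in compare_scores(without_ic, with_ic)["changed"]:
    print(pair, before, after, f"{percent:.2f}%")
```

The `only_without_ic` and `only_with_ic` lists are worth checking as well: the two runs have different record counts, so some pairs presumably cross the 0.4 threshold in one run but not the other.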
Very nice ticket, subscribing with interest to the thread.
Certainly strange and unexpected. Is the behavior reproducible with a smaller set of terms? Or rather, does it happen when you use the basic OAK semsim implementation rather than semsimian? I ask because I didn't think the semsimian implementation did anything with the information-content-file input; the semsim interface will cache the provided values here (https://github.com/INCATools/ontology-access-kit/blob/aef85c609d72da7fb72cb463d846c83d4d9664fd/src/oaklib/interfaces/semsim_interface.py#L224-L228)
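To make the expectation concrete, here is a minimal sketch assuming the score is a plain set Jaccard over the two terms' ancestor sets, which is the usual definition. The ancestor sets and IDs below are hypothetical, and `as_reduced_fraction` is an illustrative helper. No information content value enters this formula, which is why a change in the score is surprising.

```python
from fractions import Fraction


def jaccard(ancestors_a: set[str], ancestors_b: set[str]) -> float:
    """Plain set Jaccard: |A intersect B| / |A union B|."""
    union = ancestors_a | ancestors_b
    if not union:
        return 0.0
    return len(ancestors_a & ancestors_b) / len(union)


def as_reduced_fraction(score: float, max_denominator: int = 100) -> Fraction:
    """Recover a small-denominator fraction close to a rounded score."""
    return Fraction(score).limit_denominator(max_denominator)


# Hypothetical ancestor sets for two terms.
a = {"X:1", "X:2", "X:3", "X:4"}
b = {"X:2", "X:3", "X:5"}
print(jaccard(a, b))  # 2 shared out of 5 in total -> 0.4

# Scores reported in the table above, without and with the IC file.
for without_ic, with_ic in [(0.416667, 0.481481), (0.465116, 0.425532)]:
    print(as_reduced_fraction(without_ic), "->", as_reduced_fraction(with_ic))
# prints: 5/12 -> 13/27
#         20/43 -> 20/47
```

If the scores are set ratios, the changed fractions would suggest that the ancestor sets themselves differ between the two runs. A reduced fraction cannot distinguish 5/12 from 10/24, though, so this is only a hint about where to look.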