Jaccard similarity differences when using information content

Open souzadevinicius opened this issue 1 year ago • 2 comments

While running a semantic similarity experiment, I noticed that the Jaccard similarity scores of some records change when I pass the --information-content-file option. I don't know why this happens, so I have documented the experiment below in case anyone wants to reproduce it. I would appreciate any explanation of these differences.

The first run did not use an information content file:

runoak -i semsimian:sqlite:phenio.db similarity -p i \
--set1-file hp_terms.txt \
--set2-file mp_terms.txt \
--min-jaccard-similarity 0.4 \
-O csv \
-o semsim_without_ic_file.tsv

Next, I used the same parameters and added only the --information-content-file option:

runoak -i semsimian:sqlite:phenio.db similarity -p i \
--set1-file hp_terms.txt \
--set2-file mp_terms.txt \
--min-jaccard-similarity 0.4 \
--information-content-file  phenio_monarch_hp_mp_ic.tsv \
-O csv \
-o semsim_with_ic_file.tsv

The information content files for HP and MP terms were generated separately with the commands below, then merged into one final file:

runoak -i phenio.db -g gene_phenotype.9606.tsv -G hpoa_g2p information-content -p i i^HP: -o phenio_monarch_hp_ic.tsv
runoak -i phenio.db -g gene_phenotype.10090.tsv -G hpoa_g2p information-content -p i i^MP: -o phenio_mp_ic.tsv
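The merge step itself is not shown above. A minimal sketch of one way to do it, assuming (this is not confirmed in the post) that both files are plain line-oriented tables and that a simple concatenation was intended. The function name `merge_ic_files` is hypothetical:

```python
from pathlib import Path


def merge_ic_files(paths, out_path):
    """Concatenate IC tables, keeping the first copy of each distinct line.

    Assumption: each input is a line-oriented text table. Dropping exact
    duplicate lines also collapses a repeated header row, if there is one.
    A term listed in both files with *different* values is kept twice,
    so check for that case separately.
    """
    seen = set()
    merged = []
    for path in paths:
        for line in Path(path).read_text().splitlines():
            if line.strip() and line not in seen:
                seen.add(line)
                merged.append(line)
    Path(out_path).write_text("\n".join(merged) + "\n")
    return len(merged)
```

For the files in this post, the call would look like `merge_ic_files(["phenio_monarch_hp_ic.tsv", "phenio_mp_ic.tsv"], "phenio_monarch_hp_mp_ic.tsv")`, assuming the merged file is the one passed to --information-content-file.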

Here is some exploratory analysis comparing the Jaccard similarity scores from the two runs:

property  semsim_without_ic  semsim_with_ic
count          1,485,387.00    1,522,836.00
mean                   0.44            0.44
std                    0.03            0.03
min                    0.40            0.40
25%                    0.41            0.41
50%                    0.43            0.43
75%                    0.46            0.46
max                    0.70            0.70

Although the mean, standard deviation, and percentiles agree to two decimal places, 38,798 records differ in their Jaccard similarity values. To identify the most extreme differences, I selected the top 10 records. Of these 10, five showed an increase in the Jaccard score when an external IC file was passed during calculation, and five showed a decrease.

subject_id  object_id   jaccard_similarity_without_ic  jaccard_similarity_with_ic  difference
HP:0025477  MP:0013304  0.416667                       0.481481                    15.56%
HP:0025477  MP:0012070  0.416667                       0.481481                    15.56%
HP:0025477  MP:0030485  0.416667                       0.481481                    15.56%
HP:0025477  MP:0031348  0.416667                       0.481481                    15.56%
HP:0025477  MP:0005422  0.416667                       0.481481                    15.56%
HP:0002514  MP:0000783  0.465116                       0.425532                    -9.30%
HP:0005671  MP:0000783  0.454545                       0.416667                    -9.09%
HP:0007045  MP:0000783  0.454545                       0.416667                    -9.09%
HP:0002514  MP:0000787  0.5                            0.458333                    -9.09%
HP:0005849  MP:0000783  0.454545                       0.416667                    -9.09%
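For anyone reproducing the comparison, here is a rough sketch of how the two output files could be joined and ranked. It assumes the output columns are named `subject_id`, `object_id` and `jaccard_similarity`, and that the delimiter is a comma (the commands pass `-O csv` but write to a `.tsv` name, so check the files). The percentages in the table appear consistent with dividing the change by the smaller of the two scores, so the sketch uses that convention. The function names are hypothetical.

```python
import csv


def load_scores(path, delimiter=","):
    """Map (subject_id, object_id) -> jaccard_similarity for one output file.

    Assumption: the file has a header row with these three column names.
    """
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter=delimiter)
        return {
            (row["subject_id"], row["object_id"]): float(row["jaccard_similarity"])
            for row in reader
        }


def compare_scores(without_ic, with_ic, tolerance=1e-6):
    """Return pairs whose scores differ, plus pairs present in only one run."""
    shared = without_ic.keys() & with_ic.keys()
    changed = []
    for pair in shared:
        a, b = without_ic[pair], with_ic[pair]
        if abs(a - b) > tolerance:
            # Relative change against the smaller score (assumed convention).
            # Scores are assumed positive, as with a 0.4 minimum threshold.
            changed.append((pair, a, b, (b - a) / min(a, b)))
    # Largest absolute relative change first.
    changed.sort(key=lambda item: abs(item[3]), reverse=True)
    only_one_run = without_ic.keys() ^ with_ic.keys()
    return changed, only_one_run
```

The second return value is worth inspecting too, since the two runs report different record counts (1,485,387 vs 1,522,836).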

souzadevinicius, May 01 '24 11:05

Very nice ticket, subscribing with interest to the thread.

matentzn, May 01 '24 12:05

Certainly strange and unexpected. Is the behavior reproducible with a smaller set of terms? Or rather, does it happen when you use the basic OAK semsim implementation rather than semsimian? I ask because I didn't think the semsimian implementation did anything with the information-content-file input; the semsim interface will cache the provided values here (https://github.com/INCATools/ontology-access-kit/blob/aef85c609d72da7fb72cb463d846c83d4d9664fd/src/oaklib/interfaces/semsim_interface.py#L224-L228)

caufieldjh, May 01 '24 18:05
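As background for why the result is unexpected: under the textbook definition, Jaccard similarity depends only on the two sets being compared (here, presumably each term's ancestor set), and information content is not an input. The sketch below illustrates that definition with made-up term IDs. Whether semsimian or OAK computes the score exactly this way is an assumption, not something confirmed in this thread.

```python
def jaccard(ancestors_a, ancestors_b):
    """Jaccard similarity of two sets: |A ∩ B| / |A ∪ B|."""
    a, b = set(ancestors_a), set(ancestors_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Toy ancestor sets (hypothetical term IDs, not taken from PHENIO).
term_1 = {"X:1", "X:2", "X:3", "X:4"}
term_2 = {"X:3", "X:4", "X:5"}

# 2 shared ancestors out of 5 distinct ancestors.
score = jaccard(term_1, term_2)

# Adding one unshared ancestor to either set lowers the score:
# 2 shared ancestors out of 6 distinct ancestors.
score_extra = jaccard(term_1 | {"X:6"}, term_2)
```

Under this definition, a changed score for the same pair implies that the sets being compared changed between runs, which may be a useful thing to check when narrowing down the cause.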