ontology-access-kit Jaccard similarity differences when using information content

While computing semantic similarity, I noticed that the Jaccard similarity scores of some term pairs change when I pass the --information-content-file option. I don't know why this happens, so I have documented the experiment below in case anyone wants to reproduce it. I would appreciate any explanation for these differences.

The first run did not use an information content file:

runoak -i semsimian:sqlite:phenio.db similarity -p i \
--set1-file hp_terms.txt \
--set2-file mp_terms.txt \
--min-jaccard-similarity 0.4 \
-O csv \
-o semsim_without_ic_file.tsv

Next, I used the same parameters, adding only the --information-content-file option:

runoak -i semsimian:sqlite:phenio.db similarity -p i \
--set1-file hp_terms.txt \
--set2-file mp_terms.txt \
--min-jaccard-similarity 0.4 \
--information-content-file phenio_monarch_hp_mp_ic.tsv \
-O csv \
-o semsim_with_ic_file.tsv

The information content files for the HP and MP terms were generated separately with the commands below and then merged into a single file:

runoak -i phenio.db -g gene_phenotype.9606.tsv -G hpoa_g2p information-content -p i i^HP: -o phenio_monarch_hp_ic.tsv

runoak -i phenio.db -g gene_phenotype.10090.tsv -G hpoa_g2p information-content -p i i^MP: -o phenio_mp_ic.tsv
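
The merge step itself is not shown above. A minimal sketch of one way to do it, assuming each IC file is a tab-separated table with identical columns and (optionally) one header line; the function name merge_ic_files and the has_header flag are hypothetical, not part of OAK:

```python
from pathlib import Path


def merge_ic_files(inputs, output, has_header=True):
    """Concatenate per-ontology IC tables into one file.

    Assumption: every input is a tab-separated table with the same columns.
    If has_header is True, the first line of each file is treated as a
    header and is kept only from the first file.
    Returns the number of lines written (header included).
    """
    written = 0
    with open(output, "w", encoding="utf-8") as out:
        for index, path in enumerate(inputs):
            lines = Path(path).read_text(encoding="utf-8").splitlines()
            if has_header and index > 0:
                lines = lines[1:]  # drop the repeated header
            for line in lines:
                if line.strip():  # skip blank lines
                    out.write(line + "\n")
                    written += 1
    return written
```

For the files above, the call would be merge_ic_files(["phenio_monarch_hp_ic.tsv", "phenio_mp_ic.tsv"], "phenio_monarch_hp_mp_ic.tsv"). Whether the IC files actually carry a header line should be checked first; if they do not, pass has_header=False.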

Here is some exploratory analysis comparing the Jaccard similarity values from the two runs:

property	semsim_without_ic	semsim_with_ic
count	1,485,387.00	1,522,836.00
mean	0.44	0.44
std	0.03	0.03
min	0.40	0.40
25%	0.41	0.41
50%	0.43	0.43
75%	0.46	0.46
max	0.70	0.70

Although the summary statistics match to two decimal places (apart from the record count), 38,798 records differ in their Jaccard similarity values. To identify the most extreme cases, I selected the 10 records with the largest differences. Five of them showed an increase in the Jaccard score when an external IC file was passed during calculation, and five showed a decrease.

subject_id	object_id	jaccard_similarity_without_ic	jaccard_similarity_with_ic	difference
HP:0025477	MP:0013304	0.416667	0.481481	15.56%
HP:0025477	MP:0012070	0.416667	0.481481	15.56%
HP:0025477	MP:0030485	0.416667	0.481481	15.56%
HP:0025477	MP:0031348	0.416667	0.481481	15.56%
HP:0025477	MP:0005422	0.416667	0.481481	15.56%
HP:0002514	MP:0000783	0.465116	0.425532	-9.30%
HP:0005671	MP:0000783	0.454545	0.416667	-9.09%
HP:0007045	MP:0000783	0.454545	0.416667	-9.09%
HP:0002514	MP:0000787	0.5	0.458333	-9.09%
HP:0005849	MP:0000783	0.454545	0.416667	-9.09%
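
For anyone who wants to reproduce the comparison, here is a minimal sketch of how the two output files can be joined and diffed. It assumes both files have subject_id, object_id and jaccard_similarity columns and are tab-delimited (the delimiter is a parameter in case the csv output is comma-separated); load_scores and compare_scores are hypothetical helper names. It also lists pairs that appear in only one file, which could happen when a score crosses the 0.4 threshold in one run but not the other.

```python
import csv


def load_scores(path, delimiter="\t"):
    """Map (subject_id, object_id) -> jaccard_similarity for one output file.

    Assumption: the file has a header row with these three column names.
    """
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle, delimiter=delimiter)
        return {
            (row["subject_id"], row["object_id"]): float(row["jaccard_similarity"])
            for row in reader
        }


def compare_scores(without_ic, with_ic, tolerance=1e-6):
    """Return (changed, only_without, only_with).

    changed holds (pair, score_without, score_with, signed_difference)
    for pairs present in both runs whose scores differ by more than
    tolerance, largest absolute change first.
    """
    shared = without_ic.keys() & with_ic.keys()
    changed = [
        (pair, without_ic[pair], with_ic[pair], with_ic[pair] - without_ic[pair])
        for pair in shared
        if abs(with_ic[pair] - without_ic[pair]) > tolerance
    ]
    # Sort by absolute change (descending), then by pair for a stable order.
    changed.sort(key=lambda item: (-abs(item[3]), item[0]))
    only_without = sorted(without_ic.keys() - with_ic.keys())
    only_with = sorted(with_ic.keys() - without_ic.keys())
    return changed, only_without, only_with
```

With the files from this experiment, the inputs would be load_scores("semsim_without_ic_file.tsv") and load_scores("semsim_with_ic_file.tsv"). The sketch reports the signed absolute difference only; it does not try to reproduce the percentage column in the table above.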

May 01 '24 11:05 souzadevinicius

Very nice ticket, subscribing with interest to the thread.

May 01 '24 12:05 matentzn

Certainly strange and unexpected. Is the behavior reproducible with a smaller set of terms? Or rather, does it happen when you use the basic OAK semsim implementation rather than semsimian? I ask because I didn't think the semsimian implementation did anything with the information-content-file input; the semsim interface will cache the provided values here (https://github.com/INCATools/ontology-access-kit/blob/aef85c609d72da7fb72cb463d846c83d4d9664fd/src/oaklib/interfaces/semsim_interface.py#L224-L228)

May 01 '24 18:05 caufieldjh
