PyGenePlexus
Get list of genes that are important for similarity tables
Try to figure out which genes from the user model were the most important to a given term in the similarity tables.
Some thoughts on this from Slack
Given the two models with coefficients [a_1, a_2, …, a_p] and [b_1, b_2, …, b_p] (where p is the number of embedding dimensions), one trained on the input and the other for the 'similar' term, a simple way to do this might be the following: given each gene's embedding vector [x_1, x_2, …, x_p], calculate a score for the gene as the cosine similarity between its embedding vector and the vector [a_1+b_1, a_2+b_2, …, a_p+b_p], rank genes by this score, and report the top few.
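A minimal sketch of the scoring idea above, assuming the gene embeddings are available as an (n_genes, p) NumPy array and both models' coefficients as length-p vectors. The function name `rank_genes_by_combined_model` and its parameters are hypothetical and not part of the PyGenePlexus API; it only illustrates the cosine-similarity ranking and makes no claim about how the package stores models or embeddings.

```python
import numpy as np


def rank_genes_by_combined_model(embeddings, coef_a, coef_b, gene_ids, top_k=10):
    """Rank genes by cosine similarity to the summed coefficient vector.

    embeddings : array-like, shape (n_genes, p), one embedding row per gene
    coef_a     : length-p coefficients of the user (input) model
    coef_b     : length-p coefficients of the 'similar' term's model
    gene_ids   : sequence of n_genes identifiers, aligned with the rows
    """
    X = np.asarray(embeddings, dtype=float)
    # Combined direction [a_1+b_1, ..., a_p+b_p]
    combined = np.asarray(coef_a, dtype=float) + np.asarray(coef_b, dtype=float)

    # Cosine similarity = dot product / (product of the two norms)
    dots = X @ combined
    norms = np.linalg.norm(X, axis=1) * np.linalg.norm(combined)
    # Genes with a zero-norm embedding (or a zero combined vector) get score 0
    scores = np.divide(dots, norms, out=np.zeros(X.shape[0]), where=norms > 0)

    # Highest score first; stable sort keeps input order among ties
    order = np.argsort(-scores, kind="stable")[:top_k]
    return [(gene_ids[i], float(scores[i])) for i in order]
```

Cosine similarity ignores the magnitude of each embedding, so the ranking reflects only how well a gene's direction aligns with the combined model, which may or may not be what is wanted here.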
However, some things to think about:
- If a term is not similar, would the top genes then be confusing?
- How should the genes be scored (more z-scores?)? Presenting the top ten might not mean much.