
Get a list of the genes that are important for the similarity tables

Open ChristopherMancuso opened this issue 5 months ago • 0 comments

Try to figure out which genes from the user model were the most important to a given term in the similarity table.

Some thoughts on this from Slack:

Given the two models with coefficients [a_1, a_2, …, a_p] and [b_1, b_2, …, b_p] (where p is the number of embedding dimensions), one trained on the input and the other for the ‘similar’ term, a simple way to do this might be the following: given each gene’s embedding vector [x_1, x_2, …, x_p], calculate a score for the gene as the cosine similarity between its embedding vector and the vector [a_1+b_1, a_2+b_2, …, a_p+b_p], rank the genes by this score, and report the top few.
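A minimal sketch of that scoring idea, assuming the gene embeddings are available as an (n_genes × p) NumPy array and the two coefficient vectors have length p. The function name `rank_genes_by_combined_model` and its arguments are hypothetical and not part of PyGenePlexus:

```python
import numpy as np

def rank_genes_by_combined_model(embeddings, coef_a, coef_b, gene_ids, top_k=10):
    """Rank genes by cosine similarity to the summed coefficient vector.

    Hypothetical helper: embeddings is (n_genes, p); coef_a and coef_b
    are the length-p coefficient vectors of the two models.
    """
    X = np.asarray(embeddings, dtype=float)
    # Combined direction [a_1 + b_1, ..., a_p + b_p]
    combined = np.asarray(coef_a, dtype=float) + np.asarray(coef_b, dtype=float)

    # Product of norms for each gene; zero if either vector is all zeros
    norms = np.linalg.norm(X, axis=1) * np.linalg.norm(combined)

    # Cosine similarity, leaving the score at 0 where the norm is 0
    scores = np.divide(X @ combined, norms,
                       out=np.zeros(X.shape[0]), where=norms > 0)

    # Indices of the highest-scoring genes, best first
    order = np.argsort(-scores)[:top_k]
    return [(gene_ids[i], float(scores[i])) for i in order]
```

Calling it with the embedding matrix, both coefficient vectors, and the matching gene IDs returns the `top_k` (gene, score) pairs. Genes with an all-zero embedding get a score of 0 here, which is an arbitrary choice for the sketch.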

However, some things to think about:

  1. If a term is not similar, would its top genes be confusing?
  2. How to get scores for the genes (more z-scores?). Presenting only the top ten might not mean much.
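One possible reading of the "more z-scores" idea, sketched under the assumption that each gene's score is standardized against the scores of all genes for the same term. The function name `zscore_gene_scores` is hypothetical, and this is only one of several ways the scores could be put on a comparable scale:

```python
import numpy as np

def zscore_gene_scores(scores):
    """Standardize per-gene scores against the full set of genes.

    Hypothetical helper: returns (score - mean) / std, or all zeros
    when every score is identical.
    """
    s = np.asarray(scores, dtype=float)
    std = s.std()  # population standard deviation over all genes
    if std == 0:
        return np.zeros_like(s)
    return (s - s.mean()) / std
```

A z-score shows how far a gene sits from the bulk of genes for that term, which could help flag cases where even the top ten are not notably high.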

ChristopherMancuso, Sep 12 '24 14:09