Feature importance of UMAP output

Open dangkunal opened this issue 3 years ago • 3 comments

Hi,

I am learning about visualizing multidimensional data and came across UMAP and t-SNE. Is there any way to also get the feature importance of the output?

By feature importance I mean which variables contribute most to the UMAP output. I know my question might be ill-posed, but I am curious and still learning.

Thanks, Kunal

dangkunal avatar Oct 12 '20 20:10 dangkunal

I think a way that we could rethink this would be as "sensitivity" rather than "importance". In other words, if we were to define "feature importance" as an answer to a question, that question might be: "How sensitive is the UMAP projection to fluctuations in the respective dimensions of the data space?" Here's one way we might answer this (which would be pretty heavy to compute, but could be interesting if you need it):

  1. Fit a UMAP embedding to your full data. We'll call this the "canonical" projection
  2. Pick a column to calculate feature importance on
  3. Randomly shuffle the values in this column. Call the dataset with column i shuffled D_i
  4. Fit a new UMAP embedding to D_i. For stability, we probably want to use some variation of the AlignedUMAP feature coming in 0.5.0
  5. Calculate some summary statistic to quantify distance between these two embedding spaces. I'm thinking maybe earth mover's distance?
  6. Reset the column to its unshuffled state. Rinse and repeat for all columns.

The distance calculated in step 5 then gives us an approximate measure for how sensitive the topology of the canonical embedding space is to changes in that particular dimension, which I posit is roughly what you're looking for in a "feature importance" measure here.

One potential problem I'm foreseeing here is the application of the AlignedUMAP. On the one hand, we sort of have to use it to make sure we can compare the projections (I think?). On the other hand, the parameters of the alignment estimator will probably impact the distance score. The relative distance scores should still be meaningful though, I'd think.
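The steps above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a tested recipe: `permutation_sensitivity` and `pairwise_distances` are hypothetical helper names, `embed` is any callable that maps a data matrix to an embedding, and the score is one minus the correlation between pairwise distances in the two embeddings. That score is swapped in for the earth mover's distance and AlignedUMAP suggestions because pairwise distances are unaffected by rotation or translation of the embedding, so no alignment step is needed. It does nothing about run-to-run stochasticity, so a fixed seed in the embedder is still advisable.

```python
import numpy as np


def pairwise_distances(Y):
    """Dense Euclidean distance matrix; O(n^2) memory, so small n only."""
    diff = Y[:, None, :] - Y[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def permutation_sensitivity(X, embed, seed=0):
    """Score each column by how much shuffling it disturbs the embedding.

    X     : (n_samples, n_features) array
    embed : callable taking X and returning an (n_samples, k) embedding
            (hypothetical hook; plug in whatever embedder you use)
    Returns one score per column: 0 means pairwise distances in the
    embedding are unchanged, larger values mean more disruption.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    upper = np.triu_indices(X.shape[0], k=1)  # each pair counted once

    # Step 1: the "canonical" projection, summarised by its distances.
    canonical = pairwise_distances(embed(X))[upper]

    scores = np.empty(X.shape[1])
    for i in range(X.shape[1]):
        # Steps 2-3: shuffle column i only, leaving X itself untouched.
        X_i = X.copy()
        X_i[:, i] = rng.permutation(X_i[:, i])
        # Step 4: embed the shuffled dataset D_i.
        shuffled = pairwise_distances(embed(X_i))[upper]
        # Step 5: summary statistic comparing the two embeddings.
        scores[i] = 1.0 - np.corrcoef(canonical, shuffled)[0, 1]
    return scores
```

With umap-learn installed, something like `permutation_sensitivity(X, lambda X: umap.UMAP(random_state=42).fit_transform(X))` should work as the embedder, at the cost of one full UMAP fit per column.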

dmarx avatar Jan 05 '21 19:01 dmarx

I definitely agree with David's idea of sensitivity instead of importance. I see that you are suggesting using the AlignedUMAP in 0.5.0 to reduce the embedding noise due to the stochastic nature of the embedding (i.e. if you run the algorithm a few times with no feature permutation you'll get a different embedding). Reducing that stochasticity is definitely an important step. An alternate method for eliminating the effects of this stochasticity might be to measure the disruption of the UMAP complex itself. You could do the above game but instead of measuring the impact on the embedding, simply measure the difference between the representations of your data (found in my_model.graph_). Perhaps you could use cross entropy (as we do in UMAP) to measure the difference between the 'canonical' graph and the one induced with the permuted column. This has the added benefit of eliminating any effects that might be caused by selecting an embedding dimension that is too low to easily represent your data.

If you wanted to go a step further, you could make the whole thing far more efficient by popping the hood on the code and grabbing the fuzzy_simplicial_set() function to build the UMAP complex (graph) without ever going through the computationally heavy step of actually embedding the data.
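As a rough sketch of the graph comparison, the fuzzy set cross entropy between two membership-strength matrices can be written in a few lines. The helper name `fuzzy_cross_entropy` and the `eps` clipping are assumptions made for illustration, and the function works on dense arrays, so it is only practical for small datasets.

```python
import numpy as np


def fuzzy_cross_entropy(p, q, eps=1e-12):
    """Fuzzy set cross entropy between two edge-weight matrices.

    p, q : arrays of the same shape with membership strengths in [0, 1],
           e.g. the canonical graph and the graph from a permuted column.
    eps  : clipping constant (an assumption here) to keep the logs finite.
    """
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0 - eps)
    q = np.clip(np.asarray(q, dtype=float), eps, 1.0 - eps)
    # Attractive term (edges present in p) plus repulsive term (edges absent).
    return float(np.sum(p * np.log(p / q)
                        + (1.0 - p) * np.log((1.0 - p) / (1.0 - q))))
```

For two fitted models, `fuzzy_cross_entropy(model_a.graph_.toarray(), model_b.graph_.toarray())` would give the score for one permuted column, assuming both graphs are built over the same points in the same order. Identical graphs score 0.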

jc-healy avatar Jan 12 '21 17:01 jc-healy

Would love to see something like this implemented in UMAP! For gene expression matrices in scRNA-seq data, it could be extremely useful for identifying which genes most strongly influence the latent representation.

bschilder avatar Nov 09 '21 21:11 bschilder