
Semi-supervised dimensionality reduction with continuous-valued labels (regression problem)

Open · idekany opened this issue 2 years ago · 1 comment

Hi! From the documentation, I understand how UMAP supports semi-supervised problems when target_metric is 'categorical' (the default), namely by masking unlabelled data with -1 in y. I have used this feature with success.
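For reference, a minimal sketch of the categorical case I mean (the dataset and masking fraction are just placeholders):

```python
import numpy as np
import umap
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)

# Mark roughly half of the labels as unknown; -1 denotes unlabelled
# points when target_metric='categorical' (the default).
rng = np.random.default_rng(42)
y_masked = y.copy()
y_masked[rng.random(y_masked.shape[0]) < 0.5] = -1

embedding = umap.UMAP().fit_transform(X, y=y_masked)
```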

I would also like to use the semi-supervised functionality with the continuous-valued response variable of a regression problem (i.e., with "floating-point labels"). The goal is to compute good embeddings to be used as features by a downstream predictive model. Using the metric-learning functionality with target_metric='l2' has led to moderate success in certain train-test splitting scenarios, but I would like to see whether embedding the unlabelled data in a semi-supervised setting leads to any improvement.
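To make the setup concrete, this is roughly what the metric-learning variant looks like (a toy regression dataset stands in for my data):

```python
import umap
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Learn the embedding on labelled training data only, with a
# continuous target and an L2 target metric.
mapper = umap.UMAP(target_metric="l2").fit(X_train, y=y_train)

# Embed unseen test data; both sets of embeddings then serve as
# features for a downstream regressor.
train_emb = mapper.embedding_
test_emb = mapper.transform(X_test)
```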

I have tried masking unlabelled data in y with np.nan values, which seemed to be supported by setting force_finite=False. Using toy datasets, I obtained reasonable-looking embeddings this way.
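Roughly, this is what I tried (the keyword is as I found it, but its exact name and placement may differ between umap-learn versions, e.g. force_all_finite):

```python
import numpy as np
import umap
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=1000, n_features=20, random_state=0)

# Mask a subset of the continuous targets as unknown with NaN.
rng = np.random.default_rng(0)
y_masked = y.astype(float)
y_masked[rng.random(y_masked.shape[0]) < 0.5] = np.nan

# Note: the exact name of this flag may vary by umap-learn version.
embedding = umap.UMAP(target_metric="l2").fit_transform(
    X, y=y_masked, force_finite=False
)
```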

However, for larger datasets, a ValueError is thrown. This is because the current implementation uses a hard-coded threshold at n_samples=4096: below this threshold, pairwise distances are computed exactly with sklearn.metrics.pairwise_distances(), which supports missing data; above it, pyNNDescent is used to approximate the nearest neighbors, and it does not seem to be able to handle missing values.
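For illustration (this is not UMAP's internal code), the exact path can tolerate missing values because scikit-learn's pairwise machinery can be told to accept them:

```python
import numpy as np
from sklearn.metrics import pairwise_distances

y = np.array([[0.1], [0.5], [np.nan], [0.9]])

# 'nan_euclidean' skips missing coordinates rather than raising;
# pairs with no observed coordinates in common come out as NaN.
dmat = pairwise_distances(y, metric="nan_euclidean")
print(dmat)
```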

I tried to work around this limitation by editing the source code to increase the aforementioned threshold, in order to force the use of pairwise_distances(). However, the latter runs in quadratic time and becomes extremely slow beyond ~10^4 data points.

My questions are:

  • Could you please confirm that masking unknown continuous 'labels' in y with np.nan makes UMAP compute the embeddings correctly in the semi-supervised sense?
  • If the answer to the above question is 'no', then what is the correct way of doing so?
  • If the answer to the above question is 'yes', then could you suggest a better way of dealing with missing values in the nearest-neighbor computation? (One untested idea is sketched below.)
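On that last point, one direction I have been wondering about: pyNNDescent accepts custom numba-jitted metrics, so perhaps a NaN-aware target distance could bypass the limitation. A rough, untested sketch (the treatment of missing values below is a hypothetical modelling choice, not something from the UMAP docs):

```python
import numba
import numpy as np
import pynndescent

@numba.njit()
def nan_l2(a, b):
    # Skip dimensions where either value is missing; for 1-D targets
    # this makes fully-unlabelled points distance 0 to everything,
    # which may or may not be the desired behaviour.
    d = 0.0
    for i in range(a.shape[0]):
        if not (np.isnan(a[i]) or np.isnan(b[i])):
            diff = a[i] - b[i]
            d += diff * diff
    return np.sqrt(d)

rng = np.random.default_rng(0)
y = rng.normal(size=(10_000, 1))
y[::3] = np.nan  # mask a third of the targets

index = pynndescent.NNDescent(y, metric=nan_l2, n_neighbors=15)
knn_indices, knn_dists = index.neighbor_graph
```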

Thank you. -- Istvan

idekany · Jul 27 '23 15:07

Did you make any progress on this? I would also like to do semi-supervised regression, and the docs still don't say anything about it. 🙁

e-pet · Mar 25 '25 21:03