
Semi-supervised dimensionality reduction with continuous-valued labels (regression problem)

Open · idekany opened this issue 2 years ago · 1 comment

Hi! From the documentation, I understand how UMAP supports semi-supervised problems when target_metric is 'categorical' (the default), namely by masking unlabelled data with -1 in y. I have used this feature with success.
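For reference, a minimal sketch of the categorical case I mean (the dataset and masking fraction are just placeholders):

```python
import numpy as np
import umap
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)

# Mark roughly half of the labels as unknown; -1 denotes unlabelled
# points when target_metric='categorical' (the default).
rng = np.random.default_rng(42)
y_masked = y.copy()
y_masked[rng.random(y_masked.shape[0]) < 0.5] = -1

embedding = umap.UMAP().fit_transform(X, y=y_masked)
```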

I would also like to use the semi-supervised functionality with the continuous-valued response variable of a regression problem (i.e., with "floating-point labels"). The goal is to compute good embeddings to be used as features by a downstream predictive model. Using the metric-learning functionality with target_metric='l2' has led to moderate success in certain train-test splitting scenarios, but I would like to see whether embedding the unlabelled data in a semi-supervised setting leads to any improvement.
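To make the setup concrete, this is roughly what the metric-learning variant looks like (a toy regression dataset stands in for my data):

```python
import umap
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Learn the embedding on labelled training data only, with a
# continuous target and an L2 target metric.
mapper = umap.UMAP(target_metric="l2").fit(X_train, y=y_train)

# Embed unseen test data; both sets of embeddings then serve as
# features for a downstream regressor.
train_emb = mapper.embedding_
test_emb = mapper.transform(X_test)
```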

I have tried masking unlabelled data in y with np.nan values, which seemed to be supported by setting force_finite=False. Using toy datasets, I obtained reasonable-looking embeddings this way.
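Roughly, this is what I tried (the keyword is as I found it, but its exact name and placement may differ between umap-learn versions, e.g. force_all_finite):

```python
import numpy as np
import umap
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=1000, n_features=20, random_state=0)

# Mask a subset of the continuous targets as unknown with NaN.
rng = np.random.default_rng(0)
y_masked = y.astype(float)
y_masked[rng.random(y_masked.shape[0]) < 0.5] = np.nan

# Note: the exact name of this flag may vary by umap-learn version.
embedding = umap.UMAP(target_metric="l2").fit_transform(
    X, y=y_masked, force_finite=False
)
```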

However, for larger datasets, a ValueError is thrown. This is because the current implementation uses a hard-coded threshold at n_samples=4096: below this threshold, pairwise distances are computed exactly with sklearn.metrics.pairwise_distances(), which supports missing data; above it, pyNNDescent is used to approximate the nearest neighbors, and it does not seem to be able to handle missing values.
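For illustration (this is not UMAP's internal code), the exact path can tolerate missing values because scikit-learn's pairwise machinery can be told to accept them:

```python
import numpy as np
from sklearn.metrics import pairwise_distances

y = np.array([[0.1], [0.5], [np.nan], [0.9]])

# 'nan_euclidean' skips missing coordinates rather than raising;
# pairs with no observed coordinates in common come out as NaN.
dmat = pairwise_distances(y, metric="nan_euclidean")
print(dmat)
```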

I tried to work around this limitation by editing the source code to increase the aforementioned threshold, in order to force the use of pairwise_distances(). However, the latter runs in quadratic time and becomes extremely slow beyond ~10^4 data points.

My questions are:

  • Could you please confirm that masking unknown continuous 'labels' in y with np.nan makes UMAP compute the embeddings correctly in the semi-supervised sense?
  • If the answer to the above question is 'no', then what is the correct way of doing so?
  • If the answer to the above question is 'yes', then could you suggest a better way of dealing with missing values in the nearest-neighbor computation? (One untested idea is sketched below.)
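On that last point, one direction I have been wondering about: pyNNDescent accepts custom numba-jitted metrics, so perhaps a NaN-aware target distance could bypass the limitation. A rough, untested sketch (the treatment of missing values below is a hypothetical modelling choice, not something from the UMAP docs):

```python
import numba
import numpy as np
import pynndescent

@numba.njit()
def nan_l2(a, b):
    # Skip dimensions where either value is missing; for 1-D targets
    # this makes fully-unlabelled points distance 0 to everything,
    # which may or may not be the desired behaviour.
    d = 0.0
    for i in range(a.shape[0]):
        if not (np.isnan(a[i]) or np.isnan(b[i])):
            diff = a[i] - b[i]
            d += diff * diff
    return np.sqrt(d)

rng = np.random.default_rng(0)
y = rng.normal(size=(10_000, 1))
y[::3] = np.nan  # mask a third of the targets

index = pynndescent.NNDescent(y, metric=nan_l2, n_neighbors=15)
knn_indices, knn_dists = index.neighbor_graph
```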

Thank you. -- Istvan

idekany · Jul 27 '23 15:07

Did you make any progress on this? I would also like to do semi-supervised regression, and the docs still don't say anything about it. 🙁

e-pet · Mar 25 '25 21:03