
When transforming new data using an existing UMAP object, output array is entirely NaNs

Open krauthammera opened this issue 6 years ago • 5 comments

I have two UMAP transform objects that I've "trained" on two separate datasets, one 8000x512 and one 8000x1024 in dimension. Both transform objects produce a successful and reasonable 2-dimensional embedding of their training set. I also have an additional 160k rows of data for each of the two feature sizes (512 and 1024), which I am using as "test" data. When I transform the 160k x 512 array into 160k x 2, there is no issue. However, the 160k x 2 array produced by transforming the 160k x 1024 input is entirely NaNs. I have already verified that the inputs do not contain any NaNs or infinities, and transforming a smaller subset of the input, such as 10 rows, also produces a 10x2 output full of NaNs.

Is there any reason why this might be the case? I'm wondering if perhaps there is an inherent limit on the number of features for new data that does not apply to the original training data.
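For reference, the workflow is essentially the following (the array names and shapes here are just illustrative of my setup):

import numpy as np
import umap

# "Train" a UMAP model on the 8000 x 1024 training set
reducer_1024 = umap.UMAP(n_components=2).fit(train_1024)   # train_1024: (8000, 1024)

# Transform the held-out 160k x 1024 rows with the fitted model
test_embedding = reducer_1024.transform(test_1024)         # test_1024: (160000, 1024)

print(np.isnan(test_embedding).all())   # True -- every entry is NaN
print(np.isfinite(test_1024).all())     # True -- the inputs contain no NaNs or infs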

krauthammera avatar Oct 22 '19 20:10 krauthammera

I also tried using a new input numpy array of random floats in the same range as my test input, and encountered the same issue, as shown below:

test_rand_input = np.random.uniform(low=0.0, high=np.max(umap_test_set_dense), size=(100, 1024))
test_rand_output = umap_model_25nn_dense.transform(test_rand_input)
print(test_rand_output)

Getting this as the output:

[[nan nan]
 [nan nan]
 [nan nan]
 [nan nan]
 [nan nan]
 [nan nan]
 [nan nan]
 [nan nan]
 [nan nan]
 [nan nan]]

krauthammera avatar Oct 22 '19 21:10 krauthammera

There were definitely some bugs in earlier versions of UMAP that could cause this to happen. Do you know what version you are using? Can you reproduce the problem on a build of umap directly from the current master?
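Something like this will print the installed version; the commented line installs the current master if you want to test against it (assuming pip and git are available):

import umap
print(umap.__version__)

# To try the current master (run in a shell, not in Python):
#   pip install git+https://github.com/lmcinnes/umap.git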

lmcinnes avatar Oct 23 '19 14:10 lmcinnes

Hello! 👋

I am having the same problem with version 0.5.7. I am able to use .fit_transform as expected but when I try to .transform new data points they all come back as NaNs.

My model config is as follows:

embed_pipeline = umap.UMAP(
    n_neighbors=10,
    n_components=2,
    metric="mahalanobis",
    random_state=42,
    transform_seed=42,
    verbose=True,
    n_epochs=200,
    min_dist=0.1
)
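For completeness, the fit and transform calls look roughly like this (X_train and X_new are placeholders for my actual arrays):

# Fitting and embedding the training data works as expected
train_embedding = embed_pipeline.fit_transform(X_train)

# Transforming new points comes back as all NaNs
new_embedding = embed_pipeline.transform(X_new)
print(new_embedding)   # [[nan nan] [nan nan] ...]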

Any ideas on how to fix this?

Update: after shuffling my data around, I now get NaNs during .fit_transform as well; it appears to be happening during the KNN step:

/local_disk0/.ephemeral_nfs/envs/pythonEnv-a8b2cb68-52ac-4776-a28d-9c1d4f84c0d5/lib/python3.11/site-packages/umap/umap_.py:1952: UserWarning: n_jobs value 1 overridden to 1 by setting random_state. Use no seed for parallelism.
  warn(
UMAP(metric='mahalanobis', n_epochs=200, n_jobs=1, n_neighbors=10, random_state=42, verbose=True)
Wed May 14 16:38:38 2025 Construct fuzzy simplicial set
Wed May 14 16:39:25 2025 Finding Nearest Neighbors
Wed May 14 16:39:27 2025 Finished Nearest Neighbor Search
/local_disk0/.ephemeral_nfs/envs/pythonEnv-a8b2cb68-52ac-4776-a28d-9c1d4f84c0d5/lib/python3.11/site-packages/umap/umap_.py:576: RuntimeWarning: overflow encountered in cast
  knn_dists = knn_dists.astype(np.float32)

I am using numpy=='1.26.4'

And, in case you were curious, no NaNs are present in my dataset 😅
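For what it's worth, a quick way to check whether the pairwise mahalanobis distances themselves blow past the float32 range would be something like this (X is a placeholder for my data matrix):

import numpy as np
from scipy.spatial.distance import pdist

# Pairwise mahalanobis distances on a small random subset; pdist estimates
# the inverse covariance from the subset and raises if it is singular.
rng = np.random.default_rng(0)
subset = X[rng.choice(len(X), size=200, replace=False)]
dists = pdist(subset, metric="mahalanobis")
print(np.isfinite(dists).all())
print(dists.max(), "vs float32 max:", np.finfo(np.float32).max)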

degan-wbd avatar May 14 '25 16:05 degan-wbd

My best guess is that the mahalanobis distance is causing some issues somehow. Can you reproduce the result with, say, euclidean distance?
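Something along these lines (your configuration with only the metric swapped) should tell us whether it is specific to mahalanobis:

import numpy as np
import umap

embed_pipeline_euclidean = umap.UMAP(
    n_neighbors=10,
    n_components=2,
    metric="euclidean",
    random_state=42,
    transform_seed=42,
    n_epochs=200,
    min_dist=0.1,
)
# X_train is a placeholder for your data
embedding = embed_pipeline_euclidean.fit_transform(X_train)
print(np.isnan(embedding).any())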

lmcinnes avatar May 15 '25 03:05 lmcinnes

Hmmm, sure enough that did fix it. Any idea why mahalanobis would do this for new points? Off the top of my head, I would guess something in calculating the covariance matrix...
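If it helps narrow it down, here is a rough way to check whether the covariance matrix of the data is ill-conditioned (X_train is a placeholder for my data):

import numpy as np

# mahalanobis relies on the inverse covariance of the data; if the covariance
# matrix is singular or badly conditioned, the distances can overflow.
cov = np.cov(X_train, rowvar=False)
print(np.linalg.cond(cov))                        # very large values suggest near-singularity
print(np.linalg.matrix_rank(cov), cov.shape[0])   # rank deficiency means it is singular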

degan-wbd avatar May 19 '25 16:05 degan-wbd