NaNs when following the "Re-training Parametric UMAP with landmarks" tutorial
Hey umap team!
Firstly, a big thanks for all the work on this library; it is incredibly useful! The ability to retrain a ParametricUMAP whilst preserving the mapping for embeddings that have already been processed would be incredible.
I tried this out for my own use case, using the example here on umap-learn as a reference. However, when it came to the retraining phase, the reported loss for every epoch was NaN.
I assumed this was an issue with my own setup, so I copied the example verbatim. Unfortunately, I get exactly the same outcome: the model does not retrain successfully.
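For reference, here is a minimal, self-contained version of my setup, with synthetic data standing in for the tutorial's MNIST split. The landmark construction follows my reading of the docs (previous embedding positions for retained points, np.nan rows for points without a landmark); the failing call is the final fit line, which produces the log below.

import numpy as np
from umap.parametric_umap import ParametricUMAP

rng = np.random.default_rng(0)
x1 = rng.normal(size=(1000, 784)).astype("float32")  # first batch
x2 = rng.normal(size=(1000, 784)).astype("float32")  # second batch to fold in later

p_embedder = ParametricUMAP(n_components=2)
emb_x1 = p_embedder.fit_transform(x1)  # initial fit

# Retrain on old + new data; each row of `landmarks` holds the previous
# embedding of a retained point, or np.nan for a point without a landmark.
x2_lmk = np.concatenate([x1, x2])
landmarks = np.concatenate([emb_x1, np.full((len(x2), 2), np.nan, dtype="float32")])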
p_embedder.fit(x2_lmk, landmark_positions=landmarks)
Epoch 1/10
3921/3921 ━━━━━━━━━━━━━━━━━━━━ 21s 5ms/step - loss: nan
Epoch 2/10
3921/3921 ━━━━━━━━━━━━━━━━━━━━ 20s 5ms/step - loss: nan
[... identical output, loss: nan for every epoch ...]
Epoch 10/10
3921/3921 ━━━━━━━━━━━━━━━━━━━━ 19s 5ms/step - loss: nan
I suspect there has either been a regression, or the library has been updated in ways that the example does not yet reflect.
Any help or suggestions would be greatly appreciated. Cheers!
This is related to https://github.com/lmcinnes/umap/pull/1153. Maybe @jacobgolding knows what is going on; I don't have much experience with landmarks yet.
Hello! I think this might be related to the changes in #1156; it looks like the documentation hasn't been updated to reflect the new helper functions for adding landmarks. I've set aside some time in the next couple of days to confirm that this is the issue and remedy it. In the meantime, give the notebook a try instead of the code in the docs: https://github.com/lmcinnes/umap/blob/a012b9d8751d98b94935ca21f278a54b3c3e1b7f/notebooks/MNIST_Landmarks.ipynb
Thanks for the reply. I did notice the nice new helper functions in that notebook, which make life a lot simpler!
Unfortunately, I still ran into the same issue when using them.
I have been able to run the notebooks successfully on a remote machine. As far as I can tell, the issue is related to my laptop using an M3 chip. I have tried several different TensorFlow builds, from vanilla to those suggested here: https://github.com/ChaitanyaK77/Initializing-TensorFlow-Environment-on-M3-M3-Pro-and-M3-Max-Macbook-Pros.
Unfortunately, I always end up with NaN losses and a broken model when fitting with landmarks.
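One thing I plan to try next, assuming the Keras backend is TensorFlow, is enabling numeric checking so that the first op producing a NaN or Inf raises immediately with a traceback, instead of the NaN only surfacing in the epoch loss:

import tensorflow as tf

# Raise on the first op that produces a NaN/Inf (reusing the names from
# the example above) instead of silently propagating it into the loss.
tf.debugging.enable_check_numerics()
p_embedder.fit(x2_lmk, landmark_positions=landmarks)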
After some testing today I mostly just confused myself. I found a couple of things:
- When I first ran the notebook as-is on the most recent version from my fork, I encountered the same issue as you (on an M2 chip).
- scikit-learn has updated their check_array function, renaming force_all_finite to ensure_all_finite. The old name is slated for removal in 1.8, so this will be a breaking change for UMAP as a whole, and there's work to be done to prepare for it (@lmcinnes). A version-robust call is sketched after this list.
- Upgrading to scikit-learn 1.6 (the most recent version at the moment) sometimes fixed the NaNs on re-training, but not consistently: I can re-run the same code and get either something that works or something that doesn't.
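For reference, the version-robust check_array call might look like the sketch below. This is just my sketch of a workaround, not UMAP's actual fix; it picks whichever keyword the installed scikit-learn expects.

import numpy as np
import sklearn
from sklearn.utils import check_array
from packaging.version import Version

# scikit-learn 1.6 renamed force_all_finite -> ensure_all_finite, with the
# old name slated for removal in 1.8; pick the right keyword at runtime.
finite_kwarg = (
    "ensure_all_finite"
    if Version(sklearn.__version__) >= Version("1.6")
    else "force_all_finite"
)
X = check_array(np.random.rand(10, 3), dtype="float32", **{finite_kwarg: True})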
Unfortunately, I won't have much more of a chance to debug this in the near future. The next thing I would try is digging into the default landmark loss function to see what's going on there, perhaps using ops.subtract.
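To make that concrete, a NaN-robust landmark loss along those lines might look like the sketch below (keras.ops throughout; the masking logic is my guess at a fix, not the library's current implementation):

from keras import ops

def nan_safe_landmark_loss(landmark_positions, embedding):
    # Rows containing NaN mark points without landmarks; only score the rest.
    has_landmark = ops.logical_not(ops.any(ops.isnan(landmark_positions), axis=1))
    mask = ops.expand_dims(ops.cast(has_landmark, embedding.dtype), axis=1)
    # Zero out the NaN rows before subtracting so they cannot poison the sum.
    positions = ops.where(ops.isnan(landmark_positions), 0.0, landmark_positions)
    diff = ops.subtract(positions, embedding) * mask
    n_landmarks = ops.maximum(ops.sum(mask), 1.0)
    return ops.sum(diff * diff) / n_landmarks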
I seem to have encountered a similar problem and found a way to solve it. If my assumptions are correct, the loss produces NaN values due to a lack of precision in the calculations. Converting the landmark arrays to 64-bit floating-point numbers solved the problem for me.
# Set landmark loss weight and continue training our Parametric UMAP model.
p_embedder.landmark_loss_weight = 0.01
p_embedder.fit(x2_lmk.astype('float64'), landmark_positions=landmarks.astype('float64'))
p_emb2_x2 = p_embedder.transform(x2)
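If casting the arrays alone is not enough, forcing Keras itself to compute in float64 might also be worth a try (Keras 3 API; this is untested speculation on my part, and it must be set before the embedder is constructed so the weights are created in float64 too):

import keras

# Make float64 the default dtype for all layers and computations.
keras.config.set_floatx("float64")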
I have tried joachimpoutaraud's solution above, but unfortunately I am still getting the NaN loss.