
NaN during GPU training

Dear Developers,

I'm encountering an issue during Parametric UMAP training on GPU: the loss becomes NaN, even when using the official demo notebooks without any modifications. My setup:

  • GPU: NVIDIA GeForce RTX 4090
  • CUDA/cuDNN: Detected and properly loaded
  • TensorFlow version: I have tried everything from 2.16 to the most recent release (currently 2.19, I think)
  • umap-learn: the most recent version

The training works fine when I force the model to use the CPU. On GPU, however, training starts normally but quickly diverges and logs loss: nan within the first epoch. I have tried a few things based on other issues and discussions:

  • Lowering the learning rate (e.g., 1e-3, 1e-4)
  • Disabling XLA (tf.config.optimizer.set_jit(False))
  • Disabling mixed precision
  • Using .astype('float64') for fit_transform
  • Reducing the batch size

Despite all of the above, the issue persists on GPU. Any guidance on resolving this would be appreciated.
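
For reference, this is roughly how I applied those mitigations around the demo code (a simplified sketch rather than my exact script; X, the data shape, and the batch size are placeholders):

import numpy as np
import tensorflow as tf
from umap.parametric_umap import ParametricUMAP

# Disable XLA JIT compilation and make sure mixed precision is off.
tf.config.optimizer.set_jit(False)
tf.keras.mixed_precision.set_global_policy("float32")

# Hiding the GPU entirely (i.e. training on CPU) is the only thing that
# reliably avoids the NaN loss for me:
# tf.config.set_visible_devices([], "GPU")

# Placeholder standing in for the demo dataset, cast to float64.
X = np.random.random((10000, 784)).astype("float64")

# Smaller batch size; the learning rate was lowered by editing the optimizer
# in the demo notebook itself (not shown here).
embedder = ParametricUMAP(batch_size=256)

embedding = embedder.fit_transform(X)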

Best regards

DBence17 · May 06 '25

I can confirm that this is an issue in colab with the most recent updates.

https://colab.research.google.com/drive/1QuZrOUDHaOg2VnL-RoeVjLUYsFNJj2Z4

There have been a number of updates from @AMS-Hippo and @jacobgolding - any idea what is causing this?

timsainb · May 06 '25

Thanks for taking a look. On an initial skim, no idea. I'll poke at it again if I have time tomorrow or over the next few days.

Boring details: my changes were quite small, and all related to @jacobgolding's nice work on landmarks. I don't think any of those code paths should run at all unless you explicitly add landmarks (via the add_landmark function or by setting the internals directly), and I just checked that the internals they test are never set to anything other than None in the colab code you shared.

AMS-Hippo · May 06 '25

I played around a bit with checking out previous commits from before these updates and the problem persists, so maybe this has to do with an incompatibility with newer tensorflow or keras versions.

timsainb · May 06 '25

I think this may be the same issue as https://github.com/lmcinnes/umap/issues/1180

JGSweets · Jun 26 '25

I tried to do some debugging on this today, and still ended up with more questions than answers. I have access to two different environments at the moment (linux VM, mac with M2). All tests were done with the MNIST_Landmarks notebook as of commit f123b91 (before @AMS-Hippo added the helper functions for landmarks), with the most recent version of umap otherwise (umap==0.5.9.post2).

  • On my mac: python==3.12.11, keras==3.3.3 through 3.11, tensorflow==2.19
  • On the linux VM: python==3.12.2, keras==3.5 through 3.11, tensorflow==2.17

The main thing I have to add is that I did not experience this at all on linux with the cpu or on the mac with the gpu (M2 chip); however, I did experience it intermittently with the cpu on the mac. This is baffling to me, as I'd only ever seen this when trying to use the GPU before. It also appeared to be non-deterministic, with successive runs succeeding or failing with no environment changes.

It only seemed to happen when cpu usage was maxed out, not when I ran it with fewer data points. This may be why I couldn't get it to occur on the GPU: I haven't tried a data set large enough to saturate the GPU yet.

I also could not get it to occur by just re-training the model - it only happened when re-training with landmarks.

I'll keep tinkering on this when I have time, but wanted to share some info from what I have tried so far.
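
A side note for anyone else trying to reproduce this: a TerminateOnNaN callback passed through keras_fit_kwargs should stop a run as soon as the loss diverges. A rough sketch (not exactly what I ran; x_train is a placeholder for the MNIST data and the landmark retraining step is omitted):

import numpy as np
import tensorflow as tf
from umap.parametric_umap import ParametricUMAP

# Placeholder standing in for the flattened MNIST digits.
x_train = np.random.random((60000, 784)).astype("float32")

# keras_fit_kwargs is forwarded to the keras fit() call, so a TerminateOnNaN
# callback stops training the moment the logged loss becomes NaN.
embedder = ParametricUMAP(
    keras_fit_kwargs={"callbacks": [tf.keras.callbacks.TerminateOnNaN()]},
)

embedding = embedder.fit_transform(x_train)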

jacobgolding · Jul 31 '25

Made some progress on this today. There seems to be an issue with some of the persistent state of the parametric_model, causing exploding gradients (in certain, non-deterministic ways). For now, I have a temporary workaround: save off the weights and certain parameters of the model, make a new ParametricUMAP instance, and load the weights back in, along with some of the other steps done in _fit_embed_data.

import numpy as np
from umap.parametric_umap import ParametricUMAP, prepare_networks

# Save off the weights and the parameters we want to carry over.
weights = p_embedder.parametric_model.get_weights()
_a = p_embedder._a
_b = p_embedder._b
negative_sample_rate = p_embedder.negative_sample_rate

# Make a fresh embedder.
p_embedder = ParametricUMAP()

# Re-construct the networks, mirroring what _fit_embed_data does.
# x1 is the training data from the original fit.
p_embedder.dims = [np.shape(x1)[-1]]
n_data = len(x1)
init_embedding = None

p_embedder.encoder, p_embedder.decoder = prepare_networks(
    p_embedder.encoder,
    p_embedder.decoder,
    p_embedder.n_components,
    p_embedder.dims,
    n_data,
    p_embedder.parametric_reconstruction,
    init_embedding,
)

# Restore the carried-over parameters, rebuild the model, and load the weights.
p_embedder._a = _a
p_embedder._b = _b
p_embedder.negative_sample_rate = negative_sample_rate
p_embedder._define_model()

p_embedder.parametric_model.set_weights(weights)

Tests I've done:

  • Train, continue training w/o landmarks (works)
  • Train, train w/ landmarks (breaks)
  • Train, make new embedder as above, train w/ landmarks (works)
  • Train, make new embedder, continue w/o landmarks, train w/ landmarks (breaks)

Next up: trying to work out if I can just replace the parametric_model rather than the whole class, and then working out what persistent state is causing the issues.
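
An untested sketch of that idea, assuming _define_model() re-creates parametric_model (and its optimizer state) from the existing encoder, decoder, and hyperparameters:

# Keep the same ParametricUMAP instance, but rebuild only the Keras model
# so that any stale model/optimizer state is discarded.
weights = p_embedder.parametric_model.get_weights()

p_embedder._define_model()                        # rebuild parametric_model
p_embedder.parametric_model.set_weights(weights)  # restore the learned weights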

jacobgolding · Aug 04 '25