NaN during GPU training
Dear Developers,
I'm encountering an issue during Parametric UMAP training on GPU: the loss becomes NaN, even when using the official demo notebooks without any modifications. My setup:
- GPU: NVIDIA GeForce RTX 4090
- CUDA/cuDNN: Detected and properly loaded
- TensorFlow version: I have tried everything from 2.16 up to the most recent (currently 2.19, I think)
- umap-learn: the most recent version
The training works fine when I force the model to use the CPU. On GPU, however, training starts normally but quickly diverges and logs `loss: nan` within the first epoch.
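(For reference, one way to force CPU execution is to hide the GPU from TensorFlow before any ops are created; a minimal sketch, not specific to umap:)

```python
import tensorflow as tf

# Hide all GPUs from TensorFlow so everything is built and trained on CPU.
# This must run before any TensorFlow op has been placed on a device.
tf.config.set_visible_devices([], "GPU")
```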
I have tried a few things based on other issues and discussions (a rough sketch of the TensorFlow-level settings follows the list):
- Lowering the learning rate (e.g., 1e-3, 1e-4)
- Disabling XLA (`tf.config.optimizer.set_jit(False)`)
- Disabling mixed precision
- Casting the input to float64 (`.astype('float64')`) before `fit_transform`
- Reducing the batch size
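Roughly what the XLA / mixed-precision / dtype changes looked like (a sketch only; `x` stands for the training data, and the learning rate and batch size were adjusted separately):

```python
import tensorflow as tf
from umap.parametric_umap import ParametricUMAP

# Turn off XLA JIT compilation and pin the Keras dtype policy to plain float32
# (i.e., make sure mixed precision is not in effect).
tf.config.optimizer.set_jit(False)
tf.keras.mixed_precision.set_global_policy("float32")

# Cast the input up to float64 before fitting; `x` is the training data.
embedder = ParametricUMAP()
embedding = embedder.fit_transform(x.astype("float64"))
```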
Despite all of the above, the issue persists on GPU. Any guidance on resolving this would be appreciated.
Best regards
I can confirm that this is an issue in colab with the most recent updates.
https://colab.research.google.com/drive/1QuZrOUDHaOg2VnL-RoeVjLUYsFNJj2Z4
There have been a number of updates from @AMS-Hippo and @jacobgolding - any idea what is causing this?
Thanks for taking a look. On an initial skim, no idea. I'll poke at it again if I have time tomorrow or in the next few days.
Boring details: My changes were quite small and all related to @jacobgolding's nice work on landmarks. I don't think any of these changes should run at all if you don't explicitly add landmarks (via the add_landmark function or by setting internals), and I just checked that the internals they inspect are never set to anything other than None in the colab code you shared.
I played around a bit with checking out previous commits from before these updates and the problem persists, so maybe this has to do with an incompatibility with newer TensorFlow or Keras versions.
I think this may be the same issue: https://github.com/lmcinnes/umap/issues/1180
I tried to do some debugging on this today and still ended up with more questions than answers. I have access to two different environments at the moment (a Linux VM and a Mac with an M2). All tests were done with the MNIST_Landmarks notebook as of commit f123b91 (before @AMS-Hippo added the helper functions for landmarks), with the most recent version of umap otherwise (umap==0.5.9.post2).
- Mac: python==3.12.11, keras==3.3.3 through 3.11, tensorflow==2.19
- Linux VM: python==3.12.2, keras==3.5 through 3.11, tensorflow==2.17
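(For comparing environments, a quick report like the following is enough; it only uses standard `__version__` attributes and `tf.config`:)

```python
import sys, platform
import tensorflow as tf
import keras
import umap

# Print the versions and available GPUs so runs on different machines are comparable.
print("python     :", sys.version.split()[0], platform.platform())
print("tensorflow :", tf.__version__)
print("keras      :", keras.__version__)
print("umap-learn :", umap.__version__)
print("GPUs       :", tf.config.list_physical_devices("GPU"))
```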
The main thing I have to add is that I did not experience this at all on Linux with CPU or on the Mac with GPU (M2 chip) - however, I did experience it intermittently with the CPU on the Mac. This is baffling to me; I'd only ever seen this when trying to use a GPU before. It also appeared to be non-deterministic, with successive runs succeeding or failing with no environment changes.
It only seemed to happen when CPU usage was maxed out - not when I ran it with fewer data points. This may be why I couldn't get it to occur on the GPU: I haven't tried a data set large enough to saturate the GPU yet.
I also could not get it to occur by just re-training the model - it only happened when re-training with landmarks.
I'll keep tinkering on this when I have time, but wanted to share some info from what I have tried so far.
Made some progress on this today. There seems to be an issue with some of the persistent state of the parametric_model, causing exploding gradients (in certain, non-deterministic ways). For now, I have a temporary workaround: save the weights and certain parameters of the model off, make a new ParametricUMAP instance, and load the weights back in, along with some of the other steps done in _fit_embed_data.
```python
import numpy as np
from umap.parametric_umap import ParametricUMAP, prepare_networks

# Save the weights and the fitted parameters off the old embedder
weights = p_embedder.parametric_model.get_weights()
_a = p_embedder._a
_b = p_embedder._b
negative_sample_rate = p_embedder.negative_sample_rate

# Make a new embedder
p_embedder = ParametricUMAP()

# Re-construct the model, mirroring the relevant steps of _fit_embed_data
# (x1 is the training data)
p_embedder.dims = [np.shape(x1)[-1]]
n_data = len(x1)
init_embedding = None
p_embedder.encoder, p_embedder.decoder = prepare_networks(
    p_embedder.encoder,
    p_embedder.decoder,
    p_embedder.n_components,
    p_embedder.dims,
    n_data,
    p_embedder.parametric_reconstruction,
    init_embedding,
)

# Restore the saved parameters, rebuild the compiled model, and load the weights back in
p_embedder._a = _a
p_embedder._b = _b
p_embedder.negative_sample_rate = negative_sample_rate
p_embedder._define_model()
p_embedder.parametric_model.set_weights(weights)
```
Tests I've done:
- Train, continue training w/o landmarks (works)
- Train, train w/ landmarks (breaks)
- Train, make new embedder as above, train w/ landmarks (works)
- Train, make new embedder, continue w/o landmarks, train w/ landmarks (breaks)
Next up: trying to work out whether I can replace just the parametric_model rather than the whole class, and then working out what persistent state is causing the issues.
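A rough, untested sketch of what that first experiment might look like, reusing the same calls as the workaround above (whether `_define_model()` alone clears the problematic state is exactly the open question):

```python
# Untested idea: rebuild only the compiled model on the *existing* embedder,
# keeping the current weights, instead of constructing a new ParametricUMAP.
weights = p_embedder.parametric_model.get_weights()
p_embedder._define_model()  # re-creates parametric_model (and, presumably, its optimizer state)
p_embedder.parametric_model.set_weights(weights)
```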