
Tensors are leaked when `model.save()` includes the optimizer

Vectorrent opened this issue 10 months ago • 4 comments

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow.js): False
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Arch Linux
  • TensorFlow.js installed from (npm or script link): NPM
  • TensorFlow.js version (use command below): 4.17.0

Describe the current behavior

When using @tensorflow/tfjs-node-gpu for training, I periodically save models to disk. However, my training has been crashing, and I've just learned why:

When model.save() includes the optimizer, a single tensor is leaked. This leads to a slow accumulation of unnecessary tensors, which eventually crashes my computer:

await model.save(`file://saved_model`, { includeOptimizer: true })

To be clear, this is before saving a model:

{ unreliable: true, numTensors: 18, numDataBuffers: 18, numBytes: 420 }

And this is after:

{ unreliable: true, numTensors: 19, numDataBuffers: 19, numBytes: 424 }
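
For context, these snapshots come from logging tf.memory() around the save call, roughly like this sketch (model and tf are set up exactly as in the reproduction script below):

console.log(tf.memory())   // numTensors: 18
await model.save(`file://saved_model`, { includeOptimizer: true })
console.log(tf.memory())   // numTensors: 19 -- one extra tensor per save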

Describe the expected behavior

I would expect model saving to dispose of all unused tensors once the operation is complete.

Standalone code to reproduce the issue

This bug is 100% reproducible in both tfjs-node and tfjs-node-gpu:

import fs from 'fs'
import * as tf from '@tensorflow/tfjs-node'

// Minimal two-layer model
const model = tf.sequential()
model.add(tf.layers.dense({ units: 10, inputShape: [1] }))
model.add(tf.layers.dense({ units: 1 }))

model.compile({
    optimizer: 'adam',
    loss: 'meanSquaredError'
})

// Toy training data
const xs = tf.tensor2d([1, 2, 3, 4], [4, 1])
const ys = tf.tensor2d([2, 4, 6, 8], [4, 1])

fs.mkdirSync('./saved_model', { recursive: true })

// Train indefinitely; tf.memory().numTensors grows by one on every save
model.fit(xs, ys, {
    epochs: Infinity,
    verbose: 0,
    callbacks: {
        onEpochEnd: async (epoch, logs) => {
            console.clear()
            console.log(epoch)
            console.log(tf.memory())
            // Save (with optimizer state) every 1000 epochs
            if (epoch % 1000 === 0 && epoch !== 0) {
                await model.save(`file://saved_model`, {
                    includeOptimizer: true
                })
            }
        }
    }
})

Other info / logs

  • There are no logs to provide, because TFJS OOM issues hard-freeze my computer, requiring a forced shutdown to recover.
  • If the includeOptimizer flag is disabled, the leak does not occur.

Vectorrent · Apr 10 '24 13:04

Hi, @Vectorrent

Thank you for bringing this issue to our attention. I tried to replicate the same behaviour on my macOS machine and I'm getting the output below with the includeOptimizer: true flag; as you mentioned, the issue does not happen with includeOptimizer: false, which I also observed. One workaround is to disable the includeOptimizer flag when saving the model. This avoids saving the optimizer state and prevents the leak, but you'll need to recreate the optimizer when loading the model. Alternatively, TensorFlow.js provides functions for manual memory management; you can try the following approach after each save (please refer to the official documentation for tf.tidy and tf.dispose):

await model.save(`file://saved_model`, { includeOptimizer: true });

// Manually dispose of the optimizer
model.optimizer.dispose();

// Dispose of other unused tensors
tf.dispose(xs);
tf.dispose(ys);
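
If you take the includeOptimizer: false route, a minimal sketch of recreating the optimizer at load time could look like the following (the path and compile settings mirror the reproduction script above; adjust to your setup):

// Load the architecture and weights saved without optimizer state
const loaded = await tf.loadLayersModel('file://saved_model/model.json');

// Re-compile to attach a fresh Adam optimizer, since its state was not saved
loaded.compile({ optimizer: 'adam', loss: 'meanSquaredError' });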

[screenshot of the tf.memory() output]

Please let me know if I have missed anything here. Thank you for your cooperation and patience.

gaikwadrahul8 · Apr 10 '24 21:04

Thanks for the quick response. Sadly, tf.tidy() has no effect, and tf.dispose() crashes my training session (it disposes the xs/ys tensors the training loop is still using). So, neither of these is a "solution", and we should probably fix the underlying bug in the library. I might have some time to dig into the TFJS code and troubleshoot that at some point.

Until then, my workaround is to 1) create a manual training loop, 2) save the model, 3) unload the model, 4) re-load the model, and 5) resume training. Not a great solution, if you ask me :rofl:
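
Roughly, that save/unload/reload cycle looks like the sketch below (buildModel() is a hypothetical helper that creates and compiles the model; whether the optimizer state survives the round trip depends on your tfjs version and the includeOptimizer flag):

let model = buildModel()
for (let epoch = 1; ; epoch++) {
    await model.fit(xs, ys, { epochs: 1, verbose: 0 })
    if (epoch % 1000 === 0) {
        // 2) save, 3) unload, 4) re-load, 5) resume
        await model.save('file://saved_model', { includeOptimizer: true })
        model.dispose()
        model = await tf.loadLayersModel('file://saved_model/model.json')
        // If the optimizer/training config did not survive the round trip,
        // re-compile here before resuming training
    }
}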

Vectorrent · Apr 10 '24 22:04

I cannot for the life of me figure out how to build TFJS locally on my computer, so I'm not really able to debug or test this properly. Regardless, I've been digging, and this is probably where we need to apply a fix: https://github.com/tensorflow/tfjs/blob/master/tfjs-layers/src/engine/training.ts#L2146

If I had to guess, maybe it's related to the use of io.concatenateArrayBuffers here? Apparently, it's deprecated and we should be using tf.io.CompositeArrayBuffer.join() instead.
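
For reference, the swap being suggested would look roughly like this (weightDataBuffers is a placeholder name for the buffers joined inside training.ts, and this assumes tf.io.CompositeArrayBuffer is exported by the tfjs version in use):

// Deprecated helper mentioned above:
const joined = tf.io.concatenateArrayBuffers(weightDataBuffers)
// Suggested replacement:
const joinedAlt = tf.io.CompositeArrayBuffer.join(weightDataBuffers)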

Vectorrent · Apr 11 '24 19:04

I wrapped the saving of the model in tf.engine().startScope() and tf.engine().endScope() to prevent the leaking tensor.
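
A minimal sketch of that workaround, using the public tf.engine() scope API:

// Open a manual scope so tensors created during the save are tracked...
tf.engine().startScope()
await model.save('file://saved_model', { includeOptimizer: true })
// ...and disposed when the scope is closed
tf.engine().endScope()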

mightyplow · Aug 13 '24 06:08