Tensors are leaked when `model.save()` includes the optimizer

Open Vectorrent opened this issue 4 months ago • 4 comments

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow.js): False
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Arch Linux
  • TensorFlow.js installed from (npm or script link): NPM
  • TensorFlow.js version (use command below): 4.17.0

Describe the current behavior

When training with tfjs-node-gpu, I periodically save models to disk. However, my training has been crashing, and I've just learned why:

When `model.save()` includes the optimizer, a single tensor is leaked on each call. The leaked tensors slowly accumulate and, after enough saves, crash my computer:

await model.save(`file://saved_model`, { includeOptimizer: true })

To be clear, this is the output of `tf.memory()` before saving a model:

{ unreliable: true, numTensors: 18, numDataBuffers: 18, numBytes: 420 }

And this is after:

{ unreliable: true, numTensors: 19, numDataBuffers: 19, numBytes: 424 }
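The two snapshots differ by one tensor and 4 bytes, which is consistent with a single 4-byte scalar being left behind per save (what that scalar holds is not established here). A minimal sketch of how the difference could be measured around any async operation follows. `measureLeak` is a hypothetical helper, not part of TensorFlow.js; it only assumes that `getMemory` returns an object shaped like the `tf.memory()` output above.

```javascript
// Hypothetical helper (not part of TensorFlow.js): reports how much the
// tensor and byte counts change across one async operation.
// `getMemory` is any function returning an object shaped like tf.memory().
async function measureLeak(getMemory, operation) {
    const before = getMemory()
    await operation()
    const after = getMemory()
    return {
        tensors: after.numTensors - before.numTensors,
        bytes: after.numBytes - before.numBytes
    }
}

// Assumed usage with the model from this report:
// const leak = await measureLeak(
//     () => tf.memory(),
//     () => model.save('file://saved_model', { includeOptimizer: true })
// )
// console.log(leak)
```

Passing the memory function in, rather than calling `tf.memory()` directly, keeps the helper independent of which backend package is loaded.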

Describe the expected behavior

I would expect model saving to dispose of all unused tensors after the operation is complete.

Standalone code to reproduce the issue

This bug is 100% reproducible in both tfjs-node and tfjs-node-gpu:

import fs from 'fs'
import * as tf from '@tensorflow/tfjs-node'

const model = tf.sequential()
model.add(tf.layers.dense({ units: 10, inputShape: [1] }))
model.add(tf.layers.dense({ units: 1 }))

model.compile({
    optimizer: 'adam',
    loss: 'meanSquaredError'
})

const xs = tf.tensor2d([1, 2, 3, 4], [4, 1])
const ys = tf.tensor2d([2, 4, 6, 8], [4, 1])

fs.mkdirSync('./saved_model', { recursive: true })

model.fit(xs, ys, {
    epochs: Infinity,
    verbose: 0,
    callbacks: {
        onEpochEnd: async (epoch, logs) => {
            console.clear()
            console.log(epoch)
            console.log(tf.memory())
            if (epoch % 1000 === 0 && epoch !== 0) {
                await model.save(`file://saved_model`, {
                    includeOptimizer: true
                })
            }
        }
    }
})

Other info / logs

  • There are no logs to provide: TensorFlow.js out-of-memory conditions hard-freeze my computer, and recovering requires a forced shutdown.
  • If the `includeOptimizer` flag is disabled, the leak does not occur.
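Because the freeze leaves no logs, one possible stopgap is to watch the tensor count and stop training before memory runs out. This is only a sketch, not a fix for the leak. `createTensorGuard` and the threshold are hypothetical, and the usage comment assumes that setting `model.stopTraining = true` ends `fit()` after the current epoch.

```javascript
// Hypothetical guard (not part of TensorFlow.js): returns a check function
// that reports true once the tensor count exceeds the baseline by more
// than `maxGrowth`.
function createTensorGuard(baselineTensors, maxGrowth) {
    return function exceeded(memoryInfo) {
        return memoryInfo.numTensors - baselineTensors > maxGrowth
    }
}

// Assumed usage with the reproduction script (threshold chosen arbitrarily):
// const exceeded = createTensorGuard(tf.memory().numTensors, 100)
// ...inside onEpochEnd:
// if (exceeded(tf.memory())) {
//     console.error('Tensor count keeps growing:', tf.memory())
//     model.stopTraining = true // assumed to stop fit() after this epoch
// }
```

The baseline should be taken once the count has settled, for example at the end of the first epoch, so that normal allocations during setup are not counted as growth.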

Vectorrent, Apr 10 '24 13:04