
Memory Leak During Prediction with `{training: true}`

Open AmitMY opened this issue 1 year ago • 5 comments

System information

  • TensorFlow.js version: 4.10.0

Describe the current behavior

There has been a lot of discussion about running model prediction with the correct batch normalization statistics (https://github.com/tensorflow/tfjs/issues/3152), which comes down to calling:

model.apply(tensor, {training: true})

Instead of

model.apply(tensor)

In my specific model, this leaks 2 tensors per call.

Describe the expected behavior

No memory leaks regardless of training situation.

Standalone code to reproduce the issue

Fully standalone: https://drive.google.com/file/d/1DB9UzPDpZ8umTwsA8dvlWElCkU2ykLyw/view?usp=sharing (to reproduce, I ran http-server .)

const model = await tf.loadLayersModel('model.json');

const tensor = tf.zeros([1, 1, 256, 256, 3]).toFloat();
tensor.print();

// Must apply model in training=True mode to avoid using aggregated norm statistics
const beforeTensors = tf.memory().numTensors;
const pred = tf.tidy(() => model.apply(tensor, {training: true}));
console.log('leaking', tf.memory().numTensors - beforeTensors - 1, 'tensors');
// leaking 2 tensors

const beforeTensors2 = tf.memory().numTensors;
const pred2 = tf.tidy(() => model.apply(tensor));
console.log('leaking', tf.memory().numTensors - beforeTensors2 - 1, 'tensors');
// leaking 0 tensors

AmitMY avatar Oct 02 '23 19:10 AmitMY

Hi, @AmitMY

Thank you for bringing this issue to our attention. I tried to replicate it on my end and I'm getting the same result you mentioned above. I also tried earlier versions of @tensorflow/tfjs and the issue still exists, so we'll have to dig more into this and will update you soon. Thank you!

I have added a screenshot below for reference:

[screenshot]

gaikwadrahul8 avatar Oct 03 '23 09:10 gaikwadrahul8

There has always been a memory leak. I have a training cycle that runs the same thing over and over again, and each time both the program's memory footprint and the execution time increase. Take the simplest model training, any model, put it in a loop for 12 hours, and everything will be visible.

borodadada avatar Nov 04 '23 15:11 borodadada
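
To make growth like this measurable, here is a minimal sketch, not taken from the thread, that logs tf.memory() after each training iteration; model, xs, and ys are hypothetical placeholders for a compiled model and its training data.

async function trainAndLogMemory(model, xs, ys, iterations) {
  for (let i = 0; i < iterations; i++) {
    const history = await model.fit(xs, ys, { epochs: 1, verbose: 0 });
    // tf.memory() reports live tensor and byte counts; a steady upward
    // trend across iterations is the signature of a leak.
    const { numTensors, numBytes } = tf.memory();
    console.log(`iter ${i}: loss=${history.history.loss[0]}, tensors=${numTensors}, bytes=${numBytes}`);
  }
}

If numTensors stays flat while process memory still grows, the growth is happening outside of TensorFlow.js tensor allocations.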

Hi, @AmitMY, @borodadada

I apologize for the delayed response. As per my current understanding, when model.apply() is called with {training: true}, it internally creates new tensors that are not disposed of within the tidy scope.

This leads to a memory leak because the intermediate tensors created by model.apply() with {training: true} are not cleaned up properly. As a result, the number of tensors in memory increases, leading to memory exhaustion over time if this pattern is repeated. To avoid the leak, you can modify the code to explicitly dispose of the tensors created by model.apply() with {training: true} using tf.dispose().

If I have missed something here please let me know. Thank you.

async function main() {
  const model = await tf.loadLayersModel('web_model/model.json');
  const tensor = tf.zeros([1, 1, 256, 256, 3]).toFloat();
  tensor.print();

  const beforeTensors = tf.memory().numTensors;
  const pred = tf.tidy(() => {
    const prediction = model.apply(tensor, { training: true });
    tf.dispose([tensor, prediction]); // Explicitly dispose of the input tensor and the prediction
    return prediction;
  });
  console.log('leaking', tf.memory().numTensors - beforeTensors - 1, 'tensors');
  // leaking 0 tensors
}

main();

gaikwadrahul8 avatar Dec 29 '23 16:12 gaikwadrahul8

While I understand your solution, I think that addressing it explicitly like this will cause problems for other users (it would be fixed for me, but others might run into invisible memory leaks until they happen to find this specific issue).

Is there no way to fix it in the core?

AmitMY avatar Dec 30 '23 12:12 AmitMY
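
Until there is a fix in core, one way to keep the leak from being invisible is to wrap the call in a small helper that at least reports escaped tensors. This is a minimal sketch, not a tfjs API; applyInTrainingMode is a hypothetical name and it assumes a single-output model.

function applyInTrainingMode(model, input) {
  const before = tf.memory().numTensors;
  const prediction = tf.tidy(() => model.apply(input, { training: true }));
  // Anything beyond the single returned prediction escaped the tidy scope.
  const leaked = tf.memory().numTensors - before - 1;
  if (leaked > 0) {
    console.warn(`model.apply with {training: true} leaked ${leaked} tensor(s)`);
  }
  return prediction;
}

This does not free the leaked tensors, it only surfaces them, so the underlying fix would still need to happen in the library.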

Why can't you add an initialization function that runs every time after training is completed, without input data, just a function? Or am I not understanding something?

borodadada avatar Dec 30 '23 13:12 borodadada
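
For reference, a post-training cleanup function along the lines suggested above can only free tensors the caller still holds references to; it cannot reach the tensors leaked inside model.apply(). A minimal sketch, with cleanupAfterTraining as a hypothetical helper name:

function cleanupAfterTraining(handles) {
  // tf.dispose accepts a tensor, an array of tensors, or an object containing tensors.
  tf.dispose(handles);
  // Note: tf.disposeVariables() would also free the model's weights, so it is
  // only appropriate once the model itself is no longer needed.
  const { numTensors, numBytes } = tf.memory();
  console.log(`after cleanup: ${numTensors} tensors, ${numBytes} bytes still allocated`);
}

// Example usage with tensors the caller created:
// cleanupAfterTraining([xs, ys, prediction]);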