tf-keras GPU memory usage depends on data size

I am cross-posting this bug from TensorFlow because I suspect it is caused by Keras.

It seems that model.fit and model.predict, when given NumPy arrays, try to load the whole dataset onto the GPU instead of just sending minibatches. The bug appears on versions 2.6-2.10, but not on 2.5.

System information.

OS Platform and Distribution: Fedora 36
TensorFlow installed from (source or binary): binary wheel
TensorFlow version (use command below): 2.6-2.10.
Python version: 3.6
GPU model and memory: RTX 2070, 8 GiB.
Exact command to reproduce: main(big=False) works, but main(big=True) doesn't.

Standalone code to reproduce the issue.

The following code trains a simple model on either 15 MiB or 15 GiB of data. The computer has 32 GiB of RAM, so it can hold the larger dataset, but GPU memory fills up quickly.

import numpy as np
import tensorflow as tf  # only needed for the tf.data variant further down

from keras import layers, models


def get_model(n_inputs: int) -> models.Model:
    inp = layers.Input(shape=(n_inputs,))

    out = layers.Dense(n_inputs, activation='linear')(inp)
    m = models.Model(inputs=inp, outputs=out)
    m.compile(loss='mse', optimizer='adam')
    m.summary()
    return m


def main(big: bool):
    model = get_model(4096)

    N = 1_000_000 if big else 1_000
    train_data = np.zeros((N, 4096), dtype=np.float32)
    print(f'Evaluating on {train_data.shape[0]} data points and {train_data.nbytes / 1024**2} MiB.')

    model.predict(train_data, batch_size=16, verbose=1)
    # Also:
    model.fit(train_data, train_data, epochs=3, verbose=1, batch_size=16, max_queue_size=1)

Using tf.data doesn't solve the problem:

    def wrap_data(data: np.ndarray) -> tf.data.Dataset:
        dataset = tf.data.Dataset.from_tensor_slices(data)
        shuffled = dataset.shuffle(buffer_size=5, reshuffle_each_iteration=True)
        batched = shuffled.batch(16, num_parallel_calls=tf.data.AUTOTUNE, deterministic=False)
        autoencoder = batched.map(lambda x: (x, x)).prefetch(5)

        return autoencoder

    train_data_iterator = wrap_data(train_data)
    model.fit(train_data_iterator, epochs=3, verbose=1)

Using the cuda_malloc_async allocator (TF_GPU_ALLOCATOR=cuda_malloc_async) doesn't fix it either.

Expected behaviour

I'd expect only the current minibatch to be transferred to the GPU, so that the GPU never needs to hold the full dataset.
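To make the expected behaviour concrete, here is a minimal NumPy-only sketch of minibatch slicing. The helper name iter_minibatches is hypothetical and this is not how Keras feeds data internally. It only illustrates that each step needs a 16-row slice of the host array, not the whole array.

```python
import numpy as np


def iter_minibatches(data: np.ndarray, batch_size: int):
    """Yield consecutive minibatches as views into `data` (hypothetical helper).

    Basic slicing returns views, so the full array is never copied.
    """
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]


# Small stand-in for the training data from the report.
data = np.zeros((1_000, 4096), dtype=np.float32)
batches = list(iter_minibatches(data, batch_size=16))
first = batches[0]

print(f'{len(batches)} batches, first batch shape {first.shape}, '
      f'{first.nbytes / 1024} KiB per full batch')
```

Model.fit also accepts Python generators, so a generator along these lines (yielding (x, y) pairs) might be another way to keep the data on the host. That was not tested in this thread.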

Source code / logs. Here is the error:

-09-06 10:52:22.131172: W tensorflow/core/common_runtime/bfc_allocator.cc:462] Allocator (GPU_0_bfc) ran out of memory trying to allocate 15.26GiB (rounded to 16384000000)requested by op _EagerConst
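As a sanity check (not part of the original log), the size in the error message matches the full training array from main(big=True), which suggests the allocator is being asked for the whole dataset at once rather than a single batch:

```python
# Shape and dtype taken from the reproduction script: (1_000_000, 4096) float32.
n_rows, n_cols = 1_000_000, 4096
bytes_per_float32 = 4

total_bytes = n_rows * n_cols * bytes_per_float32
total_gib = total_bytes / 1024**3

print(total_bytes)              # 16384000000, the "rounded to" value in the log
print(f'{total_gib:.2f} GiB')   # 15.26 GiB, the requested allocation in the log
```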

** Note ** I was asked to reproduce it in a Colab instance. At least the ones I have access to have less RAM than GPU vRAM, so this problem cannot be reproduced there.

Sep 08 '22 10:09 Dapid

I see both have been assigned to sushreebarsa. Sorry if it is too noisy!

Sep 08 '22 12:09 Dapid

I have run into this bug as well while training a Keras model with a large dataset.

After some tinkering, I found another way to reproduce the same error: simply creating a constant tensor. For some reason, TensorFlow defaults to placing the tensor on the GPU.

tf.constant(train_data)

Explicitly setting the device to CPU resolves this issue:

with tf.device('/CPU:0'):
    train_data = tf.constant(train_data)

When training a Keras model, all inputs must be CPU-placed Tensors to work:

with tf.device('/CPU:0'):
    train_data = tf.constant(train_data)

model.predict(train_data)
model.fit(train_data, train_data)

In my environment, the above snippet seems to train the model on the GPU (as it should) without copying the full dataset to the GPU. I would say the Keras codebase needs to be investigated to make sure that device placements are correct.

[Update: Fixed an error in the last snippet as reported by @timsharpzim]

Sep 20 '22 15:09 hellodanylo

> I have run into this bug as well while training a Keras model with a large dataset.

I'm also experiencing the same bug. For me, the workaround involved getting the indentation right compared with hellodanylo's comment: only the tensor creation should be inside the 'with cpu' block.

with tf.device('/CPU:0'):
    xdata = tf.convert_to_tensor(xdata)
hist = model.fit(xdata, ......)

Sep 23 '22 04:09 timsharpzim

> only the tensor creation should be in the 'with cpu'

@timsharpzim You are right, that's an error in my original snippet.

To summarize:

If you pass NumPy arrays to Model.fit/predict, it will create new Tensors placed on the GPU.
If you pass CPU-placed Tensors to Model.fit/predict, it will use them without copying to the GPU -- which is the desired behavior.
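
A minimal sketch of that second case, assuming TensorFlow 2.x. The helper name as_cpu_tensor is made up for illustration and is not a Keras or TensorFlow API. It only wraps the tf.device('/CPU:0') pattern from the snippets above and prints the resulting placement:

```python
import numpy as np
import tensorflow as tf


def as_cpu_tensor(array: np.ndarray) -> tf.Tensor:
    """Convert a NumPy array to a tensor explicitly placed on the CPU (hypothetical helper)."""
    with tf.device('/CPU:0'):
        return tf.convert_to_tensor(array)


# Small stand-in array; the real training data would go here.
data = np.zeros((1_000, 8), dtype=np.float32)
cpu_data = as_cpu_tensor(data)

# The device string should name a CPU device even on a machine with a GPU.
print(cpu_data.device)
```

The resulting tensor can then be passed to model.fit or model.predict outside the with block, as in the corrected snippet. As later comments show, this did not help everyone.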

Sep 23 '22 04:09 hellodanylo

I think I'm getting this same problem, except that explicitly putting the training data onto the CPU doesn't seem to solve it.

GPU memory usage is low until model.fit(...), at which point it fills up (if the training data is smallish) or crashes (if it's big).

Dec 01 '22 18:12 aselker

> I think I'm getting this same problem, except that explicitly putting the training data onto the CPU doesn't seem to solve it.
>
> GPU memory usage is low until model.fit(...), at which point it fills up (if the training data is smallish) or crashes (if it's big).

Same issue I'm having

Jan 05 '23 02:01 The-Vheed

I am having the same issue. If you pass a NumPy array to .fit / .predict, TF copies all the data to the GPU, while IMHO the default behavior should be to keep the data in host memory and copy only minibatches.

Mar 09 '23 08:03 levent2100

Having the same issue. Also, I've seen posts saying it's a TF 2.6-2.10 bug. I'm currently running TF 2.12 and the issue persists.

May 10 '23 18:05 amberT15

> > I have run into this bug as well while training a Keras model with a large dataset.
>
> I'm also experiencing the same bug. For me, the workaround involved getting the indentation right compared with hellodanylo's comment: only the tensor creation should be inside the 'with cpu' block.
>
> with tf.device('/CPU:0'):
>     xdata = tf.convert_to_tensor(xdata)
> hist = model.fit(xdata, ......)

I saw this today: the whole dataset was loaded onto the GPU after a few model fits that had worked, and the above appears to have fixed it.

Windows 11

tensorflow                2.10.0                   pypi_0    pypi

Jun 07 '23 00:06 RichardScottOZ

This issue was introduced after version 2.6.0. TensorFlow 2.5.3 worked fine.

Jun 23 '24 10:06 nonamestreet

The problem persists on TensorFlow 2.17.0 / Keras 3.5.0.

A solution would be for fit to convert the input to a constant allocated on the CPU whenever it is passed a NumPy array. That would make it work out of the box, and if the data is small and performance is critical, the user could still pre-allocate it on the GPU.

Aug 13 '24 10:08 Dapid