GPU memory usage depends on data size
I am cross-posting this bug from the TensorFlow tracker because I suspect it is caused by Keras.
It seems that model.fit and model.predict, when given NumPy arrays, try to load the whole dataset onto the GPU instead of just sending minibatches. The bug appears on versions 2.6-2.10, but not on 2.5.
System information.
- OS Platform and Distribution: Fedora 36
- TensorFlow installed from (source or binary): binary wheel
- TensorFlow version: 2.6-2.10.
- Python version: 3.6
- GPU model and memory: RTX 2070, 8 GiB.
- Exact command to reproduce:
main(big=False) works, but main(big=True) doesn't.
Standalone code to reproduce the issue.
The following code trains a simple model on either 15 MiB of data or 15 GiB. The machine has 32 GiB of RAM, so it can hold the data, but the GPU quickly fills up.
import numpy as np
from keras import layers, models


def get_model(n_inputs: int) -> models.Model:
    inp = layers.Input(shape=(n_inputs,))
    out = layers.Dense(n_inputs, activation='linear')(inp)
    m = models.Model(inputs=inp, outputs=out)
    m.compile(loss='mse', optimizer='adam')
    m.summary()
    return m


def main(big: bool):
    model = get_model(4096)
    N = 1_000_000 if big else 1_000
    train_data = np.zeros((N, 4096), dtype=np.float32)
    print(f'Evaluating on {train_data.shape[0]} data points and {train_data.nbytes / 1024**2} MiB.')
    model.predict(train_data, batch_size=16, verbose=1)
    # Also:
    model.fit(train_data, train_data, epochs=3, verbose=1, batch_size=16, max_queue_size=1)
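For completeness, the repro can be driven like this (matching the command above; the __main__ guard is just for illustration):

if __name__ == '__main__':
    main(big=True)  # main(big=False) works; main(big=True) runs out of GPU memory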
Using tf.data doesn't solve the problem:
import tensorflow as tf


def wrap_data(data: np.ndarray) -> tf.data.Dataset:
    dataset = tf.data.Dataset.from_tensor_slices(data)
    shuffled = dataset.shuffle(buffer_size=5, reshuffle_each_iteration=True)
    batched = shuffled.batch(16, num_parallel_calls=tf.data.AUTOTUNE, deterministic=False)
    autoencoder = batched.map(lambda x: (x, x)).prefetch(5)
    return autoencoder


train_data_iterator = wrap_data(train_data)
model.fit(train_data_iterator, epochs=3, verbose=1, max_queue_size=1)
Using the cuda_malloc_async allocator doesn't fix it either.
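For reference, the usual way to turn that allocator on is the TF_GPU_ALLOCATOR environment variable, set before TensorFlow touches the GPU (sketch below, assuming the environment-variable route):

import os
os.environ['TF_GPU_ALLOCATOR'] = 'cuda_malloc_async'  # must be set before TensorFlow initializes the GPU
import tensorflow as tf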
Expected behaviour
I'd expect only the current minibatch to be transferred to the GPU, rather than the whole dataset having to fit in GPU memory.
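To illustrate the kind of per-batch handoff I mean, here is a sketch using keras.utils.Sequence, where only one minibatch is materialized per step (the class name is mine, purely for illustration):

import numpy as np
import tensorflow as tf


class MinibatchSequence(tf.keras.utils.Sequence):
    """Hands Keras one minibatch at a time instead of the whole array."""

    def __init__(self, data: np.ndarray, batch_size: int = 16):
        self.data = data
        self.batch_size = batch_size

    def __len__(self) -> int:
        return int(np.ceil(len(self.data) / self.batch_size))

    def __getitem__(self, idx: int):
        batch = self.data[idx * self.batch_size:(idx + 1) * self.batch_size]
        return batch, batch  # (x, x) pairs, matching the autoencoder-style fit above

# model.fit(MinibatchSequence(train_data), epochs=3, verbose=1)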
Source code / logs. Here is the error:
-09-06 10:52:22.131172: W tensorflow/core/common_runtime/bfc_allocator.cc:462] Allocator (GPU_0_bfc) ran out of memory trying to allocate 15.26GiB (rounded to 16384000000) requested by op _EagerConst
**Note:** I was asked to reproduce this in a Colab instance. At least the instances I have access to have less RAM than GPU VRAM, so the problem cannot be reproduced there (the dataset needs to fit in RAM but not in VRAM).
I see both issues have been assigned to sushreebarsa. Sorry if this is too noisy!
I have run into this bug as well while training a Keras model with a large dataset.
After some tinkering, I found another way to replicate the same error -- simply creating a constant tensor. For some reason, TensorFlow defaults to placing the tensor on the GPU.
tf.constant(train_data)
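A quick way to see this placement is the .device attribute of the resulting tensor (using a small array here so it fits):

x = tf.constant(np.zeros((4, 4), dtype=np.float32))
print(x.device)  # with a visible GPU this reports a GPU device, i.e. the default placement described above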
Explicitly setting the device to CPU resolves this issue:
with tf.device('/CPU:0'):
    train_data = tf.constant(train_data)
When training a Keras model, all inputs must be CPU-placed Tensors for this to work:
with tf.device('/CPU:0'):
    train_data = tf.constant(train_data)
model.predict(train_data)
model.fit(train_data, train_data)
In my environment, the above snippet seems to train the model on the GPU (as it should) without copying the full dataset to the GPU. I would say the Keras codebase needs to be investigated to make sure that device placements are correct.
[Update: Fixed an error in the last snippet as reported by @timsharpzim]
> I have run into this bug as well while training a Keras model with a large dataset.
I'm also experiencing the same bug. For me, the workaround involved getting the indentation right compared to hellodanylo's comment: only the tensor creation should be inside the 'with CPU' block.
with tf.device('/CPU:0'):
    xdata = tf.convert_to_tensor(xdata)
hist = model.fit(xdata, ...)
> only the tensor creation should be inside the 'with CPU' block
@timsharpzim You are right, that's an error in my original snippet.
To summarize:
- If you pass NumPy arrays to Model.fit/predict, it will create new Tensors placed on the GPU.
- If you pass CPU-placed Tensors to Model.fit/predict, it will use them without copying to the GPU, which is the desired behavior.
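In code, the workaround from this thread boils down to something like the following sketch (the helper name is mine):

import numpy as np
import tensorflow as tf


def fit_with_cpu_placed_inputs(model: tf.keras.Model, x: np.ndarray, y: np.ndarray, **fit_kwargs):
    # Convert the NumPy inputs on the CPU so that Model.fit receives CPU-placed
    # Tensors and does not copy the whole dataset to the GPU (see summary above).
    with tf.device('/CPU:0'):
        x = tf.convert_to_tensor(x)
        y = tf.convert_to_tensor(y)
    return model.fit(x, y, **fit_kwargs)

# e.g. fit_with_cpu_placed_inputs(model, train_data, train_data, epochs=3, batch_size=16)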