GPU memory usage depends on data size
I am cross-posting this bug from the TensorFlow tracker because I suspect it is caused by Keras.
It seems that model.fit and model.predict, when given NumPy arrays, try to load the whole dataset onto the GPU instead of just sending minibatches. The bug appears on versions 2.6-2.10, but not on 2.5.
System information.
- OS Platform and Distribution: Fedora 36
- TensorFlow installed from (source or binary): binary wheel
- TensorFlow version: 2.6-2.10.
- Python version: 3.6
- GPU model and memory: RTX 2070, 8 GiB.
- Exact command to reproduce:
main(big=False) works, but main(big=True) doesn't.
Standalone code to reproduce the issue.
The following code trains a simple model on either 15 MiB of data or 15 GiB. The machine has 32 GiB of RAM, so it can hold the data, but the GPU quickly fills up.
import numpy as np
from keras import layers, models


def get_model(n_inputs: int) -> models.Model:
    inp = layers.Input(shape=(n_inputs,))
    out = layers.Dense(n_inputs, activation='linear')(inp)
    m = models.Model(inputs=inp, outputs=out)
    m.compile(loss='mse', optimizer='adam')
    m.summary()
    return m


def main(big: bool):
    model = get_model(4096)
    N = 1_000_000 if big else 1_000
    train_data = np.zeros((N, 4096), dtype=np.float32)
    print(f'Evaluating on {train_data.shape[0]} data points and {train_data.nbytes / 1024**2} MiB.')
    model.predict(train_data, batch_size=16, verbose=1)
    # Also:
    model.fit(train_data, train_data, epochs=3, verbose=1, batch_size=16, max_queue_size=1)
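For completeness, the repro can be driven like this (matching the command above; the __main__ guard is just for illustration):

if __name__ == '__main__':
    main(big=True)  # main(big=False) works; main(big=True) runs out of GPU memory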
Using tf.data doesn't solve the problem:
import tensorflow as tf


def wrap_data(data: np.ndarray) -> tf.data.Dataset:
    dataset = tf.data.Dataset.from_tensor_slices(data)
    shuffled = dataset.shuffle(buffer_size=5, reshuffle_each_iteration=True)
    batched = shuffled.batch(16, num_parallel_calls=tf.data.AUTOTUNE, deterministic=False)
    autoencoder = batched.map(lambda x: (x, x)).prefetch(5)
    return autoencoder


train_data_iterator = wrap_data(train_data)
model.fit(train_data_iterator, epochs=3, verbose=1, max_queue_size=1)
Using the cuda_malloc_async allocator doesn't fix it either.
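For reference, the usual way to turn that allocator on is the TF_GPU_ALLOCATOR environment variable, set before TensorFlow touches the GPU (sketch below, assuming the environment-variable route):

import os
os.environ['TF_GPU_ALLOCATOR'] = 'cuda_malloc_async'  # must be set before TensorFlow initializes the GPU
import tensorflow as tf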
Expected behaviour
I'd expect only the current minibatch to be transferred to the GPU, rather than the whole dataset having to fit in GPU memory.
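To illustrate the kind of per-batch handoff I mean, here is a sketch using keras.utils.Sequence, where only one minibatch is materialized per step (the class name is mine, purely for illustration):

import numpy as np
import tensorflow as tf


class MinibatchSequence(tf.keras.utils.Sequence):
    """Hands Keras one minibatch at a time instead of the whole array."""

    def __init__(self, data: np.ndarray, batch_size: int = 16):
        self.data = data
        self.batch_size = batch_size

    def __len__(self) -> int:
        return int(np.ceil(len(self.data) / self.batch_size))

    def __getitem__(self, idx: int):
        batch = self.data[idx * self.batch_size:(idx + 1) * self.batch_size]
        return batch, batch  # (x, x) pairs, matching the autoencoder-style fit above

# model.fit(MinibatchSequence(train_data), epochs=3, verbose=1)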
Source code / logs. Here is the error:
-09-06 10:52:22.131172: W tensorflow/core/common_runtime/bfc_allocator.cc:462] Allocator (GPU_0_bfc) ran out of memory trying to allocate 15.26GiB (rounded to 16384000000) requested by op _EagerConst
**Note:** I was asked to reproduce this in a Colab instance. At least the instances I have access to have less RAM than GPU VRAM, so the problem cannot be reproduced there (the dataset needs to fit in RAM but not in VRAM).
I see both issues have been assigned to sushreebarsa. Sorry if this is too noisy!
I have run into this bug as well while training a Keras model with a large dataset.
After some tinkering, I found another way to replicate the same error -- simply creating a constant tensor. For some reason, TensorFlow defaults to placing the tensor on the GPU.
tf.constant(train_data)
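A quick way to see this placement is the .device attribute of the resulting tensor (using a small array here so it fits):

x = tf.constant(np.zeros((4, 4), dtype=np.float32))
print(x.device)  # with a visible GPU this reports a GPU device, i.e. the default placement described above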
Explicitly setting the device to CPU resolves this issue:
with tf.device('/CPU:0'):
    train_data = tf.constant(train_data)
When training a Keras model, all inputs must be CPU-placed Tensors for this to work:
with tf.device('/CPU:0'):
    train_data = tf.constant(train_data)
model.predict(train_data)
model.fit(train_data, train_data)
In my environment, the above snippet seems to train the model on the GPU (as it should) without copying the full dataset to the GPU. I would say the Keras codebase needs to be investigated to make sure that device placements are correct.
[Update: Fixed an error in the last snippet as reported by @timsharpzim]
> I have run into this bug as well while training a Keras model with a large dataset.
I'm also experiencing the same bug. For me, the workaround involved getting the indentation right compared to hellodanylo's comment: only the tensor creation should be inside the 'with CPU' block.
with tf.device('/CPU:0'):
    xdata = tf.convert_to_tensor(xdata)
hist = model.fit(xdata, ...)
> only the tensor creation should be inside the 'with CPU' block
@timsharpzim You are right, that's an error in my original snippet.
To summarize:
- If you pass NumPy arrays to Model.fit/predict, it will create new Tensors placed on the GPU.
- If you pass CPU-placed Tensors to Model.fit/predict, it will use them without copying to the GPU, which is the desired behavior.
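In code, the workaround from this thread boils down to something like the following sketch (the helper name is mine):

import numpy as np
import tensorflow as tf


def fit_with_cpu_placed_inputs(model: tf.keras.Model, x: np.ndarray, y: np.ndarray, **fit_kwargs):
    # Convert the NumPy inputs on the CPU so that Model.fit receives CPU-placed
    # Tensors and does not copy the whole dataset to the GPU (see summary above).
    with tf.device('/CPU:0'):
        x = tf.convert_to_tensor(x)
        y = tf.convert_to_tensor(y)
    return model.fit(x, y, **fit_kwargs)

# e.g. fit_with_cpu_placed_inputs(model, train_data, train_data, epochs=3, batch_size=16)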