Allocating all memory, CUDA OOM
Issue type
Bug
Have you reproduced the bug with TensorFlow Nightly?
Yes
Source
source
TensorFlow version
1.13, 1.10
Custom code
Yes
OS platform and distribution
Linux, Ubuntu 20
Mobile device
No response
Python version
3.8
Bazel version
No response
GCC/compiler version
No response
CUDA/cuDNN version
11.7
GPU model and memory
V100, 34GB
Current behavior?
TF tries to allocate ALL of the GPU memory, even though no functions that should place any data on the GPU are called.
Standalone code to reproduce the issue
import os
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"
import tensorflow as tf
from keras import backend as K

tf.config.experimental_run_functions_eagerly(not True)

message = "No GPU found. To actually train on CPU remove this assert."
assert tf.config.experimental.list_physical_devices("GPU"), message

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    print("Found GPUs: ", gpus)
    # Restrict TensorFlow to only use the first GPU
    try:
        # Currently, memory growth needs to be the same across GPUs
        for gpu in gpus:
            print("Setting memory growth for ", gpu)
            tf.config.experimental.set_memory_growth(gpu, True)
        print("Setting visible devices to ", gpus[0])
        tf.config.experimental.set_visible_devices(gpus[0], 'GPU')
        logical_gpus = tf.config.experimental.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPU")
    except RuntimeError as e:
        # Visible devices must be set before GPUs have been initialized
        print(e)
Relevant log output
WARNING:tensorflow:From mvmwm/_tf_error_test.py:6: experimental_run_functions_eagerly (from tensorflow.python.eager.polymorphic_function.quarantine) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.config.run_functions_eagerly` instead of the experimental version.
Found GPUs: [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
Setting memory growth for PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')
Setting visible devices to PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')
2024-03-22 14:06:37.455972: F tensorflow/tsl/platform/statusor.cc:33] Attempting to fetch value instead of handling error INTERNAL: failed initializing StreamExecutor for CUDA device ordinal 0: INTERNAL: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_OUT_OF_MEMORY: out of memory; total memory reported: 34087305216
Aborted (core dumped)
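As a possible workaround (untested on the machine above, and the 4096 MB cap is an arbitrary example value), the allocation can also be capped before TensorFlow ever touches the device, either via the TF_FORCE_GPU_ALLOW_GROWTH environment variable or via an explicit logical device configuration:

import os
# Both environment variables must be set before TensorFlow initializes the GPU context.
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"
os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"
import tensorflow as tf

gpus = tf.config.list_physical_devices("GPU")
if gpus:
    # Alternatively, hard-cap the allocator; 4096 MB here is only an example value.
    tf.config.set_logical_device_configuration(
        gpus[0],
        [tf.config.LogicalDeviceConfiguration(memory_limit=4096)],
    )
    print(tf.config.list_logical_devices("GPU"))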
@dennisushi I wasn't able to replicate the issue on Colab using TF v2.15; please find the gist here. Kindly use the latest TF version, as TF v1.x is no longer actively supported. Thank you!
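For reference, a minimal sketch of the same device setup using the non-deprecated tf.config API that current TF 2.x releases expect (single-GPU machine assumed):

import tensorflow as tf

# Replaces the deprecated experimental_run_functions_eagerly call.
tf.config.run_functions_eagerly(False)

gpus = tf.config.list_physical_devices("GPU")
assert gpus, "No GPU found. To actually train on CPU remove this assert."

try:
    # Memory growth must be set before the GPUs are initialized.
    for gpu in gpus:
        tf.config.experimental.set_memory_growth(gpu, True)
    # Restrict TensorFlow to the first GPU only.
    tf.config.set_visible_devices(gpus[0], "GPU")
    logical_gpus = tf.config.list_logical_devices("GPU")
    print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPU")
except RuntimeError as e:
    # Visible devices must be set before GPUs have been initialized.
    print(e)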
This issue is stale because it has been open for 7 days with no activity. It will be closed if no further activity occurs. Thank you.
This issue was closed because it has been inactive for 7 days since being marked as stale. Please reopen if you'd like to work on this further.