
Allocating all memory, CUDA OOM

Open dennisushi opened this issue 1 year ago • 2 comments

Issue type

Bug

Have you reproduced the bug with TensorFlow Nightly?

Yes

Source

source

TensorFlow version

1.13, 1.10

Custom code

Yes

OS platform and distribution

Linux, Ubuntu 20

Mobile device

No response

Python version

3.8

Bazel version

No response

GCC/compiler version

No response

CUDA/cuDNN version

11.7

GPU model and memory

V100, 34GB

Current behavior?

TF tries to allocate ALL GPU memory even though no function that should place any data on the GPU has been called.

Standalone code to reproduce the issue

import os
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"
import tensorflow as tf
from keras import backend as K

tf.config.experimental_run_functions_eagerly(False)
message = "No GPU found. To actually train on CPU remove this assert."
assert tf.config.experimental.list_physical_devices("GPU"), message

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    print("Found GPUs: ", gpus)
    # Restrict TensorFlow to only use the first GPU
    try:
        # Currently, memory growth needs to be the same across GPUs
        for gpu in gpus:
            print("Setting memory growth for ", gpu)
            tf.config.experimental.set_memory_growth(gpu, True)
        print("Setting visible devices to ", gpus[0])
        tf.config.experimental.set_visible_devices(gpus[0], 'GPU')
        logical_gpus = tf.config.experimental.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPU")
    except RuntimeError as e:
        # Visible devices must be set before GPUs have been initialized
        print(e)

Relevant log output

WARNING:tensorflow:From mvmwm/_tf_error_test.py:6: experimental_run_functions_eagerly (from tensorflow.python.eager.polymorphic_function.quarantine) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.config.run_functions_eagerly` instead of the experimental version.
Found GPUs:  [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
Setting memory growth for  PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')
Setting visible devices to  PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')
2024-03-22 14:06:37.455972: F tensorflow/tsl/platform/statusor.cc:33] Attempting to fetch value instead of handling error INTERNAL: failed initializing StreamExecutor for CUDA device ordinal 0: INTERNAL: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_OUT_OF_MEMORY: out of memory; total memory reported: 34087305216
Aborted (core dumped)

dennisushi avatar Mar 22 '24 14:03 dennisushi

@dennisushi I wasn't able to replicate the issue on Colab using TF v2.15; please find the gist here. Kindly use the latest TF version, as TF v1.x is no longer actively supported. Thank you!
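As a related workaround (not from this thread, just a sketch): the same allocate-on-demand behavior can be requested through environment variables set before TensorFlow ever touches the GPU. `TF_FORCE_GPU_ALLOW_GROWTH` and `CUDA_VISIBLE_DEVICES` are documented options; the key point is that they must be set before `import tensorflow`, since the CUDA context (and the default grab-everything allocator) is created at first GPU use.

```python
import os
import sys

# These must be set BEFORE `import tensorflow`; TF reads them when the CUDA
# context is first created, and by default it reserves nearly all GPU memory.
os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"  # allocate on demand instead
os.environ["CUDA_VISIBLE_DEVICES"] = "0"          # expose only GPU 0 to TF

# Guard: if TF was already imported, the settings above are ignored.
assert "tensorflow" not in sys.modules, "set env vars before importing TF"
# import tensorflow as tf  # safe to import now
```

This avoids the ordering pitfall in the repro above, where `set_memory_growth` can race against earlier GPU initialization in the same process.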

sushreebarsa avatar Mar 27 '24 06:03 sushreebarsa

This issue is stale because it has been open for 7 days with no activity. It will be closed if no further activity occurs. Thank you.

github-actions[bot] avatar Apr 04 '24 01:04 github-actions[bot]

This issue was closed because it has been inactive for 7 days since being marked as stale. Please reopen if you'd like to work on this further.

github-actions[bot] avatar Apr 11 '24 01:04 github-actions[bot]
