
Allocating all memory, CUDA OOM

Open dennisushi opened this issue 1 year ago • 2 comments

Issue type

Bug

Have you reproduced the bug with TensorFlow Nightly?

Yes

Source

source

TensorFlow version

1.13, 1.10

Custom code

Yes

OS platform and distribution

Linux, Ubuntu 20

Mobile device

No response

Python version

3.8

Bazel version

No response

GCC/compiler version

No response

CUDA/cuDNN version

11.7

GPU model and memory

V100, 34GB

Current behavior?

TF tries to allocate ALL GPU memory even though no function that should place any data on the GPU has been called.

Standalone code to reproduce the issue

import os
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"
import tensorflow as tf
from keras import backend as K

tf.config.experimental_run_functions_eagerly(False)
message = "No GPU found. To actually train on CPU remove this assert."
assert tf.config.experimental.list_physical_devices("GPU"), message

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    print("Found GPUs: ", gpus)
    # Restrict TensorFlow to only use the first GPU
    try:
        # Currently, memory growth needs to be the same across GPUs
        for gpu in gpus:
            print("Setting memory growth for ", gpu)
            tf.config.experimental.set_memory_growth(gpu, True)
        print("Setting visible devices to ", gpus[0])
        tf.config.experimental.set_visible_devices(gpus[0], 'GPU')
        logical_gpus = tf.config.experimental.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPU")
    except RuntimeError as e:
        # Visible devices must be set before GPUs have been initialized
        print(e)

Relevant log output

WARNING:tensorflow:From mvmwm/_tf_error_test.py:6: experimental_run_functions_eagerly (from tensorflow.python.eager.polymorphic_function.quarantine) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.config.run_functions_eagerly` instead of the experimental version.
Found GPUs:  [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
Setting memory growth for  PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')
Setting visible devices to  PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')
2024-03-22 14:06:37.455972: F tensorflow/tsl/platform/statusor.cc:33] Attempting to fetch value instead of handling error INTERNAL: failed initializing StreamExecutor for CUDA device ordinal 0: INTERNAL: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_OUT_OF_MEMORY: out of memory; total memory reported: 34087305216
Aborted (core dumped)

dennisushi avatar Mar 22 '24 14:03 dennisushi

@dennisushi I wasn't able to replicate the issue on Colab using TF v2.15; please find the gist here. Kindly use the latest TF version, as TF v1.x is no longer actively supported. Thank you!
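As a related workaround (not from this thread, just a sketch): the same allocate-on-demand behavior can be requested through environment variables set before TensorFlow ever touches the GPU. `TF_FORCE_GPU_ALLOW_GROWTH` and `CUDA_VISIBLE_DEVICES` are documented options; the key point is that they must be set before `import tensorflow`, since the CUDA context (and the default grab-everything allocator) is created at first GPU use.

```python
import os
import sys

# These must be set BEFORE `import tensorflow`; TF reads them when the CUDA
# context is first created, and by default it reserves nearly all GPU memory.
os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"  # allocate on demand instead
os.environ["CUDA_VISIBLE_DEVICES"] = "0"          # expose only GPU 0 to TF

# Guard: if TF was already imported, the settings above are ignored.
assert "tensorflow" not in sys.modules, "set env vars before importing TF"
# import tensorflow as tf  # safe to import now
```

This avoids the ordering pitfall in the repro above, where `set_memory_growth` can race against earlier GPU initialization in the same process.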

sushreebarsa avatar Mar 27 '24 06:03 sushreebarsa

This issue is stale because it has been open for 7 days with no activity. It will be closed if no further activity occurs. Thank you.

github-actions[bot] avatar Apr 04 '24 01:04 github-actions[bot]

This issue was closed because it has been inactive for 7 days since being marked as stale. Please reopen if you'd like to work on this further.

github-actions[bot] avatar Apr 11 '24 01:04 github-actions[bot]
