device_lib.list_local_devices() doesn't return in the CUDA (up to RTX 2080 Ti) build
Any batch script hangs. I traced it, and it freezes inside TensorFlow at the call to `device_lib.list_local_devices()` in `C:\DFL\DeepFaceLab_NVIDIA_up_to_RTX2080Ti\_internal\DeepFaceLab\core\leras\device.py`. GPU: GeForce GTX 750 Ti. OS: Windows 10. Minimal repro:
```python
import tensorflow as tf
from tensorflow.python.client import device_lib

print(f"list_local_devices()={device_lib.list_local_devices()}")
```
I tried several things. I checked whether there was an incompatibility with the newer CUDA installed system-wide, but there shouldn't be: the build ships its own directory and an old TensorFlow 1.13. The paths are set by setenv.bat, but in addition I added them to the system's Path, and I also tried copying the .dll files both into the .bat folder and next to main.py.
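To rule out a DLL resolution problem, a quick check like the following can confirm the CUDA libraries actually load from the current environment (my own diagnostic, not DFL code; the DLL names assume the CUDA 10.0 / cuDNN 7 pairing that TF 1.13 expects):

```python
# Verify that the CUDA runtime DLLs TF 1.13 links against can be resolved
# from this environment's PATH. Run from the build's own python.
import ctypes

for name in ("cudart64_100.dll", "cublas64_100.dll", "cudnn64_7.dll"):
    try:
        ctypes.WinDLL(name)
        print(f"{name}: OK")
    except OSError as e:
        print(f"{name}: NOT FOUND ({e})")
```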
I've been using the DirectX 12 version as an alternative. The GPU is a 750 Ti, and initially I thought it was simply too old, but I've since found it's supposed to work, as it supports newer CUDA versions. There's no error message either; the call to `list_local_devices` simply never returns.
If I run setenv.bat, start the build's Python, import TensorFlow, and call `list_local_devices` interactively, the function recognizes the GPU and prints correct output, but the CLI session then hangs. The system also has an integrated Intel HD 530 GPU.
I understand this looks like a TensorFlow or driver issue, but has anyone solved it? Thanks.
```
c:\DFL\DeepFaceLab_NVIDIA_up_to_RTX2080Ti\_internal\python-3.6.8>python
Python 3.6.8 (tags/v3.6.8:3c6b436a57, Dec 24 2018, 00:16:47) [MSC v.1916 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
c:\DFL\DeepFaceLab_NVIDIA_up_to_RTX2080Ti\_internal\python-3.6.8\lib\site-packages\tensorflow\python\framework\dtypes.py:526: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
c:\DFL\DeepFaceLab_NVIDIA_up_to_RTX2080Ti\_internal\python-3.6.8\lib\site-packages\tensorflow\python\framework\dtypes.py:527: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
c:\DFL\DeepFaceLab_NVIDIA_up_to_RTX2080Ti\_internal\python-3.6.8\lib\site-packages\tensorflow\python\framework\dtypes.py:528: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
c:\DFL\DeepFaceLab_NVIDIA_up_to_RTX2080Ti\_internal\python-3.6.8\lib\site-packages\tensorflow\python\framework\dtypes.py:529: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
c:\DFL\DeepFaceLab_NVIDIA_up_to_RTX2080Ti\_internal\python-3.6.8\lib\site-packages\tensorflow\python\framework\dtypes.py:530: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
c:\DFL\DeepFaceLab_NVIDIA_up_to_RTX2080Ti\_internal\python-3.6.8\lib\site-packages\tensorflow\python\framework\dtypes.py:535: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  np_resource = np.dtype([("resource", np.ubyte, 1)])
>>>
>>> from tensorflow.python.client import device_lib
>>> print(f"list_local_devices()={device_lib.list_local_devices()}")
2022-05-09 22:11:18.429936: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2
2022-05-09 22:11:18.551876: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:
name: GeForce GTX 750 Ti major: 5 minor: 0 memoryClockRate(GHz): 1.0845
pciBusID: 0000:01:00.0
totalMemory: 2.00GiB freeMemory: 194.50MiB
2022-05-09 22:11:18.552651: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
```
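Since the interactive call does return, one way to probe the batch case without freezing the console is to run the call in a child process with a timeout, similar in spirit to DFL's own `_get_tf_devices_proc`. A minimal sketch of my own (not DFL code):

```python
# Run list_local_devices() in a child process so a hang can be detected
# instead of freezing the CLI session.
import multiprocessing

def _probe(q):
    from tensorflow.python.client import device_lib
    q.put([d.name for d in device_lib.list_local_devices()])

if __name__ == "__main__":
    q = multiprocessing.Queue()
    p = multiprocessing.Process(target=_probe, args=(q,))
    p.start()
    p.join(timeout=60)  # give TF one minute to enumerate devices
    if p.is_alive():
        p.terminate()
        print("list_local_devices() hung")
    else:
        print("devices:", q.get())
```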
For reference, here is `_get_tf_devices_proc` from device.py with my debug prints added:

```python
@staticmethod
def _get_tf_devices_proc(q : multiprocessing.Queue):
    print("_get_tf_devices_proc")
    print(sys.platform[0:3])
    if sys.platform[0:3] == 'win':
        compute_cache_path = Path(os.environ['APPDATA']) / 'NVIDIA' / ('ComputeCache_ALL')
        os.environ['CUDA_CACHE_PATH'] = str(compute_cache_path)
        print(f"CUDA_CACHE_PATH={os.environ['CUDA_CACHE_PATH']}")
        if not compute_cache_path.exists():
            io.log_info("Caching GPU kernels...")
            compute_cache_path.mkdir(parents=True, exist_ok=True)

    import tensorflow

    tf_version = tensorflow.version.VERSION
    print(f"tf_version={tf_version}")
    #if tf_version is None:
    #    tf_version = tensorflow.version.GIT_VERSION
    if tf_version[0] == 'v':
        tf_version = tf_version[1:]
    if tf_version[0] == '2':
        tf = tensorflow.compat.v1
    else:
        tf = tensorflow

    import logging
    # Disable tensorflow warnings
    tf_logger = logging.getLogger('tensorflow')
    tf_logger.setLevel(logging.ERROR)

    from tensorflow.python.client import device_lib
    print("AFTER: from tensorflow.python.client import device_lib")

    devices = []
    print(f"list_local_devices()={device_lib.list_local_devices()}")  ### HANGS HERE ###
    physical_devices = device_lib.list_local_devices()
    physical_devices_f = {}
    print("BEFORE: for dev in physical_devices:")
```
I found a solution: I just manually set the parameters of the device and skipped the check... :D
`C:\DFL\DeepFaceLab_NVIDIA_up_to_RTX2080Ti\_internal\DeepFaceLab\core\leras\device.py`:

```python
...
skip_physical_devices = True
...

@staticmethod
def _get_tf_devices_proc(q : multiprocessing.Queue):
    # do not call: device_lib.list_local_devices()
    devices = []
    physical_devices_f = {}
    ...
    if not skip_physical_devices:
        print(f"list_local_devices()={device_lib.list_local_devices()}")
    max_memory = 1556925644  # 1.45 GB
    physical_devices_f = {}
    physical_devices_f[0] = ('GPU', '750 Ti', 1556925644)
    print(physical_devices_f)
    q.put(physical_devices_f)
    time.sleep(0.1)
    if not skip_physical_devices:
        physical_devices = device_lib.list_local_devices()
        physical_devices_f = {}
        ...
```
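For clarity, the whole stub boils down to this (my reconstruction with the elided parts removed; the device name and the 1.45 GB figure are just what TF reported for my 750 Ti, so adapt them for another card):

```python
import multiprocessing
import time

def _get_tf_devices_proc_stub(q: multiprocessing.Queue):
    # Never touch device_lib.list_local_devices(); just report one GPU
    # with a fixed usable-memory figure (the 1.45 GB TF used to report).
    physical_devices_f = {0: ('GPU', '750 Ti', 1556925644)}
    q.put(physical_devices_f)
    time.sleep(0.1)

if __name__ == "__main__":
    q = multiprocessing.Queue()
    p = multiprocessing.Process(target=_get_tf_devices_proc_stub, args=(q,))
    p.start()
    print(q.get())  # {0: ('GPU', '750 Ti', 1556925644)}
    p.join()
```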
There was another apparent issue: initially it ran through the model initialization but then returned OOM where it shouldn't have, given a very small model. HWMonitor stayed at 97% GPU memory and I was unable to release it. I think it is the known TF memory-clearing issue: https://github.com/tensorflow/tensorflow/issues/36465
The numba workaround didn't solve it:

```python
from numba import cuda

cuda.select_device(0)
cuda.close()
```
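One workaround commonly suggested for that kind of issue is to isolate the TF work in a child process, so the driver reclaims all GPU memory when the process exits. A minimal sketch of my own (`run_training` is a hypothetical placeholder):

```python
import multiprocessing

def run_training():
    # ... build and train the model here; all GPU memory is released
    # automatically when this process exits ...
    pass

if __name__ == "__main__":
    p = multiprocessing.Process(target=run_training)
    p.start()
    p.join()
```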
However, when running the DirectX 12 DFL with that frozen 97% indication, it successfully ran a big model that nearly fills the memory, so possibly the HWMonitor reading was simply wrong.
In my settings, the speed-up of CUDA vs. DX12 seemed to be about 33%, for: Res: 96; AE-E-D-M: 128-64-64-16; saved model size: 467 MB. The model fits in the GPU and saturates the GPU load ([y] Place models and optimizer on GPU : y, [y] Use AdaBelief optimizer?).
Indeed, that 1.45 GB limit out of 2 GB was another issue, and I know it happens with any GPU; a friend with a 1070 Ti reports 6.63 GB out of 8 GB. Windows doesn't let a program use all, or at least more, of the memory.
One solution I tried was to "trick" Windows into using the integrated GPU for the GUI, by waking the PC from sleep or restarting it with a monitor connected to the integrated GPU's output.
The dedicated GPU's memory usage then stays at 0% when unused; however, after it is fully loaded with a big model, it just tops out at a lower percentage instead of close to 100%.
I will try setting that allocation size higher and see whether a bigger model fits or crashes, but a bit later, when I'm comfortable with that possibility.
Setting the CUDA memory higher, up to almost the maximum (1.98 GB), didn't crash the system; it just returns OOM errors eventually. However, I didn't manage to fit a bigger batch size, so maybe that amount is only for "info" purposes and Windows reserves whatever it wants anyway.
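For comparison, the standard TF 1.x way to cap (or raise) the per-process allocation is `GPUOptions`; a sketch using the generic TF 1.x API, not necessarily the knob DFL itself exposes:

```python
import tensorflow as tf

# Ask TF for ~99% of what it considers total memory (~1.98 GB on a 2 GB card).
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.99)
config = tf.ConfigProto(gpu_options=gpu_options)
sess = tf.Session(config=config)
```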
Initially there was an unknown memory issue with running big models: one that I ran with batch size 6 on DirectX 12 managed to fit only batch size 4 on CUDA. I assumed either that CUDA was taking more memory than the DX version or that the memory wasn't cleared properly. That was resolved by the procedure I mentioned in the previous message:
- PC-->Sleep
- Connect the monitor to the integrated GPU output
- Resume
- GPU memory usage = 0%
Then batch size 6 fit, but unfortunately the GPU couldn't fit 7, even with 1.98 GB set manually, while batch 6 fits within the default returned size of 1.45 GB. So it seems the returned number is only a "suggestion" and can't be overridden that simply.
Did you ever find the answer? If so, would you mind sharing it and closing this issue?
No, I used the workaround. The reserved part of the memory is Windows' business, I guess, and I haven't worked with DFL on Linux; somebody who has may be able to say what the maximum allocatable GPU RAM is.
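For anyone who wants to report that figure, a quick cross-platform check of total vs. free GPU memory (assumes the `pynvml` package, which wraps NVIDIA's NVML; not part of DFL):

```python
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
info = pynvml.nvmlDeviceGetMemoryInfo(handle)  # values are in bytes
print(f"total={info.total / 1024**3:.2f} GiB, free={info.free / 1024**3:.2f} GiB")
pynvml.nvmlShutdown()
```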