[Bug]: train device cuda:1
What happened?
Setting the train device to cuda:1 does not work. Setting it to cuda:0 or to cuda works.
Simple workaround:
In start-ui.bat, add set CUDA_VISIBLE_DEVICES=1
What did you expect would happen?
This machine has three video cards:
RTX 4060 Ti 16GB - cuda:0 or CUDA_VISIBLE_DEVICES=0
Tesla P40 24GB - cuda:1 or CUDA_VISIBLE_DEVICES=1 <- I need to select this one
Tesla P4 8GB - cuda:2 or CUDA_VISIBLE_DEVICES=2
Relevant log output
Exception in thread Thread-1 (__training_thread_function):
Traceback (most recent call last):
  File "C:\Users\IBers\AppData\Local\Programs\Python\Python310\lib\threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "C:\Users\IBers\AppData\Local\Programs\Python\Python310\lib\threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "D:\OneTrainer-master\modules\ui\TrainUI.py", line 475, in __training_thread_function
    ZLUDA.initialize_devices(self.train_config)
  File "D:\OneTrainer-master\modules\zluda\ZLUDA.py", line 34, in initialize_devices
    if not is_zluda(config.train_device) and not is_zluda(config.temp_device):
  File "D:\OneTrainer-master\modules\zluda\ZLUDA.py", line 12, in is_zluda
    return torch.cuda.get_device_name(device).endswith("[ZLUDA]")
  File "D:\OneTrainer-master\venv\lib\site-packages\torch\cuda\__init__.py", line 423, in get_device_name
    return get_device_properties(device).name
  File "D:\OneTrainer-master\venv\lib\site-packages\torch\cuda\__init__.py", line 456, in get_device_properties
    raise AssertionError("Invalid device id")
AssertionError: Invalid device id
Output of pip freeze
No response
I don't have multiple GPUs to test this. But it seems a bit strange that cuda:0 works, and cuda:1 doesn't. The device name "cuda:1" is just passed to pytorch without any additional checks.
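For illustration, here is a rough sketch of the kind of check that could run before the device string reaches PyTorch. It is not OneTrainer's code: `parse_device_index` and `check_device` are hypothetical helpers, and the device count is passed in as a plain integer so the sketch runs without a GPU. In a real process that number would presumably come from `torch.cuda.device_count()`.

```python
def parse_device_index(device: str) -> int:
    """Return the index from a device string such as "cuda" or "cuda:1".

    Hypothetical helper, not part of OneTrainer or PyTorch.
    """
    kind, _, index = device.partition(":")
    if kind != "cuda":
        raise ValueError(f"not a CUDA device string: {device!r}")
    # Simplification: a bare "cuda" is treated as index 0 here.
    return int(index) if index else 0


def check_device(device: str, device_count: int) -> int:
    """Raise a readable error if the index is outside the visible range.

    device_count is an assumption of this sketch: the number of devices
    the current process can see.
    """
    index = parse_device_index(device)
    if not 0 <= index < device_count:
        raise ValueError(
            f"{device!r} requested, but only {device_count} CUDA device(s) "
            f"are visible to this process"
        )
    return index


if __name__ == "__main__":
    print(check_device("cuda:1", 3))  # index is within range
    try:
        check_device("cuda:1", 1)     # only one device visible
    except ValueError as err:
        print(err)
```

A check like this would not fix the underlying problem, but it would replace the bare "Invalid device id" assertion with a message that says how many devices the process actually sees.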
Curiously, I am now seeing this too, although I did not previously.
My understanding is that setting "CUDA_VISIBLE_DEVICES=1" makes CUDA only expose a single device to the application. You're telling the application to use the second device, but as far as it can tell there's only one (the one that CUDA is exposing to it). So I think you either want to use CUDA_VISIBLE_DEVICES or select a device in the application, not both.
(At least that was my experience, I ran into the same problem with the SD webui when trying to select a GPU)
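The renumbering described above can be sketched in plain Python. This is a simplified model, not CUDA's implementation: `visible_physical_devices` is a hypothetical function, it handles only integer indices (not GPU UUIDs), and the card list is taken from the original report.

```python
from typing import List, Optional


def visible_physical_devices(env_value: Optional[str], physical_count: int) -> List[int]:
    """Model which physical GPUs a process sees for a CUDA_VISIBLE_DEVICES value.

    Position i in the returned list is the physical index that the process
    addresses as cuda:i. Hypothetical helper for illustration only.
    """
    if env_value is None:
        # Variable not set: every physical device is visible.
        return list(range(physical_count))
    visible = []
    for token in env_value.split(","):
        token = token.strip()
        # Enumeration stops at the first invalid entry.
        if not token.isdigit() or int(token) >= physical_count:
            break
        visible.append(int(token))
    return visible


if __name__ == "__main__":
    cards = ["RTX 4060 Ti 16GB", "Tesla P40 24GB", "Tesla P4 8GB"]
    mapping = visible_physical_devices("1", len(cards))
    for logical, physical in enumerate(mapping):
        print(f"cuda:{logical} -> {cards[physical]}")
    print(f"visible device count: {len(mapping)}")
```

Under this model, with `CUDA_VISIBLE_DEVICES=1` the Tesla P40 is the only visible device and is addressed as cuda:0 inside the process, so cuda:1 no longer refers to anything.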
@ToJl9TopTonop Please confirm if this is still an issue for you and if not, what was your solution so I can document it in the wiki.
Given there has been no response I will be closing this. @jjohare please let me know if this still occurs after running update.bat/sh