[Bug]: train device cuda:1

Open ToJl9TopTonop opened this issue 2 years ago • 4 comments

What happened?

Setting the train device to cuda:1 does not work. Setting it to cuda:0 works, and so does plain cuda.

Simple solution (crutch): in start-ui.bat, add the line: set CUDA_VISIBLE_DEVICES=1

What did you expect would happen?

There are three video cards:
RTX 4060 Ti 16GB - cuda:0 or CUDA_VISIBLE_DEVICES=0
Tesla P40 24GB - cuda:1 or CUDA_VISIBLE_DEVICES=1 <- I need to select this one
Tesla P4 8GB - cuda:2 or CUDA_VISIBLE_DEVICES=2

Relevant log output

Exception in thread Thread-1 (__training_thread_function):
Traceback (most recent call last):
  File "C:\Users\IBers\AppData\Local\Programs\Python\Python310\lib\threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "C:\Users\IBers\AppData\Local\Programs\Python\Python310\lib\threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "D:\OneTrainer-master\modules\ui\TrainUI.py", line 475, in __training_thread_function
    ZLUDA.initialize_devices(self.train_config)
  File "D:\OneTrainer-master\modules\zluda\ZLUDA.py", line 34, in initialize_devices
    if not is_zluda(config.train_device) and not is_zluda(config.temp_device):
  File "D:\OneTrainer-master\modules\zluda\ZLUDA.py", line 12, in is_zluda
    return torch.cuda.get_device_name(device).endswith("[ZLUDA]")
  File "D:\OneTrainer-master\venv\lib\site-packages\torch\cuda\__init__.py", line 423, in get_device_name
    return get_device_properties(device).name
  File "D:\OneTrainer-master\venv\lib\site-packages\torch\cuda\__init__.py", line 456, in get_device_properties
    raise AssertionError("Invalid device id")
AssertionError: Invalid device id
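
The traceback ends with PyTorch rejecting the requested device index as invalid. A minimal sketch of a pre-check that would turn this into a more readable error is shown here. The function name check_cuda_device is hypothetical and is not part of OneTrainer or PyTorch; the device count is passed in as a plain integer so the sketch runs without a GPU, and in practice it would presumably come from torch.cuda.device_count().

```python
def check_cuda_device(device: str, device_count: int) -> int:
    """Return the index for a device name such as "cuda" or "cuda:1".

    Hypothetical helper: raises ValueError with a readable message when
    the index is not among the devices visible to the process.
    """
    kind, _, index_text = device.partition(":")
    if kind != "cuda":
        raise ValueError(f"not a CUDA device name: {device!r}")
    # Assumption: a bare "cuda" is treated as index 0 in this sketch.
    index = int(index_text) if index_text else 0
    if not 0 <= index < device_count:
        raise ValueError(
            f"{device!r} requested, but only {device_count} CUDA "
            f"device(s) are visible to this process"
        )
    return index
```

With a single visible device, check_cuda_device("cuda:1", 1) raises ValueError, while check_cuda_device("cuda:0", 1) returns 0.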

Output of pip freeze

No response

ToJl9TopTonop, Apr 17 '24 12:04

I don't have multiple GPUs to test this. But it seems a bit strange that cuda:0 works and cuda:1 doesn't. The device name "cuda:1" is just passed to PyTorch without any additional checks.

Nerogar, Apr 19 '24 15:04

Curiously, I am also seeing this now, although I did not previously.

jjohare, May 05 '24 19:05

My understanding is that setting "CUDA_VISIBLE_DEVICES=1" makes CUDA only expose a single device to the application. You're telling the application to use the second device, but as far as it can tell there's only one (the one that CUDA is exposing to it). So I think you either want to use CUDA_VISIBLE_DEVICES or select a device in the application, not both.

(At least that was my experience, I ran into the same problem with the SD webui when trying to select a GPU)
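
A minimal sketch of the remapping described above, written as a simplified model rather than the CUDA runtime's actual parser. The function name visible_cuda_devices is hypothetical, and the model assumes CUDA_VISIBLE_DEVICES holds only comma-separated integer indices (UUID entries are not handled).

```python
def visible_cuda_devices(env_value, physical_count):
    """Model which physical GPUs a process sees, in logical order.

    The position in the returned list is the logical index the
    application uses (cuda:0, cuda:1, ...); the value is the physical
    GPU index. Simplified model, integer entries only.
    """
    if env_value is None:
        # Variable not set: every physical device is visible.
        return list(range(physical_count))
    visible = []
    for token in env_value.split(","):
        token = token.strip()
        # Simplification: stop at the first entry that is not a valid
        # physical index, keeping only the devices listed before it.
        if not token.isdecimal() or int(token) >= physical_count:
            break
        visible.append(int(token))
    return visible


# Three GPUs installed, CUDA_VISIBLE_DEVICES=1:
print(visible_cuda_devices("1", 3))  # [1]
```

Under this model only one device is visible, so the second physical card is addressed as cuda:0, and cuda:1 has nothing to refer to.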

noisefloordev, May 26 '24 05:05

@ToJl9TopTonop Please confirm whether this is still an issue for you and, if not, what your solution was so I can document it in the wiki.

O-J1, Jul 12 '24 01:07

Given there has been no response, I will be closing this. @jjohare please let me know if this still occurs after running update.bat/sh.

O-J1, Oct 13 '24 16:10