text-generation-webui
Unable to load model on single GPU in a multi GPU machine
Describe the bug
I have a machine with 2 GPUs. I want to load a transformers model on only one of them, so I tried leaving the memory for the second GPU at 0, but then I get an error. I even tried using the `gpu-memory` option in CMD_flags.txt, but that didn't work either.
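For reference, the CMD_flags.txt entry in question would look something like this (the values are illustrative; bare numbers are interpreted as GiB):

```
--gpu-memory 20 0
```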
Is there an existing issue for this?
- [X] I have searched the existing issues
Reproduction
- Run text-generation-webui on a multi GPU system
- Select a transformers model from the dropdown (e.g., Llama-2-7b-chat-hf)
- Use the sliders to increase the memory used for only one of the GPUs (leave the other at 0)
- Click the Load button
Logs
Traceback (most recent call last):
File "C:\ProgramData\CommonFiles\text-generation-webui\modules\ui_model_menu.py", line 201, in load_model_wrapper
shared.model, shared.tokenizer = load_model(shared.model_name, loader)
File "C:\ProgramData\CommonFiles\text-generation-webui\modules\models.py", line 79, in load_model
output = load_func_map[loader](model_name)
File "C:\ProgramData\CommonFiles\text-generation-webui\modules\models.py", line 210, in huggingface_loader
model = LoaderClass.from_pretrained(path_to_model, **params)
File "C:\ProgramData\CommonFiles\text-generation-webui\installer_files\env\lib\site-packages\transformers\models\auto\auto_factory.py", line 565, in from_pretrained
return model_class.from_pretrained(
File "C:\ProgramData\CommonFiles\text-generation-webui\installer_files\env\lib\site-packages\transformers\modeling_utils.py", line 3222, in from_pretrained
max_memory = get_balanced_memory(
File "C:\ProgramData\CommonFiles\text-generation-webui\installer_files\env\lib\site-packages\accelerate\utils\modeling.py", line 771, in get_balanced_memory
max_memory = get_max_memory(max_memory)
File "C:\ProgramData\CommonFiles\text-generation-webui\installer_files\env\lib\site-packages\accelerate\utils\modeling.py", line 658, in get_max_memory
max_memory[key] = convert_file_size_to_int(max_memory[key])
File "C:\ProgramData\CommonFiles\text-generation-webui\installer_files\env\lib\site-packages\accelerate\utils\modeling.py", line 92, in convert_file_size_to_int
raise ValueError(err_msg)
ValueError: `size` 0MiB is not in a valid format. Use an integer for bytes, or a string with an unit (like '5.0GB').
System Info
OS: Windows Server 2022
GPUs: Nvidia H100 x 2
Use `CUDA_VISIBLE_DEVICES` to hide a GPU.
The `CUDA_VISIBLE_DEVICES` trick allowed me to work around this; it seems like a very simple string-formatting issue, though. FWIW:
`export CUDA_VISIBLE_DEVICES=0`
The bigger issue, IMO, is that if you bypass this by setting 1 MB for your second GPU (or any amount of memory, really), you get another error about not having enough memory.
Having this same issue on a Windows 11 PC with an RTX 3060 (12GB) and a Quadro M6000 (24 GB). The software recognizes both GPUs but does not allow me to run the whole workload on the M6000. Setting the 3060 workload to zero generates the error PyroGenesis describes above, and setting it to 100MB while setting the M6000 to 24 GB results in an out-of-memory error loading the model, as SlimeQ noted. I have tried to manage this from the NVIDIA app but that also does not work.
@robsalk
- Find the GPU id of the M6000 (I used `torch.cuda.get_device_name(<id>)` to verify).
- Open `cmd_windows.bat`.
- Set the `CUDA_VISIBLE_DEVICES` environment variable to the GPU id you found.
- Run `start_windows.bat` from the same command prompt.

If all works well, you should only see a single GPU in the model tab, which should be your M6000.
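If you're not sure which id belongs to the M6000, here is a quick way to check (a minimal sketch; assumes PyTorch is available in the webui's Python environment):

```python
import torch

# Print every CUDA device index with its name so you can pick the right value
# for CUDA_VISIBLE_DEVICES. Note that CUDA's ordering can differ from
# nvidia-smi's, which uses PCI bus order.
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))
```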
`CUDA_VISIBLE_DEVICES` does work for me, but I would've preferred a solution from within the webui (maybe using `device_map`?). But if that's not feasible, I'll keep using the `CUDA_VISIBLE_DEVICES` workaround.
Thank you! I will try that.
`export CUDA_VISIBLE_DEVICES=0` did the trick for me.
Same problem on Ubuntu 22.04. Hiding the second GPU works, but is there some acknowledgement that this is a bug that needs to be fixed? The GPU selector in Models seems irrelevant if you can't actually use it?
The correct values are sent to accelerate and it just does whatever it wants with them when building the device map.
Here is the solution: https://www.reddit.com/r/Oobabooga/comments/17vg2ot/multigpu_psa_how_to_disable_persistent_balanced/
Don't set the last slider to zero; set it to something like 1 or 5 GB. If you have enough space on the first GPU, it will load everything on that GPU and not use the other.
Sounds like maybe we should pass `device_map="sequential"`.
I think so too, or have an option to choose between sequential and balanced.
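For context, here is roughly what that would look like at the transformers level (a sketch; the model name is just the one from the repro steps, and the memory caps are illustrative):

```python
from transformers import AutoModelForCausalLM

# device_map="sequential" fills GPU 0 up to its max_memory cap and only then
# spills over to GPU 1, whereas the default balanced behavior splits the
# model evenly across all visible GPUs.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    device_map="sequential",
    max_memory={0: "20GiB", 1: "1GiB"},
)
```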
I have another tip! If you are like me and want to load other models (which load on GPU 0 by default), you'll want to reverse the order in which the GPUs are filled:

Go to line 663 of `modeling.py`, found here: `text-generation-webui-main\installer_files\env\Lib\site-packages\accelerate\utils`. The line is in the `get_max_memory` function. Change `gpu_devices.sort()` to `gpu_devices.sort(reverse=True)`.

Now your GPUs will be loaded in reverse order. If you do this together with the first fix I posted, you can load reverse-unbalanced and leave GPU 0 free for other models like TTS, STT, and OCR.
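A paraphrased sketch of what that one-line edit changes (not accelerate's verbatim code; the real `get_max_memory` builds its per-device memory dict from these indices):

```python
# Device indices as accelerate detects them on a two-GPU machine.
gpu_devices = [0, 1]

gpu_devices.sort()               # stock: limits are assigned starting at GPU 0
gpu_devices.sort(reverse=True)   # edited: GPU 1 is filled first, keeping GPU 0 free
```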
Same problem here. I can't find `modeling.py`; I think the files have been updated.
@JustDoIt65 I just installed the latest version of oobabooga, and the two files I usually edit are still there.
`modeling_utils.py` can be found here: `text-generation-webui-main\installer_files\env\Lib\site-packages\transformers`. Using the info at this link, you can force the VRAM per GPU: https://www.reddit.com/r/Oobabooga/comments/17vg2ot/multigpu_psa_how_to_disable_persistent_balanced/
`modeling.py` is found here: `text-generation-webui-main\installer_files\env\Lib\site-packages\accelerate\utils`. Change `gpu_devices.sort()` to `gpu_devices.sort(reverse=True)`.
> `export CUDA_VISIBLE_DEVICES=0` did the trick for me

How do I use it? In the terminal?
@vivekratr
- Run `cmd_windows.bat` (or your OS-specific equivalent) in the text-generation-webui folder.
- Set `CUDA_VISIBLE_DEVICES` like so:
  - Windows: `set CUDA_VISIBLE_DEVICES=0`
  - Linux and macOS: `export CUDA_VISIBLE_DEVICES=0`

  You can replace `0` with a different index to use a different GPU.
- Lastly, run `start_windows.bat` (or your OS-specific equivalent) from the same shell window.
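If you'd rather do it from Python than the shell, the variable just has to be set before CUDA is first initialized in the process (a minimal sketch, not part of the webui itself):

```python
import os

# Must be set before torch initializes CUDA, or it has no effect.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch
print(torch.cuda.device_count())  # reports 1 on a two-GPU machine
```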
I have a theory: I've noticed at times that the GPU indices are reversed in ooba compared to what the system reports. Is it possible that we're allocating the memory to the wrong GPU? I'm attempting to get a dual 4060 rig going and just getting flat-out garbled output; it seems there's something wrong with the configuration.
I have submitted and tested a solution to properly parse 0MiB directly to accelerate. This way nothing needs to be fixed here. https://github.com/huggingface/accelerate/pull/2507
I forgot about this issue until I posted about it on reddit today. There is a faster workaround than having to find and edit the transformers files. A while back I dug into this issue when I was trying to optimize the memory balance for QLoRA training.
I first modified the Transformers `modeling_utils.py`; as of my current install it is in `text-generation-webui-main\installer_files\env\Lib\site-packages\transformers`.

In the `modeling_utils.py` file, if you inject `params['device_map'] = 'sequential'` into the code immediately after:

```python
# change device_map into a map if we passed an int, a str or a torch.device
if isinstance(device_map, torch.device):
```

it forces sequential loading, allowing you to balance the model split as you see fit. Now, textgen-webui does pass a device map to transformers.
In `text-generation-webui-main\modules` there is the file `models.py`. As of my current install, I had either edited line 179 or put a new line there with the following: `params['device_map'] = 'sequential'`

Currently, my Transformers `modeling_utils.py` is unedited, and I have the `models.py` file edited. With this, Transformers will adhere to the set GPU memory limits and fill what it sees as GPU0 up to said limit, then start loading GPU1 to its limit or the remainder of the model. I just verified that this is working as of this post; I am able to adjust the model split with a stock transformers install.
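For orientation, a hypothetical sketch of where that line lands in `modules/models.py` (the surrounding code is paraphrased, not verbatim, and the exact line number varies by version):

```python
import torch
from transformers import AutoModelForCausalLM

# Paraphrased from huggingface_loader() in modules/models.py; in the real file,
# the loader class and model path are resolved earlier in the function.
path_to_model = "models/Llama-2-7b-chat-hf"  # hypothetical local model path
params = {
    "low_cpu_mem_usage": True,
    "torch_dtype": torch.float16,
    "max_memory": {0: "20000MiB", 1: "1000MiB", "cpu": "64GiB"},  # from the UI sliders
}
params["device_map"] = "sequential"  # the workaround: fill GPU0 to its cap, then GPU1

model = AutoModelForCausalLM.from_pretrained(path_to_model, **params)
```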
Since it is working the same as it did when I first found this workaround, I'll post the memory allocation table I made up at the time:
| Model | GPU0/GPU1 max settings (MiB) | Result: GPU0/GPU1 used (MiB) |
|---|---|---|
| 7B, no quant | 1000/20000 | 1069/13031 |
| 7B, no quant | 3000/20000 | 3385/10703 |
| 7B, no quant | 20000/5000 | 13760/360 |
| 7B, no quant | 1/20000 | 834/13283 |
| 70B, load-in-4bit | 20000/20000 | 19514/17439 |
| 70B, load-in-4bit | 18000/21000 | 17814/19137 |
| 70B, load-in-4bit | 17000/21000 | 16576/20383 |
As you can see, while it doesn't adhere perfectly to the max memory limits, it does allow adjusting the split relatively close to the limits. May not be exactly the stated issue, but I do believe they are related.
> I have submitted and tested a solution to properly parse 0MiB directly to accelerate. This way nothing needs to be fixed here. huggingface/accelerate#2507
That works for me. I updated accelerate to 0.27.2, and 0MiB finally works to exclude a specific GPU.
In my experience with text-gen-webui on Linux, `CUDA_VISIBLE_DEVICES` gets ignored or overridden and all devices are visible regardless. It is the only program I have seen behave like that on any system I've tested.
Accelerate is at 0.27.2, but I am still having this issue.