
Unable to load model on single GPU in a multi GPU machine

Open PyroGenesis opened this issue 1 year ago • 22 comments

Describe the bug

I have a machine with 2 GPUs. I want to load the transformers model on only one of the GPUs, so I tried leaving the memory for the second GPU at 0, but then I get an error. I even tried using the gpu-memory option in CMD_flags.txt, but that didn't work either.

Is there an existing issue for this?

  • [X] I have searched the existing issues

Reproduction

  1. Run text-generation-webui on a multi GPU system
  2. Select a Transformers model from the dropdown (e.g., Llama-2-7b-chat-hf)
  3. Use the sliders to increase the memory used for only one of the GPUs (leave the other on 0)
  4. Click the Load button

Screenshot

[screenshot attached to the original issue; not reproduced here]

Logs

Traceback (most recent call last):
  File "C:\ProgramData\CommonFiles\text-generation-webui\modules\ui_model_menu.py", line 201, in load_model_wrapper
    shared.model, shared.tokenizer = load_model(shared.model_name, loader)
  File "C:\ProgramData\CommonFiles\text-generation-webui\modules\models.py", line 79, in load_model
    output = load_func_map[loader](model_name)
  File "C:\ProgramData\CommonFiles\text-generation-webui\modules\models.py", line 210, in huggingface_loader
    model = LoaderClass.from_pretrained(path_to_model, **params)
  File "C:\ProgramData\CommonFiles\text-generation-webui\installer_files\env\lib\site-packages\transformers\models\auto\auto_factory.py", line 565, in from_pretrained
    return model_class.from_pretrained(
  File "C:\ProgramData\CommonFiles\text-generation-webui\installer_files\env\lib\site-packages\transformers\modeling_utils.py", line 3222, in from_pretrained
    max_memory = get_balanced_memory(
  File "C:\ProgramData\CommonFiles\text-generation-webui\installer_files\env\lib\site-packages\accelerate\utils\modeling.py", line 771, in get_balanced_memory
    max_memory = get_max_memory(max_memory)
  File "C:\ProgramData\CommonFiles\text-generation-webui\installer_files\env\lib\site-packages\accelerate\utils\modeling.py", line 658, in get_max_memory
    max_memory[key] = convert_file_size_to_int(max_memory[key])
  File "C:\ProgramData\CommonFiles\text-generation-webui\installer_files\env\lib\site-packages\accelerate\utils\modeling.py", line 92, in convert_file_size_to_int
    raise ValueError(err_msg)
ValueError: `size` 0MiB is not in a valid format. Use an integer for bytes, or a string with an unit (like '5.0GB').

System Info

OS: Windows Server 2022
GPUs: Nvidia H100 x 2

PyroGenesis avatar Oct 05 '23 23:10 PyroGenesis

Use CUDA_VISIBLE_DEVICES to hide a GPU.

Ph0rk0z avatar Oct 07 '23 20:10 Ph0rk0z
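
For illustration, the same effect can also be achieved from Python by setting the variable before CUDA is initialized, i.e. before importing torch; a minimal sketch, with GPU index 0 as a placeholder:

import os

# CUDA reads CUDA_VISIBLE_DEVICES when it is first initialized, so this must run
# before importing torch (or anything else that touches CUDA)
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # placeholder: keep only GPU 0 visible

import torch
print(torch.cuda.device_count())  # should now report 1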

The CUDA_VISIBLE_DEVICES trick allowed me to work around this; it seems like a very simple string-formatting issue, though.

FWIW: export CUDA_VISIBLE_DEVICES=0

The bigger issue, IMO, is that if you bypass this by setting 1 MB for your second GPU (or any amount of memory, really), you get another error about not having enough memory.

SlimeQ avatar Oct 08 '23 02:10 SlimeQ

Having this same issue on a Windows 11 PC with an RTX 3060 (12GB) and a Quadro M6000 (24 GB). The software recognizes both GPUs but does not allow me to run the whole workload on the M6000. Setting the 3060 workload to zero generates the error PyroGenesis describes above, and setting it to 100MB while setting the M6000 to 24 GB results in an out-of-memory error loading the model, as SlimeQ noted. I have tried to manage this from the NVIDIA app but that also does not work.

robsalk avatar Oct 09 '23 02:10 robsalk

@robsalk

  1. Find the GPU id of the M6000 (I used torch.cuda.get_device_name(<id>) to verify; a short device-listing snippet is shown after this comment).
  2. Open cmd_windows.bat.
  3. Set the CUDA_VISIBLE_DEVICES environment variable to the GPU id you found.
  4. Run start_windows.bat from the same command prompt.

If all goes well, you should only see a single GPU in the model tab, and it should be your M6000.

PyroGenesis avatar Oct 10 '23 00:10 PyroGenesis
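
The device-listing snippet referenced in step 1 above; it simply prints each CUDA index next to its device name so you can tell which id belongs to which card (a minimal sketch using the same torch call, nothing webui-specific):

import torch

# Print every visible CUDA device index together with its name
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))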

CUDA_VISIBLE_DEVICES does work for me, but I would've preferred a solution from within the webui (maybe using device_map?). But if it's not feasible, I'll keep using the CUDA_VISIBLE_DEVICES workaround.

PyroGenesis avatar Oct 10 '23 00:10 PyroGenesis

Thank you! I will try that.

robsalk avatar Oct 10 '23 03:10 robsalk

export CUDA_VISIBLE_DEVICES=0 did the trick for me

egeres avatar Oct 18 '23 13:10 egeres

Same problem on Ubuntu 22.04. Hiding the second GPU works, but is there some acknowledgement that this is a bug that needs to be fixed? The GPU selector in Models seems irrelevant if you can't actually use it?

Bluejay47 avatar Oct 28 '23 13:10 Bluejay47

The correct values are sent to accelerate and it just does whatever it wants with them when building the device map.

Ph0rk0z avatar Oct 28 '23 17:10 Ph0rk0z

Here is the solution: https://www.reddit.com/r/Oobabooga/comments/17vg2ot/multigpu_psa_how_to_disable_persistent_balanced/

Don't set the last slider to zero; set it to something like 1 or 5 GB. If you have enough space on the first GPU, it will load everything onto that GPU and not use the other.

RandomInternetPreson avatar Nov 14 '23 23:11 RandomInternetPreson
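
At the Transformers level, that workaround amounts to giving the second GPU a small but non-zero cap in max_memory. A minimal sketch, assuming a balanced ("auto") device map; the model path, memory values, and choice of map are assumptions for illustration, not taken from this thread:

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "path/to/Llama-2-7b-chat-hf",           # placeholder model path
    device_map="auto",                       # balanced-style placement across GPUs
    max_memory={0: "20GiB", 1: "1GiB"},      # small non-zero cap on GPU 1 instead of 0
)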

sounds like maybe we should pass device map "sequential"

Ph0rk0z avatar Nov 15 '23 14:11 Ph0rk0z

I think so too, or have an option to send either sequential or balanced.

RandomInternetPreson avatar Nov 15 '23 14:11 RandomInternetPreson
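
For reference, a minimal sketch of what passing a sequential map to Transformers could look like; "sequential" fills GPU 0 up to its cap and then spills onto GPU 1, instead of balancing layers across both cards (the path and memory caps below are illustrative):

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "path/to/Llama-2-7b-chat-hf",            # placeholder model path
    device_map="sequential",                  # fill GPUs in index order
    max_memory={0: "20GiB", 1: "20GiB"},      # illustrative per-GPU caps
)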

I have another tip! If you are like me and want to load other models (which load on GPU 0 by default), you may want to reverse the order in which the GPUs are filled:

Go to line 663 in modeling.py found here: text-generation-webui-main\installer_files\env\Lib\site-packages\accelerate\utils

The line of code is in the get_max_memory function

change: gpu_devices.sort() to: gpu_devices.sort(reverse=True)

Now your GPUs will be loaded in reverse order if you do this along with the first fix I posted. This way you can load reverse-unbalanced and leave GPU 0 free for other models like TTS, STT, and OCR.

RandomInternetPreson avatar Nov 23 '23 20:11 RandomInternetPreson

Same problem here. I can't find modeling.py; I think the files have been updated.

JustDoIt65 avatar Dec 10 '23 11:12 JustDoIt65

@JustDoIt65 I just installed the latest version of oobabooga and the two files I usually edit are still there.

modeling_utils.py can be found here: text-generation-webui-main\installer_files\env\Lib\site-packages\transformers. Using the info at this link you can force the VRAM per GPU: https://www.reddit.com/r/Oobabooga/comments/17vg2ot/multigpu_psa_how_to_disable_persistent_balanced/

modeling.py can be found here: text-generation-webui-main\installer_files\env\Lib\site-packages\accelerate\utils. Change gpu_devices.sort() to gpu_devices.sort(reverse=True).

RandomInternetPreson avatar Dec 10 '23 19:12 RandomInternetPreson

export CUDA_VISIBLE_DEVICES=0 did the trick for me

How do I use it? In the terminal?

vivekratr avatar Jan 08 '24 16:01 vivekratr

@vivekratr

  1. Run cmd_windows.bat (or your OS specific executable) in the text-generation-webui folder.
  2. Set the CUDA_VISIBLE_DEVICES like so:
    • Windows: set CUDA_VISIBLE_DEVICES=0
    • Linux and MacOS: export CUDA_VISIBLE_DEVICES=0
      You can replace 0 with a different index to use a different GPU.
  3. Lastly, run start_windows.bat (or your OS specific executable) from the same shell window.

PyroGenesis avatar Jan 11 '24 20:01 PyroGenesis

I have a theory.

I've noticed at times that the GPU indices are reversed in ooba compared to what the system reports. Is it possible that we're allocating the memory to the wrong GPU?

I'm attempting to get a dual 4060 rig going and just getting flat-out garbled output; it seems there's something wrong with the configuration.

SlimeQ avatar Jan 13 '24 22:01 SlimeQ
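
One possible cause of that index mismatch (an assumption, not something confirmed in this thread): by default CUDA enumerates GPUs "fastest first", while nvidia-smi lists them by PCI bus ID, so the two orderings can disagree on mixed-GPU machines. Forcing PCI bus order makes them match; a minimal sketch:

import os

# Must be set before CUDA is initialized, i.e. before importing torch
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"

import torch
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))  # should now match nvidia-smi's ordering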

I have submitted and tested a solution to properly parse 0MiB directly to accelerate. This way nothing needs to be fixed here. https://github.com/huggingface/accelerate/pull/2507

StoyanStAtanasov avatar Feb 29 '24 02:02 StoyanStAtanasov

I forgot about this issue until I posted about it on reddit today. There is a faster workaround than having to find and edit the transformers files. A while back I dug into this issue when I was trying to optimize the memory balance for QLoRA training.

I first modified the Transformers modeling_utils.py; as of my current install it is in text-generation-webui-main\installer_files\env\Lib\site-packages\transformers.

In the modeling_utils.py file, injecting params['device_map'] = 'sequential' into the code immediately after:

# change device_map into a map if we passed an int, a str or a torch.device
        if isinstance(device_map, torch.device):

forces sequential loading, allowing you to balance the model split as you see fit. However, textgen-webui does pass a device map to transformers, so the same change can be made on the webui side instead.

In text-generation-webui-main\modules there is the file models.py. As of my current install, I had either edited line 179, or put a new line there with the following: params['device_map'] = 'sequential'

Currently, my Transformers modeling_utils.py is unedited, and I have the models.py file edited. With this, Transformers will adhere to the set GPU memory limits: it fills what it sees as GPU0 up to its limit, then starts loading GPU1 up to its limit or until the remainder of the model is placed. I just verified that this is working as of this post; I am able to adjust the model split with a stock transformers install.

Since it is working the same as it did when I first found this workaround, I'll post the memory allocation table I made up at the time:

Model               GPU0/GPU1 max settings   Result
7B, no quant        1000/20000               1069/13031
7B, no quant        3000/20000               3385/10703
7B, no quant        20000/5000               13760/360
7B, no quant        1/20000                  834/13283
70B, load-in-4bit   20000/20000              19514/17439
70B, load-in-4bit   18000/21000              17814/19137
70B, load-in-4bit   17000/21000              16576/20383

As you can see, while it doesn't adhere perfectly to the max memory limits, it does allow adjusting the split relatively close to the limits. May not be exactly the stated issue, but I do believe they are related.

Dalhimar avatar Mar 09 '24 18:03 Dalhimar

I have submitted and tested a solution to properly parse 0MiB directly to accelerate. This way nothing needs to be fixed here. huggingface/accelerate#2507

That works for me. Updated accelerate to 0.27.2 and finally 0MiB actually works to not use a specific GPU.

In my experience with text-gen-webui on Linux, CUDA_VISIBLE_DEVICES gets ignored or overridden and all devices are visible regardless. It is the only application I have seen behave that way on any system I have tested.

Kadah avatar Apr 18 '24 04:04 Kadah
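
For anyone checking whether their accelerate version behaves the same way, a minimal sketch; the module path comes from the traceback above, and the expected results are assumptions based on the linked PR:

from accelerate.utils.modeling import convert_file_size_to_int

# Accelerate treats "GB" as decimal, so "5.0GB" should come back as 5000000000 bytes
print(convert_file_size_to_int("5.0GB"))

# On accelerate releases that include the linked fix this should print 0;
# on older releases it raises the ValueError shown in the logs above
print(convert_file_size_to_int("0MiB"))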

Accelerate is 0.27.2, but I am still having this issue.

Urammar avatar May 04 '24 17:05 Urammar