text-generation-webui
Using 2xGPU but most inference load is lopsided to 1 GPU
Trying to get LLaMA to write a story, but no matter what parameters I set, GPU usage is very lopsided: one GPU does roughly 80% of the work while the other sits almost idle.
# 2x 3090 on a 13900K
python server.py --auto-devices --gpu-memory 16 16 --model llama-30b --load-in-8bit
Is this inherent to how PyTorch/Transformers handle multi-GPU inference, or is something wrong with my setup?
The result is horrible output speed, since everything is bottlenecked on one GPU even though the model is spread across two GPUs.
Thanks
https://github.com/oobabooga/text-generation-webui/issues/147
I would try increasing to --gpu-memory 23 23 or 22 22. These numbers are used to construct the max_memory parameter in AutoModelForCausalLM.from_pretrained. For your exact command-line flags, this is the call that ends up being used:
from pathlib import Path
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    Path('models/llama-30b'),
    low_cpu_mem_usage=True,
    load_in_8bit=True,
    max_memory={0: '16GiB', 1: '16GiB', 'cpu': '99GiB'},
    device_map='auto',
)
I would try asking on the accelerate repository why the second GPU's memory usage is low, since device_map='auto' is handled by their library.
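To see why the second GPU stays underused, you can also ask accelerate directly what placement it would compute for a given memory budget. Below is a minimal sketch, assuming accelerate and transformers are installed and the model path matches the command above; the budget values are simply the ones from that command:

from collections import Counter
from pathlib import Path

from accelerate import infer_auto_device_map, init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained(Path('models/llama-30b'))

# Build the model skeleton without allocating real weights, then ask
# accelerate where it would place each module under this memory budget.
with init_empty_weights():
    empty_model = AutoModelForCausalLM.from_config(config)

device_map = infer_auto_device_map(
    empty_model,
    max_memory={0: '16GiB', 1: '16GiB', 'cpu': '99GiB'},
)

# Count how many modules land on each device to visualize the imbalance.
print(Counter(device_map.values()))

If the model is already loaded with device_map='auto', printing model.hf_device_map shows the placement that was actually used.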
@Ph0rk0z I think so too. The HF port of LLaMA is almost 2x slower than the FB reference implementation run with torch.distributed.run. With the FB repo, I can generate tokens almost twice as fast as with the HF version on 2x 3090, due to a more evenly distributed GPU compute load (not memory load).
@Ph0rk0z I had never dealt with a multi-GPU setup before LLaMA, so if what you say is true, HF doesn't scale at all. HF does a great job of load-balancing VRAM, but if that's all it does, then it scales negatively for token-generation speed with more than one GPU. That would perfectly explain my current issue.
So using the HF version is a no-go when it comes to multi-GPU? That's so weird. Is there any way to run the original leaked models directly instead of converting them to the Transformers format? It seems the conversion is what makes them slower, right?
It is possible to use those original .pth files by reverting to commit bd8aac8fa43daa7bd0e2d3d2e446a403a447c744.
I'm not sure if it is worth it, because only the top_p and temperature parameters were available. I find that without top_k and repetition_penalty, the results are drastically worse.
I also had a problem using two GPUs when testing the 13B 16-bit LLaMA. I have a 3090 (24GB) and a 3060 (12GB). Unfortunately, when using the two together, the VRAM usage on the 3090 caps out at 12GB no matter what --gpu-memory settings I use. To explore the issue I set max_memory in models.py to {0: '21GiB', 1: '10GiB', 'cpu': '0GiB'}, but I get an error because device_map='auto' refuses to place any layers beyond the 12GB. Finally, just out of curiosity, I found that you can actually fully load the model into VRAM if you replace params.append("device_map='auto'") in models.py with a suitable mapping. I went with:
params.append('''device_map={"lm_head": 0, "model.decoder.embed_tokens": 0, "model.decoder.layers.0": 0, "model.decoder.layers.1": 0, "model.decoder.layers.10": 0, "model.decoder.layers.11": 0, "model.decoder.layers.12": 0, "model.decoder.layers.13": 0, "model.decoder.layers.14": 0, "model.decoder.layers.15": 0, "model.decoder.layers.16": 0, "model.decoder.layers.17": 0, "model.decoder.layers.18": 0, "model.decoder.layers.19": 0, "model.decoder.layers.2": 0, "model.decoder.layers.20": 0, "model.decoder.layers.21": 0, "model.decoder.layers.22": 0, "model.decoder.layers.23": 0, "model.decoder.layers.24": 0, "model.decoder.layers.25": 1, "model.decoder.layers.26": 1, "model.decoder.layers.27": 1, "model.decoder.layers.28": 1, "model.decoder.layers.29": 1, "model.decoder.layers.3": 0, "model.decoder.layers.30": 1, "model.decoder.layers.31": 1, "model.decoder.layers.32": 1, "model.decoder.layers.33": 1, "model.decoder.layers.34.attention_norm": 0, "model.decoder.layers.34.feed_forward": 0, "model.decoder.layers.34.ffn_norm": 0, "model.decoder.layers.34.self_attn": 0, "model.decoder.layers.35": 1, "model.decoder.layers.36": 1, "model.decoder.layers.37": 1, "model.decoder.layers.38": 1, "model.decoder.layers.39": 1, "model.decoder.layers.4": 0, "model.decoder.layers.5": 0, "model.decoder.layers.6": 0, "model.decoder.layers.7": 0, "model.decoder.layers.8": 0, "model.decoder.layers.9": 0, "model.decoder.norm": 0}''')
The model loads fine and correctly uses 20GB of the 3090, but when I actually try to run inference I get another error: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0!
I'll probably investigate this further and see if I can figure out why device_map='auto' isn't working correctly, but I thought I'd check in and see if maybe I'm overlooking an easy solution.
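One thing worth trying before hand-writing a map is letting accelerate compute one for the unbalanced pair while forbidding it from splitting a single decoder layer across devices. Below is a minimal sketch, not the web UI's own code: the 21GiB/10GiB budget comes from the comment above, models/llama-13b is a placeholder path, and the class name in no_split_module_classes is an assumption (it differs between the original HF LLaMA port and current transformers releases, so check your model's code):

from pathlib import Path

import torch
from accelerate import infer_auto_device_map, init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM

model_path = Path('models/llama-13b')  # placeholder; point this at your local model directory
config = AutoConfig.from_pretrained(model_path)

with init_empty_weights():
    empty_model = AutoModelForCausalLM.from_config(config)

# Keep each decoder layer on a single GPU so hidden states are not
# split mid-layer; the class name here is an assumption.
device_map = infer_auto_device_map(
    empty_model,
    max_memory={0: '21GiB', 1: '10GiB'},
    no_split_module_classes=['LlamaDecoderLayer'],
    dtype=torch.float16,
)

# Load for real with the computed map.
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,
    device_map=device_map,
)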
@musicurgy did you figure out how to fix that issue? or did you use another framework? im on same hardware.
I'm currently using the 4-bit LLaMA model, which fully supports multi-GPU with different VRAM amounts. Of course, 4-bit 33B LLaMA already fits on one 3090 alone, so it's mostly a moot point anyway. That said, if you really want to run 16-bit across multiple unbalanced GPUs, KoboldAI fully supports it, which is partially why I gave up trying to debug this. If anyone actually wants to fix it, I suspect the issue is almost identical to #219.
I can confirm it. I use two different GPUs, and so far I could not get anything to load onto the second GPU. Even "--gpu-memory 0 8" doesn't change a thing: the layers always fill up GPU 0 instead of using the memory allocated on the second GPU.
I only have one GPU, so multi-GPU is poorly tested in the web UI. If someone can find a fix, please submit it as a PR.
So, I may be way out in left field, but my goal is full-context LLaMA/Alpaca 30B, and I have a 3090 + 2080 in the same rig to figure out how to split properly. I have found that it's really finicky to get the desired split straight from the command line. What works best is to load with --auto-devices and a small model (from the menu or directly), hop into the UI, set the sliders for the GPU split you want, unload the model, and then load the actual model you want; that gives you the split you want. However, because I have a 2080, I am likely running into https://github.com/pytorch/pytorch/issues/31285, where the differing GPU architecture versions mean I need to compile some things from source, though I have only scratched the surface of that investigation. My issues aside, I am not sure how to control where the context growth ends up, device 0 or 1, but I am hoping to find a way to ensure the context-related growth happens on the 24GB 3090.
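On the question of where context-related growth lands: the KV cache and activations are allocated on whichever GPU holds the corresponding layers, so one way to find out is to compare per-device peak memory around a long generation. A minimal sketch, assuming model and tokenizer are already loaded with a multi-GPU device map and that the prompt belongs on cuda:0 (where the embedding layer usually ends up):

import torch

# Reset the peak-memory counters so the measurement covers only this generation.
for i in range(torch.cuda.device_count()):
    torch.cuda.reset_peak_memory_stats(i)

inputs = tokenizer("Once upon a time", return_tensors="pt").to("cuda:0")
with torch.no_grad():
    model.generate(**inputs, max_new_tokens=1500)

# Whichever device shows the larger peak is the one absorbing the
# context-related growth (KV cache and activations).
for i in range(torch.cuda.device_count()):
    peak_gib = torch.cuda.max_memory_allocated(i) / 2**30
    print(f"cuda:{i}: peak {peak_gib:.2f} GiB during generation")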
Hi there!
@Qubitium, try using this setting: --load-in-8bit. I had the same bug, and this parameter fixed it.
This issue has been closed due to inactivity for 6 weeks. If you believe it is still relevant, please leave a comment below. You can tag a developer in your comment.