text-generation-webui
Cannot load Pyg-6B with 8GB VRAM with deepspeed on WSL2
I've got WSL2 Ubuntu running on Windows 11, configured to use 28 GB of RAM. I've tried both the unsharded model and the model sharded into 1 GB chunks.
free -h --giga
              total        used        free      shared  buff/cache   available
Mem:            28G        108M         27G        0.0K        745M         27G
Swap:          7.2G          0B        7.2G
When I try to load the pyg-6b model with:
deepspeed --num_gpus=1 server.py --deepspeed --cai-chat
I get:
[2023-02-28 19:49:23,376] [WARNING] [runner.py:186:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-02-28 19:49:23,448] [INFO] [runner.py:548:main] cmd = /root/miniconda3/envs/textgen/bin/python -u -m deepspeed.launcher.launch --world_info= --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None server.py --deepspeed --cai-chat
[2023-02-28 19:49:25,028] [INFO] [launch.py:142:main] WORLD INFO DICT: {'localhost': [0]}
[2023-02-28 19:49:25,028] [INFO] [launch.py:148:main] nnodes=1, num_local_procs=1, node_rank=0
[2023-02-28 19:49:25,029] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2023-02-28 19:49:25,029] [INFO] [launch.py:162:main] dist_world_size=1
[2023-02-28 19:49:25,029] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0
[2023-02-28 19:49:29,383] [INFO] [comm.py:657:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
Warning: chat mode currently becomes somewhat slower with text streaming on. Consider starting the web UI with the --no-stream option.
Loading the extension "gallery"... Ok.
The following models are available:
- pyg6shard
- pygmalion-350m
Which one do you want to load? 1-2
1
Loading pyg6shard...
[2023-02-28 19:49:33,464] [INFO] [partition_parameters.py:415:__exit__] finished initializing model with 0.54B parameters
Traceback (most recent call last):
  File "/home/user/AI/AItext/oobabooga/text-generation-webui/server.py", line 185, in <module>
    shared.model, shared.tokenizer = load_model(shared.model_name)
  File "/home/user/AI/AItext/oobabooga/text-generation-webui/modules/models.py", line 73, in load_model
    model = AutoModelForCausalLM.from_pretrained(Path(f"models/{shared.model_name}"), torch_dtype=torch.bfloat16 if shared.args.bf16 else torch.float16)
  File "/root/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 471, in from_pretrained
    return model_class.from_pretrained(
  File "/root/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2495, in from_pretrained
    model = cls(config, *model_args, **model_kwargs)
  File "/root/miniconda3/envs/textgen/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 355, in wrapper
    f(module, *args, **kwargs)
  File "/root/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/models/gptj/modeling_gptj.py", line 727, in __init__
    self.transformer = GPTJModel(config)
  File "/root/miniconda3/envs/textgen/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 355, in wrapper
    f(module, *args, **kwargs)
  File "/root/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/models/gptj/modeling_gptj.py", line 480, in __init__
    self.h = nn.ModuleList([GPTJBlock(config) for _ in range(config.n_layer)])
  File "/root/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/models/gptj/modeling_gptj.py", line 480, in <listcomp>
    self.h = nn.ModuleList([GPTJBlock(config) for _ in range(config.n_layer)])
  File "/root/miniconda3/envs/textgen/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 355, in wrapper
    f(module, *args, **kwargs)
  File "/root/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/models/gptj/modeling_gptj.py", line 288, in __init__
    self.mlp = GPTJMLP(inner_dim, config)
  File "/root/miniconda3/envs/textgen/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 355, in wrapper
    f(module, *args, **kwargs)
  File "/root/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/models/gptj/modeling_gptj.py", line 268, in __init__
    self.fc_in = nn.Linear(embed_dim, intermediate_size)
  File "/root/miniconda3/envs/textgen/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 363, in wrapper
    self._post_init_method(module)
  File "/root/miniconda3/envs/textgen/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 760, in _post_init_method
    param.partition()
  File "/root/miniconda3/envs/textgen/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 894, in partition
    self._partition(param_list, has_been_updated=has_been_updated)
  File "/root/miniconda3/envs/textgen/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1038, in _partition
    self._partition_param(param, has_been_updated=has_been_updated)
  File "/root/miniconda3/envs/textgen/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 9, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/root/miniconda3/envs/textgen/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1111, in _partition_param
    partitioned_tensor = get_accelerator().pin_memory(
  File "/root/miniconda3/envs/textgen/lib/python3.10/site-packages/deepspeed/accelerator/cuda_accelerator.py", line 214, in pin_memory
    return tensor.pin_memory()
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
[2023-02-28 19:49:35,039] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 73
[2023-02-28 19:49:35,040] [ERROR] [launch.py:324:sigkill_handler] ['/root/miniconda3/envs/textgen/bin/python', '-u', 'server.py', '--local_rank=0', '--deepspeed', '--cai-chat'] exits with return code = 1
I've managed to load the pyg-350m model just fine with deepspeed. Is deepspeed working incorrectly on WSL? Do you have any clue?
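For reference, the traceback dies inside get_accelerator().pin_memory(), i.e. while pinning (page-locking) host memory for the offloaded parameters, not while allocating VRAM, and WSL2 has historically had trouble with large pinned allocations. If there's a way to override the DeepSpeed config the webui builds (I haven't dug into where it does that), a generic ZeRO-3 parameter-offload config with pinning disabled would look roughly like this, with the keys taken from the DeepSpeed docs; this is a sketch, not the webui's actual config:

{
  "fp16": { "enabled": true },
  "train_micro_batch_size_per_gpu": 1,
  "zero_optimization": {
    "stage": 3,
    "offload_param": {
      "device": "cpu",
      "pin_memory": false
    }
  }
}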
In my experience, deepspeed won't run on WSL, or even over Docker. Maybe someone can prove me wrong.
I got it working! It needs a ton of memory, though: the VM needs more than enough RAM assigned to load the whole thing into system RAM before it gets dumped to VRAM.
Also, you need to jump through a ton of hoops to get the CUDA toolkit working with conda-forge on WSL (as well as fixing WSL's DNS passthrough and GPU passthrough issues, of course; a sketch of the usual DNS workaround follows). But... it works.
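For the DNS part, the standard workaround (generic WSL settings, nothing specific to this repo) is to stop WSL from regenerating resolv.conf and pin a resolver yourself:

# /etc/wsl.conf
[network]
generateResolvConf = false

# then, from Windows, run `wsl --shutdown`, restart WSL, and set a resolver:
echo "nameserver 1.1.1.1" | sudo tee /etc/resolv.conf

The guide linked below covers this in more detail.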
I've written a guide on how to do this for total noobs in the context of pygmalion over on the pygmalion subreddit HERE.
TL;DR: WSL2's out-of-the-box DNS and CUDA setup is completely broken, and that is the issue. Oh, and the error -9 that pops up a lot with deepspeed? That's the VM running out of RAM. Deepspeed loads the entire model into system RAM before offloading to VRAM, so you have to reconfigure how much RAM the VM gets (see the .wslconfig sketch below); the default 8-12 GB allocation is too small.
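The RAM limit lives in .wslconfig in your Windows user profile (%UserProfile%\.wslconfig); the numbers here are just examples, size them to your machine:

[wsl2]
memory=28GB
swap=8GB

Run wsl --shutdown afterwards so the new limits take effect on the next WSL start.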
I'm using Docker over WSL2 and I was able to run GPT-NeoXT-Chat-Base-20B using:
python server.py --auto-devices --gpu-memory 8 --cai-chat --load-in-8bit --listen --listen-port 8888 --model=GPT-NeoXT-Chat-Base-20B
which is 38.4 GB in size, and I didn't need to update the .wslconfig file.
Maybe I'm missing something. Also, I was getting error code -11, not -9.
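Side note on those codes: a negative return code is just the signal that killed the process, so -9 is SIGKILL (on Linux, usually the kernel OOM killer when the VM runs out of RAM) and -11 is SIGSEGV (a crash). You can check for OOM kills inside WSL with standard tools, e.g.:

dmesg | grep -i -E "killed process|out of memory"

If nothing shows up there, a -11 exit was a genuine segfault rather than memory pressure.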
This issue has been closed due to inactivity for 30 days. If you believe it is still relevant, please leave a comment below.