How to run h2oGPT Falcon 40B on 1/2/4 older/consumer GPUs using quantization - 3090/3080/4090/4080/V100
Edit the examples below for your GPU configuration: modify CUDA_VISIBLE_DEVICES and MODEL, and add --load_8bit=True or --load_4bit=True as needed.
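As rough, back-of-the-envelope arithmetic (my own sketch, not from the original thread), the weights alone for a 40B-parameter model take about 2 bytes per parameter in 16-bit, 1 byte in 8-bit, and 0.5 bytes in 4-bit, which is why the examples below use 4, 2, and 1 x 24GB GPUs respectively:

# Rough VRAM estimate for the weights alone (illustrative sketch);
# the KV cache and activations add several GB on top of this.
params = 40e9  # Falcon 40B
for label, bytes_per_param in [("16-bit", 2.0), ("8-bit", 1.0), ("4-bit", 0.5)]:
    print(f"{label}: ~{params * bytes_per_param / 1e9:.0f} GB of weights")
# 16-bit: ~80 GB -> 4 x 24GB GPUs
# 8-bit:  ~40 GB -> 2 x 24GB GPUs
# 4-bit:  ~20 GB -> 1 x 24GB GPU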
Run Falcon 40B h2oGPT on 4 GPUs - 16-bit (FASTEST)
export MODEL=h2oai/h2ogpt-gm-oasst1-en-2048-falcon-40b-v2
#export MODEL=h2oai/h2ogpt-gm-oasst1-en-2048-falcon-7b-v3
export GRADIO_SERVER_PORT=2001
export MODEL_NAME=`echo $MODEL | sed 's@/@_@g'`
export CUDA_VISIBLE_DEVICES=0,1,2,3
python generate.py --base_model=$MODEL --langchain_mode=ChatLLM --visible_langchain_modes="['ChatLLM', 'UserData', 'MyData']" --score_model=None --max_max_new_tokens=2048 --max_new_tokens=512 --infer_devices=False &>> logs.$MODEL_NAME.gradio_chat.txt &
Run Falcon 40B h2oGPT on 2 GPUs with 8-bit (add --load_8bit=True)
export MODEL=h2oai/h2ogpt-gm-oasst1-en-2048-falcon-40b-v2
#export MODEL=h2oai/h2ogpt-gm-oasst1-en-2048-falcon-7b-v3
export GRADIO_SERVER_PORT=2001
export MODEL_NAME=`echo $MODEL | sed 's@/@_@g'`
export CUDA_VISIBLE_DEVICES=0,1
python generate.py --base_model=$MODEL --langchain_mode=ChatLLM --visible_langchain_modes="['ChatLLM', 'UserData', 'MyData']" --score_model=None --max_max_new_tokens=2048 --max_new_tokens=512 --infer_devices=False --load_8bit=True &>> logs.$MODEL_NAME.gradio_chat.txt &
Run Falcon 40B h2oGPT on 1 GPU with 4-bit (add --load_4bit=True)
export MODEL=h2oai/h2ogpt-gm-oasst1-en-2048-falcon-40b-v2
#export MODEL=h2oai/h2ogpt-gm-oasst1-en-2048-falcon-7b-v3
export GRADIO_SERVER_PORT=2001
export MODEL_NAME=`echo $MODEL | sed 's@/@_@g'`
export CUDA_VISIBLE_DEVICES=0
python generate.py --base_model=$MODEL --langchain_mode=ChatLLM --visible_langchain_modes="['ChatLLM', 'UserData', 'MyData']" --score_model=None --max_max_new_tokens=2048 --max_new_tokens=512 --infer_devices=False --load_4bit=True &>> logs.$MODEL_NAME.gradio_chat.txt &
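Before the single-GPU 4-bit run, it may be worth confirming the card really has ~20GB+ free (a desktop Xorg session can hold some memory). A minimal check, assuming a CUDA-enabled PyTorch install:

# Report free/total memory on the first visible GPU before loading the model.
import torch
free, total = torch.cuda.mem_get_info(0)  # returns bytes
print(f"GPU 0: {free / 1e9:.1f} GB free of {total / 1e9:.1f} GB")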
Run Falcon 40B h2oGPT using GPTQ TGI (WIP)
https://github.com/h2oai/h2ogpt/issues/263#issuecomment-1612351796
Trying this, I get an error that the V100's compute capability is below 7.5, which is insufficient... In the case of 16-bit weights, I'm not even sure why that requirement applies.
NotImplementedError: Sharded RefinedWeb requires a CUDA device with capability 7.5, > 8.0 or 9.0. No compatible CUDA device found. rank=0
2023-06-30T14:05:57.681529Z ERROR text_generation_launcher: Shard 1 failed to start:
/opt/conda/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:145: UserWarning: WARNING: Compute capability < 7.5 detected! Only slow 8-bit matmul is supported for your GPU!
  warn(msg)
Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 67, in serve
    server.serve(model_id, revision, sharded, quantize, trust_remote_code, uds_path)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 155, in serve
    asyncio.run(serve_inner(model_id, revision, sharded, quantize, trust_remote_code))
  File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
    return future.result()
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 124, in serve_inner
    model = get_model(model_id, revision, sharded, quantize, trust_remote_code)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py", line 215, in get_model
    raise NotImplementedError(
NotImplementedError: Sharded RefinedWeb requires a CUDA device with capability 7.5, > 8.0 or 9.0. No compatible CUDA device found.
2023-06-30T14:05:57.681604Z INFO text_generation_launcher: Shutting down shards
2023-06-30T14:05:59.713587Z INFO text_generation_launcher: Shard 0 terminated
Error: ShardCannotStart
... the driver is 470.182.03: | NVIDIA-SMI 470.182.03   Driver Version: 470.182.03   CUDA Version: 11.4 |
(I also tried uninstalling bitsandbytes 0.39 and installing 0.38.1; still the same error.)
Can you try directly, without TGI:
CUDA_VISIBLE_DEVICES=0,1,2,3 SAVE_DIR=./save40b python generate.py --base_model=$MODEL --height=500 --debug --langchain_mode=ChatLLM --visible_langchain_modes="['ChatLLM', 'UserData', 'MyData']" --score_model=None --max_max_new_tokens=2048 --max_new_tokens=512 --infer_devices=False &>> logs.$MODEL_NAME.gradio_chat.txt &
Updated the above to skip the text-generation-inference server, which is mainly designed for 80GB GPUs.
Thanks Arno, that works. And indeed it is visibly faster than the 4-bit quantized flavor running on one V100!
Unfortunately, it is not working on my dual-3090 machine:
nvidia-smi
Wed Jul 5 00:12:13 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.108.03 Driver Version: 510.108.03 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:01:00.0 Off | N/A |
| 66% 54C P8 40W / 370W | 5MiB / 24576MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA GeForce ... Off | 00000000:05:00.0 On | N/A |
| 0% 43C P8 30W / 370W | 274MiB / 24576MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1791 G /usr/lib/xorg/Xorg 4MiB |
| 1 N/A N/A 1791 G /usr/lib/xorg/Xorg 112MiB |
| 1 N/A N/A 1967 G /usr/bin/gnome-shell 48MiB |
| 1 N/A N/A 3773 G ...308873437204420081,262144 110MiB |
+-----------------------------------------------------------------------------+
envs:
$ echo $MODEL
h2oai/h2ogpt-gm-oasst1-en-2048-falcon-40b-v2
$ echo $CUDA_VISIBLE_DEVICES
0,1
Running:
python3 generate.py --base_model=$MODEL --langchain_mode=ChatLLM --visible_langchain_modes="['ChatLLM', 'UserData', 'MyData']" --score_model=None --max_max_new_tokens=2048 --max_new_tokens=512 --infer_devices=False --load_4bit=True
leads to this out-of-memory error:
│ /home/peter/.local/lib/python3.10/site-packages/transformers/utils/bitsandbytes.py:91 in │
│ set_module_quantized_tensor_to_device │
│ │
│ 88 │ │ │ if is_8bit: │
│ 89 │ │ │ │ new_value = bnb.nn.Int8Params(new_value, requires_grad=False, **kwargs). │
│ 90 │ │ │ elif is_4bit: │
│ ❱ 91 │ │ │ │ new_value = bnb.nn.Params4bit(new_value, requires_grad=False, **kwargs). │
│ 92 │ │ │ │
│ 93 │ │ │ module._parameters[tensor_name] = new_value │
│ 94 │ │ │ if fp16_statistics is not None: │
│ │
│ /home/peter/.local/lib/python3.10/site-packages/bitsandbytes/nn/modules.py:176 in to │
│ │
│ 173 │ │ device, dtype, non_blocking, convert_to_format = torch._C._nn._parse_to(*args, * │
│ 174 │ │ │
│ 175 │ │ if (device is not None and device.type == "cuda" and self.data.device.type == "c │
│ ❱ 176 │ │ │ return self.cuda(device) │
│ 177 │ │ else: │
│ 178 │ │ │ s = self.quant_state │
│ 179 │ │ │ if s is not None: │
│ │
│ /home/peter/.local/lib/python3.10/site-packages/bitsandbytes/nn/modules.py:153 in cuda │
│ │
│ 150 │ │ return self │
│ 151 │ │
│ 152 │ def cuda(self, device): │
│ ❱ 153 │ │ w = self.data.contiguous().half().cuda(device) │
│ 154 │ │ w_4bit, quant_state = bnb.functional.quantize_4bit(w, blocksize=self.blocksize, │
│ 155 │ │ self.data = w_4bit │
│ 156 │ │ self.quant_state = quant_state │
│ │
│ /home/peter/.local/lib/python3.10/site-packages/torch/utils/_device.py:62 in __torch_function__ │
│ │
│ 59 │ │ kwargs = kwargs or {} │
│ 60 │ │ if func in _device_constructors() and kwargs.get('device') is None: │
│ 61 │ │ │ kwargs['device'] = self.device │
│ ❱ 62 │ │ return func(*args, **kwargs) │
│ 63 │
│ 64 # NB: This is directly called from C++ in torch/csrc/Device.cpp │
│ 65 def device_decorator(device, func): │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
OutOfMemoryError: CUDA out of memory. Tried to allocate 512.00 MiB (GPU 0; 23.70 GiB total capacity; 22.07 GiB already allocated; 354.81 MiB free; 22.71 GiB reserved in total by PyTorch) If reserved
memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
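Side note: as the error message itself suggests, one untested thing to try is capping the allocator's split size via PYTORCH_CUDA_ALLOC_CONF to reduce fragmentation; it must be set before torch first touches the GPU (the 512 below is an illustrative value):

# Equivalent to prefixing the launch with:
#   PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512 python generate.py ...
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512"  # illustrative value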
Theoretically, 48GB across the two 3090s should be enough to load the 40B model (maybe even in 8-bit), but I'm not sure whether this is expected to work out of the box. Briefly looking at generate.py, it seems to use device_map='auto'; I'm not sure if there is a way to specify max memory per GPU... maybe try running that container before generate...
Thanks @barsuna, there is a comment about specifying max_memory per GPU in generate.py line 696:
if model is not None:
    # NOTE: Can specify max_memory={0: max_mem, 1: max_mem}, to shard model
    # NOTE: Some models require avoiding sharding some layers,
    # then would pass no_split_module_classes and give list of those layers.
    from accelerate import infer_auto_device_map
    device_map = infer_auto_device_map(
        model,
        max_memory={0: '21GB', 1: '21GB'},  # <-- I added this
        dtype=torch.float16 if load_half else torch.float32,
    )
I added max_memory={0: '21GB', 1: '21GB'} there (and also to the loader initialization), but neither worked. Could you elaborate on the "try running that container" part?
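For what it's worth, max_memory can also be passed straight to transformers' from_pretrained together with device_map="auto". A sketch under the transformers 4.30-era API, not h2oGPT's actual loader path; the 20GiB caps are illustrative and must leave headroom for activations:

# Hedged sketch: cap per-GPU memory so accelerate shards the 4-bit model
# across both 3090s instead of overfilling GPU 0.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "h2oai/h2ogpt-gm-oasst1-en-2048-falcon-40b-v2",
    device_map="auto",
    max_memory={0: "20GiB", 1: "20GiB"},  # illustrative caps
    load_in_4bit=True,                    # requires bitsandbytes
    torch_dtype=torch.float16,
    trust_remote_code=True,
)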
I got it to work (both 8-bit and 4-bit) by commenting out the following in gen.py. But it is very slow, almost 1 second per word in the UI. Not sure where the bottleneck is.
The problem is here: https://github.com/h2oai/h2ogpt/blob/853fbc3317a193b4d5b4682fc88374bccfb100f5/src/gen.py#L1055. If one just sets "auto", then it spreads the model across the GPUs.
Thanks @pseudotensor! The generation speed is much better now (~10 English tokens per second, ~5 Chinese tokens per second) after pulling the latest changes.
I couldn't get the 40B model to load on a 3080 with 4-bit quantization... has anyone had success with a single 12GB GPU?
I think this can be closed; new data is being collected elsewhere, etc.