
How to run h2oGPT Falcon 40B on 1/2/4 older/consumer GPUs using quantization - 3090/3080/4090/4080/V100

arnocandel opened this issue on Jun 30 '23 · 9 comments

Edit the examples below for your GPU configuration: modify CUDA_VISIBLE_DEVICES and MODEL, and add --load_8bit=True or --load_4bit=True as needed.

Run Falcon 40B h2oGPT on 4 GPUs - 16 bit (FASTEST)

export MODEL=h2oai/h2ogpt-gm-oasst1-en-2048-falcon-40b-v2
#export MODEL=h2oai/h2ogpt-gm-oasst1-en-2048-falcon-7b-v3
export GRADIO_SERVER_PORT=2001
export MODEL_NAME=`echo $MODEL | sed 's@/@_@g'`
export CUDA_VISIBLE_DEVICES=0,1,2,3

python generate.py --base_model=$MODEL --langchain_mode=ChatLLM --visible_langchain_modes="['ChatLLM', 'UserData', 'MyData']" --score_model=None --max_max_new_tokens=2048 --max_new_tokens=512 --infer_devices=False &>> logs.$MODEL_NAME.gradio_chat.txt &
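Since this runs in the background and appends to a log file, a quick way to follow progress and confirm the weights are sharding across all four GPUs (a generic check, not part of h2oGPT itself):

# follow the server log
tail -f logs.$MODEL_NAME.gradio_chat.txt

# watch per-GPU memory use while the model loads
watch -n 2 nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv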

Run Falcon 40B h2oGPT on 2 GPUs with 8-bit: add --load_8bit=True

export MODEL=h2oai/h2ogpt-gm-oasst1-en-2048-falcon-40b-v2
#export MODEL=h2oai/h2ogpt-gm-oasst1-en-2048-falcon-7b-v3
export GRADIO_SERVER_PORT=2001
export MODEL_NAME=`echo $MODEL | sed 's@/@_@g'`
export CUDA_VISIBLE_DEVICES=0,1

python generate.py --base_model=$MODEL --langchain_mode=ChatLLM --visible_langchain_modes="['ChatLLM', 'UserData', 'MyData']" --score_model=None --max_max_new_tokens=2048 --max_new_tokens=512 --infer_devices=False --load_8bit=True &>> logs.$MODEL_NAME.gradio_chat.txt &

Run Falcon 40B h2oGPT on 1 GPU with 4-bit: add --load_4bit=True

export MODEL=h2oai/h2ogpt-gm-oasst1-en-2048-falcon-40b-v2
#export MODEL=h2oai/h2ogpt-gm-oasst1-en-2048-falcon-7b-v3
export GRADIO_SERVER_PORT=2001
export MODEL_NAME=`echo $MODEL | sed 's@/@_@g'`
export CUDA_VISIBLE_DEVICES=0

python generate.py --base_model=$MODEL --langchain_mode=ChatLLM --visible_langchain_modes="['ChatLLM', 'UserData', 'MyData']" --score_model=None --max_max_new_tokens=2048 --max_new_tokens=512 --infer_devices=False --load_4bit=True &>> logs.$MODEL_NAME.gradio_chat.txt &

Run Falcon 40B h2oGPT using GPTQ TGI (WIP)

https://github.com/h2oai/h2ogpt/issues/263#issuecomment-1612351796

arnocandel, Jun 30 '23

Trying this, I get an error that the V100 compute capability (CC) is below 7.5, which is insufficient...

In the case of 16-bit weights, I'm not even sure why it is needed?

NotImplementedError: Sharded RefinedWeb requires a CUDA device with capability 7.5, > 8.0 or 9.0. No compatible CUDA device found. rank=0
2023-06-30T14:05:57.681529Z ERROR text_generation_launcher: Shard 1 failed to start:
/opt/conda/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:145: UserWarning: WARNING: Compute capability < 7.5 detected! Only slow 8-bit matmul is supported for your GPU!
  warn(msg)
Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 67, in serve
    server.serve(model_id, revision, sharded, quantize, trust_remote_code, uds_path)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 155, in serve
    asyncio.run(serve_inner(model_id, revision, sharded, quantize, trust_remote_code))
  File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
    return future.result()
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 124, in serve_inner
    model = get_model(model_id, revision, sharded, quantize, trust_remote_code)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py", line 215, in get_model
    raise NotImplementedError(
NotImplementedError: Sharded RefinedWeb requires a CUDA device with capability 7.5, > 8.0 or 9.0. No compatible CUDA device found.

2023-06-30T14:05:57.681604Z INFO text_generation_launcher: Shutting down shards
2023-06-30T14:05:59.713587Z INFO text_generation_launcher: Shard 0 terminated
Error: ShardCannotStart

... driver is 470.182.03 | NVIDIA-SMI 470.182.03 Driver Version: 470.182.03 CUDA Version: 11.4 |

(I also tried uninstalling bitsandbytes 0.39 and installing 0.38.1; still the same error.)
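For reference, the compute capability TGI complains about can be checked directly with torch (a V100 reports (7, 0), below the 7.5 that the sharded RefinedWeb path requires):

python -c "import torch; print([torch.cuda.get_device_capability(i) for i in range(torch.cuda.device_count())])"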

barsuna, Jun 30 '23

Can you try directly without TGI: CUDA_VISIBLE_DEVICES=0,1,2,3 SAVE_DIR=./save40b python generate.py --base_model=$MODEL --height=500 --debug --langchain_mode=ChatLLM --visible_langchain_modes="['ChatLLM', 'UserData', 'MyData']" --score_model=None --max_max_new_tokens=2048 --max_new_tokens=512 --infer_devices=False &>> logs.$MODEL_NAME.gradio_chat.txt &

Updated the examples above to skip the text-generation-inference server, which is mainly designed for 80GB GPUs.

arnocandel, Jun 30 '23

Thanks Arno, that works. And indeed it is visibly faster than the 4-bit quantized flavor running on 1 V100!

barsuna, Jun 30 '23

Unfortunately this is not working on my dual 3090 machine:

nvidia-smi
Wed Jul  5 00:12:13 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.108.03   Driver Version: 510.108.03   CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
| 66%   54C    P8    40W / 370W |      5MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  Off  | 00000000:05:00.0  On |                  N/A |
|  0%   43C    P8    30W / 370W |    274MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1791      G   /usr/lib/xorg/Xorg                  4MiB |
|    1   N/A  N/A      1791      G   /usr/lib/xorg/Xorg                112MiB |
|    1   N/A  N/A      1967      G   /usr/bin/gnome-shell               48MiB |
|    1   N/A  N/A      3773      G   ...308873437204420081,262144      110MiB |
+-----------------------------------------------------------------------------+

envs:

$ echo $MODEL
h2oai/h2ogpt-gm-oasst1-en-2048-falcon-40b-v2
$ echo $CUDA_VISIBLE_DEVICES
0,1

Running:

python3 generate.py --base_model=$MODEL --langchain_mode=ChatLLM --visible_langchain_modes="['ChatLLM', 'UserData', 'MyData']" --score_model=None --max_max_new_tokens=2048 --max_new_tokens=512 --infer_devices=False --load_4bit=True 

leads to this out-of-memory error:

│ /home/peter/.local/lib/python3.10/site-packages/transformers/utils/bitsandbytes.py:91 in         │
│ set_module_quantized_tensor_to_device                                                            │
│                                                                                                  │
│    88 │   │   │   if is_8bit:                                                                    │
│    89 │   │   │   │   new_value = bnb.nn.Int8Params(new_value, requires_grad=False, **kwargs).   │
│    90 │   │   │   elif is_4bit:                                                                  │
│ ❱  91 │   │   │   │   new_value = bnb.nn.Params4bit(new_value, requires_grad=False, **kwargs).   │
│    92 │   │   │                                                                                  │
│    93 │   │   │   module._parameters[tensor_name] = new_value                                    │
│    94 │   │   │   if fp16_statistics is not None:                                                │
│                                                                                                  │
│ /home/peter/.local/lib/python3.10/site-packages/bitsandbytes/nn/modules.py:176 in to             │
│                                                                                                  │
│   173 │   │   device, dtype, non_blocking, convert_to_format = torch._C._nn._parse_to(*args, *   │
│   174 │   │                                                                                      │
│   175 │   │   if (device is not None and device.type == "cuda" and self.data.device.type == "c   │
│ ❱ 176 │   │   │   return self.cuda(device)                                                       │
│   177 │   │   else:                                                                              │
│   178 │   │   │   s = self.quant_state                                                           │
│   179 │   │   │   if s is not None:                                                              │
│                                                                                                  │
│ /home/peter/.local/lib/python3.10/site-packages/bitsandbytes/nn/modules.py:153 in cuda           │
│                                                                                                  │
│   150 │   │   return self                                                                        │
│   151 │                                                                                          │
│   152 │   def cuda(self, device):                                                                │
│ ❱ 153 │   │   w = self.data.contiguous().half().cuda(device)                                     │
│   154 │   │   w_4bit, quant_state = bnb.functional.quantize_4bit(w, blocksize=self.blocksize,    │
│   155 │   │   self.data = w_4bit                                                                 │
│   156 │   │   self.quant_state = quant_state                                                     │
│                                                                                                  │
│ /home/peter/.local/lib/python3.10/site-packages/torch/utils/_device.py:62 in __torch_function__  │
│                                                                                                  │
│   59 │   │   kwargs = kwargs or {}                                                               │
│   60 │   │   if func in _device_constructors() and kwargs.get('device') is None:                 │
│   61 │   │   │   kwargs['device'] = self.device                                                  │
│ ❱ 62 │   │   return func(*args, **kwargs)                                                        │
│   63                                                                                             │
│   64 # NB: This is directly called from C++ in torch/csrc/Device.cpp                             │
│   65 def device_decorator(device, func):                                                         │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
OutOfMemoryError: CUDA out of memory. Tried to allocate 512.00 MiB (GPU 0; 23.70 GiB total capacity; 22.07 GiB already allocated; 354.81 MiB free; 22.71 GiB reserved in total by PyTorch) If reserved 
memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
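As the error message itself suggests, one low-effort thing to try first is the allocator setting it mentions; whether it helps for this particular load is untested, and the 512 value below is just a guess:

# allocator hint taken from the error message above; the value is a guess
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512
# then re-run the same python3 generate.py ... --load_4bit=True command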

jinqiupeter, Jul 04 '23

Theoretically, 48GB between the 2x 3090s should be enough to load the 40B model (maybe even in 8-bit), but I'm not sure this is expected to work out of the box. Briefly looking at generate.py, it seems to use device_map='auto'; not sure if there is a way to specify max_mem per GPU... maybe try running that container before generate...
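For reference, accelerate/transformers do expose a per-GPU cap via max_memory. A rough hand-written sketch of loading the 40B model in 8-bit across two 24GB cards follows; it is an illustration only, not the code path generate.py actually takes:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "h2oai/h2ogpt-gm-oasst1-en-2048-falcon-40b-v2"

# cap each 3090 at ~21GiB so activations and overhead still fit
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    max_memory={0: "21GiB", 1: "21GiB"},
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    torch_dtype=torch.float16,
    trust_remote_code=True,  # Falcon models needed this at the time
)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)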

barsuna, Jul 05 '23

Thanks @barsuna, there is a comment about specifying max_memory per GPU in generate.py at line 696:

    if model is not None:
        # NOTE: Can specify max_memory={0: max_mem, 1: max_mem}, to shard model
        # NOTE: Some models require avoiding sharding some layers,
        # then would pass no_split_module_classes and give list of those layers.
        from accelerate import infer_auto_device_map
        device_map = infer_auto_device_map(
            model,
            max_memory={0: '21GB', 1: '21GB'},  # <-------------------------- I added this
            dtype=torch.float16 if load_half else torch.float32,
        )

I added max_memory={0: '21GB', 1: '21GB'} (and also added it where the loader is instantiated), but neither worked.

Could you elaborate on the "try running that container" part?

jinqiupeter, Jul 06 '23

I got it to work (both 8-bit and 4-bit) by commenting out the following in gen.py: [Screenshot from 2023-07-10 23-59-36] But it is very, very slow, almost 1 second per word in the UI. Not sure where the bottleneck is.

jinqiupeter, Jul 10 '23

The problem is here: https://github.com/h2oai/h2ogpt/blob/853fbc3317a193b4d5b4682fc88374bccfb100f5/src/gen.py#L1055. If one just sets "auto" then it spreads the model across devices.

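One way to see what "auto" actually decided is to print the dispatch map after the model loads (hf_device_map is set by accelerate/transformers when a device_map is used):

# shows which GPU (or "cpu"/"disk") each module group was placed on
print(model.hf_device_map)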

pseudotensor, Jul 11 '23

Thanks @pseudotensor! The generation speed is much better now (~10 English tokens per second, and 5 Chinese tokens per second) after pulling the latest changes.

jinqiupeter, Jul 12 '23

I couldn't get the 40B model to load on a 3080 with 4-bit quantization... has anyone had success with a single 12GB GPU?
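For rough context, a back-of-envelope estimate (not a measurement): the 4-bit weights of a 40B-parameter model alone are on the order of 20 GB, before KV cache and activations, so a 12GB card is unlikely to fit the 40B model; the 7B variant mentioned above is a more realistic target for 12GB.

# back-of-envelope: 4-bit weight footprint only, ignoring KV cache / activations / overhead
params = 40e9
print(params * 0.5 / 2**30, "GiB")  # ~18.6 GiB just for the weights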

auwsom, Jul 20 '23

I think this can be closed. New data collection elsewhere etc.

pseudotensor, Aug 31 '23