# --quantize bitsandbytes or --quantize gptq does not work
### System Info
Running on runpod.io. The specs on my community pod are: 1x RTX 3090, 9 vCPU, 37 GB RAM.
Pod on runpod.io that runs successfully:
```python
import runpod

gpu_count = 1
num_shard = gpu_count
model_id = "tiiuae/falcon-7b-instruct"
quantize = "bitsandbytes"  # defined but not passed in the successful run
# gpu_type = "NVIDIA RTX A5000"
gpu_type = "NVIDIA GeForce RTX 3090"

pod = runpod.create_pod(
    name=model_id,
    image_name="ghcr.io/huggingface/text-generation-inference:1.0.0",
    gpu_type_id=gpu_type,
    cloud_type="COMMUNITY",
    docker_args=f"--model-id {model_id} --num-shard {num_shard}",
    gpu_count=1,
    volume_in_gb=30,
    container_disk_in_gb=5,
    ports="80/http",
    volume_mount_path="/data",
)
```
Does not run successfully (all settings the same except `docker_args`):

```python
docker_args=f"--model-id {model_id} --num-shard {num_shard} --quantize bitsandbytes",
```
### Information
- [X] Docker
- [ ] The CLI directly
### Tasks
- [X] An officially supported command
- [ ] My own modifications
### Reproduction
Configured with quantization, it fails:
```python
import runpod

gpu_count = 1
num_shard = gpu_count
model_id = "tiiuae/falcon-7b-instruct"
quantize = "bitsandbytes"  # NOTE: also fails with "gptq"
# gpu_type = "NVIDIA RTX A5000"
gpu_type = "NVIDIA GeForce RTX 3090"

pod = runpod.create_pod(
    name=model_id,
    image_name="ghcr.io/huggingface/text-generation-inference:1.0.0",
    gpu_type_id=gpu_type,
    cloud_type="COMMUNITY",
    docker_args=f"--model-id {model_id} --num-shard {num_shard} --quantize {quantize}",
    gpu_count=1,
    volume_in_gb=30,
    container_disk_in_gb=5,
    ports="80/http",
    volume_mount_path="/data",
)
```
**ERROR**

```
2023-08-10T11:30:29.101220272-06:00 /opt/conda/lib/python3.9/site-packages/bitsandbytes/cextension.py:33: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
2023-08-10T11:30:29.101223402-06:00   warn("The installed version of bitsandbytes was compiled without GPU support. "
2023-08-10T11:30:29.101226592-06:00 Traceback (most recent call last):
2023-08-10T11:30:29.101229382-06:00
2023-08-10T11:30:29.101232052-06:00   File "/opt/conda/bin/text-generation-server", line 8, in <module>
```
**NOTE:** setting `quantize` to `gptq` also fails; that error was a `signal 4`.
### Expected behavior
I expected the `--quantize` parameter to work, making inference faster with a lower memory footprint.
Note: Perhaps I misread the documentation (I found it a bit confusing):
From the README:

> **Quantization**
>
> You can also quantize the weights with bitsandbytes to reduce the VRAM requirement:
>
> `make run-falcon-7b-instruct-quantize`
>
> 4bit quantization is available using the NF4 and FP4 data types from bitsandbytes. It can be enabled by providing `--quantize bitsandbytes-nf4` or `--quantize bitsandbytes-fp4` as a command line argument to `text-generation-launcher`.

Does this mean the model needs to be remade? Also, the `bitsandbytes-nf4` and `bitsandbytes-fp4` options are not available; I found only `bitsandbytes` and `gptq` to be accepted options?
Thank you.
> Does this mean the model needs to be remade? Also, the `bitsandbytes-nf4` and `bitsandbytes-fp4` options are not available; I found only `bitsandbytes` and `gptq` to be accepted options?
You need to use the `latest` image for that.
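For example, a minimal sketch of the change (it assumes the rest of the `create_pod` call stays as in the config above; `bitsandbytes-nf4` is the option named in the README, not something verified on runpod here):

```python
# Sketch: same pod config as above, but on the `latest` image, which is
# needed for the NF4/FP4 bitsandbytes options.
pod = runpod.create_pod(
    name=model_id,
    image_name="ghcr.io/huggingface/text-generation-inference:latest",
    gpu_type_id=gpu_type,
    cloud_type="COMMUNITY",
    docker_args=f"--model-id {model_id} --num-shard {num_shard} --quantize bitsandbytes-nf4",
    gpu_count=1,
    volume_in_gb=30,
    container_disk_in_gb=5,
    ports="80/http",
    volume_mount_path="/data",
)
```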
> 2023-08-10T11:30:29.101315592-06:00 ValueError: quantization is not available on CPU
This means that, for whatever reason, the pod you're using cannot see the GPU. There are issues open for that directly at runpod, if I'm not mistaken.
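A quick way to check GPU visibility from inside the container (a sketch; the TGI image ships with PyTorch, so this can be run in the container's Python):

```python
# Run inside the container: if CUDA is not visible here, bitsandbytes
# falls back to its CPU build and TGI raises the "not available on CPU" error.
import torch

print(torch.cuda.is_available())   # expect True on a working pod
print(torch.cuda.device_count())   # expect 1 for this config
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
```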
Thank you. From earlier in the error, it states:

> 2023-08-10T11:30:29.101220272-06:00 /opt/conda/lib/python3.9/site-packages/bitsandbytes/cextension.py:33: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.

Does that mean the issue is perhaps in the bitsandbytes Python library, and not that the pod can't see the GPU? I assume there is a requirements.txt file for the Docker container (I apologize, I don't know where it is)? Thank you.
@solarslurpi it seems more likely to me that the GPU is not detected in the Docker image, and that this error message is bogus, stemming from that. (I can run 1.0.0 with bnb fine in a simple docker + GPU environment.)
I wonder if it is the GPU type. I am "renting" 1x RTX 3090, 9 vCPU, 37 GB RAM. What GPU did you use? Thank you.
An A10G, but the choice of GPU doesn't matter; TGI works on a 3090 for sure. But we've seen people have issues with runpod before, something about shm not being set up properly.
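For reference, the TGI README's shared-memory note says NCCL's SHM usage can be disabled (at some performance cost, and it mainly matters for sharded setups) when `/dev/shm` is too small. A sketch of passing that through runpod, assuming `create_pod` accepts an `env` dict of container environment variables:

```python
# Sketch: work around a too-small /dev/shm by disabling NCCL's
# shared-memory transport, per the TGI README (may reduce performance).
# The `env` argument is assumed here; check runpod-python's docs.
pod = runpod.create_pod(
    name=model_id,
    image_name="ghcr.io/huggingface/text-generation-inference:1.0.0",
    gpu_type_id=gpu_type,
    cloud_type="COMMUNITY",
    docker_args=f"--model-id {model_id} --num-shard {num_shard}",
    env={"NCCL_SHM_DISABLE": "1"},
    gpu_count=1,
    volume_in_gb=30,
    container_disk_in_gb=5,
    ports="80/http",
    volume_mount_path="/data",
)
```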
This might be community-cloud related: as posted in https://github.com/runpod/runpod-python/issues/90#issuecomment-1694621115, I was able to successfully run your config (with a different model) via runpod SECURE with the same quantization settings you chose.
Sorry, I thought this was fixed on the runpod side. Re-opening.
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.