# --quantize bitsandbytes or --quantize gptq does not work
### System Info
Running on runpod.io. The specs on my community pod are: 1x RTX 3090, 9 vCPU, 37 GB RAM.
Pod on runpod.io that runs successfully:
```python
import runpod

gpu_count = 1
num_shard = gpu_count
model_id = "tiiuae/falcon-7b-instruct"
quantize = "bitsandbytes"  # defined but not passed in the successful run
# gpu_type = "NVIDIA RTX A5000"
gpu_type = "NVIDIA GeForce RTX 3090"

pod = runpod.create_pod(
    name=model_id,
    image_name="ghcr.io/huggingface/text-generation-inference:1.0.0",
    gpu_type_id=gpu_type,
    cloud_type="COMMUNITY",
    docker_args=f"--model-id {model_id} --num-shard {num_shard}",
    gpu_count=1,
    volume_in_gb=30,
    container_disk_in_gb=5,
    ports="80/http",
    volume_mount_path="/data",
)
```
Does not run successfully (all settings the same except `docker_args`):

```python
docker_args=f"--model-id {model_id} --num-shard {num_shard} --quantize bitsandbytes",
```
### Information
- [X] Docker
- [ ] The CLI directly
### Tasks
- [X] An officially supported command
- [ ] My own modifications
### Reproduction
Configured with quantization, it fails:
```python
import runpod

gpu_count = 1
num_shard = gpu_count
model_id = "tiiuae/falcon-7b-instruct"
quantize = "bitsandbytes"  # NOTE: also fails with "gptq"
# gpu_type = "NVIDIA RTX A5000"
gpu_type = "NVIDIA GeForce RTX 3090"

pod = runpod.create_pod(
    name=model_id,
    image_name="ghcr.io/huggingface/text-generation-inference:1.0.0",
    gpu_type_id=gpu_type,
    cloud_type="COMMUNITY",
    docker_args=f"--model-id {model_id} --num-shard {num_shard} --quantize {quantize}",
    gpu_count=1,
    volume_in_gb=30,
    container_disk_in_gb=5,
    ports="80/http",
    volume_mount_path="/data",
)
```
**ERROR**

```
2023-08-10T11:30:29.101220272-06:00 /opt/conda/lib/python3.9/site-packages/bitsandbytes/cextension.py:33: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
2023-08-10T11:30:29.101223402-06:00   warn("The installed version of bitsandbytes was compiled without GPU support. "
2023-08-10T11:30:29.101226592-06:00 Traceback (most recent call last):
2023-08-10T11:30:29.101229382-06:00
2023-08-10T11:30:29.101232052-06:00   File "/opt/conda/bin/text-generation-server", line 8, in <module>
```
**NOTE:** setting `quantize` to `gptq` also fails; that error was a `signal 4`.
### Expected behavior
I expected the `--quantize` parameter to work, making inference faster with a lower memory footprint.
Note: Perhaps I misread the documentation (I found it a bit confusing):
From the README:

> **Quantization**
>
> You can also quantize the weights with bitsandbytes to reduce the VRAM requirement:
>
> `make run-falcon-7b-instruct-quantize`
>
> 4bit quantization is available using the NF4 and FP4 data types from bitsandbytes. It can be enabled by providing `--quantize bitsandbytes-nf4` or `--quantize bitsandbytes-fp4` as a command line argument to `text-generation-launcher`.

Does this mean the model needs to be remade? Also, the `bitsandbytes-nf4` and `bitsandbytes-fp4` options are not available; I found only `bitsandbytes` and `gptq` to be accepted options?
Thank you.
> Does this mean the model needs to be remade? Also, the `bitsandbytes-nf4` and `bitsandbytes-fp4` options are not available; I found only `bitsandbytes` and `gptq` to be accepted options?
You need to use the `latest` image for that.
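For example, a minimal sketch of the change (it assumes the rest of the `create_pod` call stays as in the config above; `bitsandbytes-nf4` is the option named in the README, not something verified on runpod here):

```python
# Sketch: same pod config as above, but on the `latest` image, which is
# needed for the NF4/FP4 bitsandbytes options.
pod = runpod.create_pod(
    name=model_id,
    image_name="ghcr.io/huggingface/text-generation-inference:latest",
    gpu_type_id=gpu_type,
    cloud_type="COMMUNITY",
    docker_args=f"--model-id {model_id} --num-shard {num_shard} --quantize bitsandbytes-nf4",
    gpu_count=1,
    volume_in_gb=30,
    container_disk_in_gb=5,
    ports="80/http",
    volume_mount_path="/data",
)
```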
> 2023-08-10T11:30:29.101315592-06:00 ValueError: quantization is not available on CPU
This means that, for whatever reason, the pod you're using cannot see the GPU. There are issues open for that directly at runpod, if I'm not mistaken.
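A quick way to check GPU visibility from inside the container (a sketch; the TGI image ships with PyTorch, so this can be run in the container's Python):

```python
# Run inside the container: if CUDA is not visible here, bitsandbytes
# falls back to its CPU build and TGI raises the "not available on CPU" error.
import torch

print(torch.cuda.is_available())   # expect True on a working pod
print(torch.cuda.device_count())   # expect 1 for this config
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
```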
Thank you. From earlier in the error, it states:

> 2023-08-10T11:30:29.101220272-06:00 /opt/conda/lib/python3.9/site-packages/bitsandbytes/cextension.py:33: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.

Does that mean the issue is perhaps in the bitsandbytes Python library, and not that the pod can't see the GPU? I assume there is a requirements.txt file for the Docker container (I apologize, I don't know where it is)? Thank you.
@solarslurpi it seems more likely to me that the GPU is not detected in the Docker image, and that this error message is bogus, stemming from that. (I can run 1.0.0 with bnb fine in a simple docker + GPU environment.)
I wonder if it is the GPU type. I am "renting" 1x RTX 3090, 9 vCPU, 37 GB RAM. What GPU did you use? Thank you.
An A10G, but the choice of GPU doesn't matter; TGI works on a 3090 for sure. But we've seen people have issues with runpod before, something about shm not being set up properly.
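For reference, the TGI README's shared-memory note says NCCL's SHM usage can be disabled (at some performance cost, and it mainly matters for sharded setups) when `/dev/shm` is too small. A sketch of passing that through runpod, assuming `create_pod` accepts an `env` dict of container environment variables:

```python
# Sketch: work around a too-small /dev/shm by disabling NCCL's
# shared-memory transport, per the TGI README (may reduce performance).
# The `env` argument is assumed here; check runpod-python's docs.
pod = runpod.create_pod(
    name=model_id,
    image_name="ghcr.io/huggingface/text-generation-inference:1.0.0",
    gpu_type_id=gpu_type,
    cloud_type="COMMUNITY",
    docker_args=f"--model-id {model_id} --num-shard {num_shard}",
    env={"NCCL_SHM_DISABLE": "1"},
    gpu_count=1,
    volume_in_gb=30,
    container_disk_in_gb=5,
    ports="80/http",
    volume_mount_path="/data",
)
```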
This might be community-cloud related: as posted in https://github.com/runpod/runpod-python/issues/90#issuecomment-1694621115, I was able to successfully run your config (with a different model) via runpod SECURE with the same quantization settings you chose.
Sorry, I thought this was fixed on the runpod side. Re-opening.
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.