text-generation-inference
Quantization of Falcon 40B
System Info
Running on a single A100 with 16 cores and 128 GB of RAM.
Information
- [X] Docker
- [ ] The CLI directly
Tasks
- [ ] An officially supported command
- [X] My own modifications
Reproduction
docker run --gpus all --shm-size 1g -p 8080:80 -v /mnt/huggingface/hub:/data ghcr.io/huggingface/text-generation-inference:0.8.2 --model-id 'tiiuae/falcon-40b-instruct' --quantize gptq
docker run --gpus all --shm-size 1g -p 8080:80 -v /mnt/huggingface/hub:/data ghcr.io/huggingface/text-generation-inference:0.8.2 --model-id 'tiiuae/falcon-40b-instruct' --quantize bitsandbytes
For gptq:
You are using a model of type RefinedWeb to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
Traceback (most recent call last):
File "/opt/conda/bin/text-generation-server", line 8, in <module>
sys.exit(app())
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 67, in serve
server.serve(model_id, revision, sharded, quantize, trust_remote_code, uds_path)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 155, in serve
asyncio.run(serve_inner(model_id, revision, sharded, quantize, trust_remote_code))
File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
return future.result()
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 124, in serve_inner
model = get_model(model_id, revision, sharded, quantize, trust_remote_code)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py", line 220, in get_model
return FlashRW(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_rw.py", line 68, in __init__
self.load_weights(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_rw.py", line 113, in load_weights
model.post_load_weights(quantize)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 732, in post_load_weights
self.transformer.post_load_weights(quantize)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 629, in post_load_weights
layer.self_attention.query_key_value.prepare_weights(quantize)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/layers.py", line 53, in prepare_weights
raise NotImplementedError("`gptq` is not implemented for now")
NotImplementedError: `gptq` is not implemented for now
For bitsandbytes:
/opt/conda/lib/python3.9/site-packages/bitsandbytes/cextension.py:33: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
warn("The installed version of bitsandbytes was compiled without GPU support. "
Traceback (most recent call last):
File "/opt/conda/bin/text-generation-server", line 8, in <module>
sys.exit(app())
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 67, in serve
server.serve(model_id, revision, sharded, quantize, trust_remote_code, uds_path)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 155, in serve
asyncio.run(serve_inner(model_id, revision, sharded, quantize, trust_remote_code))
File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
return future.result()
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 124, in serve_inner
model = get_model(model_id, revision, sharded, quantize, trust_remote_code)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py", line 227, in get_model
return RW(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/rw.py", line 22, in __init__
raise ValueError("quantization is not available on CPU")
ValueError: quantization is not available on CPU
Expected behavior
The model should load correctly. Alternatively, we could quantize it first and then load the quantized weights directly (I don't see any docs about this; I will try it later).
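For context, by "quantize it first" I mean something roughly like loading the model in 8-bit outside of TGI with transformers + bitsandbytes. This is only a sketch of what I have in mind, untested, and the prompt and flags are my own assumptions rather than anything from the TGI docs:

```python
# Rough sketch (untested): load falcon-40b-instruct in 8-bit directly with
# transformers + bitsandbytes, independent of text-generation-inference.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/falcon-40b-instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_8bit=True,       # bitsandbytes int8 quantization
    device_map="auto",       # spread layers across the available GPU(s)
    trust_remote_code=True,  # Falcon ships custom (RefinedWeb) modeling code
)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```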
Hello!
Can you run the second command with the --env argument? It seems that your GPU is not detected.
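As a quick sanity check (not an official TGI command), something roughly like this run inside the container should tell you whether CUDA is visible to Python at all:

```python
# Minimal check (just a suggestion, not part of TGI): does PyTorch see a GPU
# inside the container?
import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device 0:", torch.cuda.get_device_name(0))
```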
docker run --gpus all --shm-size 1g -p 8080:80 -v /mnt/huggingface/hub:/data ghcr.io/huggingface/text-generation-inference:0.8.2 --model-id 'tiiuae/falcon-40b-instruct' --quantize bitsandbytes --env
2023-06-08T02:04:38.128465Z INFO text_generation_launcher: Runtime environment:
Target: x86_64-unknown-linux-gnu
Cargo version: 1.69.0
Commit sha: e7248fe90e27c7c8e39dd4cac5874eb9f96ab182
Docker label: sha-e7248fe
nvidia-smi:
Thu Jun 8 02:04:37 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17 Driver Version: 525.105.17 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100-SXM... On | 00000000:00:07.0 Off | 0 |
| N/A 28C P0 51W / 400W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
2023-06-08T02:04:38.128479Z INFO text_generation_launcher: Args { model_id: "tiiuae/falcon-40b-instruct", revision: None, sharded: None, num_shard: None, quantize: Some(Bitsandbytes), trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_input_length: 1000, max_total_tokens: 1512, max_batch_size: None, waiting_served_ratio: 1.2, max_batch_total_tokens: 32000, max_waiting_tokens: 20, port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, env: true }
2023-06-08T02:04:38.128533Z INFO text_generation_launcher: Starting download process.
2023-06-08T02:04:43.832487Z INFO download: text_generation_launcher: Files are already present on the host. Skipping download.
2023-06-08T02:04:44.237015Z INFO text_generation_launcher: Successfully downloaded weights.
2023-06-08T02:04:44.237114Z INFO text_generation_launcher: Starting shard 0
2023-06-08T02:17:37.811248Z INFO shard-manager: text_generation_launcher: Server started at unix:///tmp/text-generation-server-0 rank=0
2023-06-08T02:17:37.813006Z INFO text_generation_launcher: Shard 0 ready in 773.575523411s
2023-06-08T02:17:37.876877Z INFO text_generation_launcher: Starting Webserver
2023-06-08T02:17:39.553950Z INFO text_generation_router: router/src/main.rs:178: Connected
Argh... it works this time :) and the speed is around 10 tokens/second.
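In case anyone wants to reproduce the throughput number, here is a rough sketch against the container's /generate endpoint; the prompt, the token count, and the assumption that all requested tokens are actually generated are mine:

```python
# Rough throughput check against the running TGI container (assumes port 8080 as above).
import time
import requests

payload = {
    "inputs": "Write a short poem about GPUs.",
    "parameters": {"max_new_tokens": 100},
}

start = time.time()
resp = requests.post("http://localhost:8080/generate", json=payload, timeout=600)
resp.raise_for_status()
elapsed = time.time() - start

print(resp.json()["generated_text"])
print(f"~{100 / elapsed:.1f} tokens/second (assuming all 100 requested tokens were generated)")
```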