text-generation-inference
Quantization of Falcon 40B
System Info
Running on a single A100 with 16 cores and 128 GB of RAM.
Information
- [X] Docker
- [ ] The CLI directly
Tasks
- [ ] An officially supported command
- [X] My own modifications
Reproduction
docker run --gpus all --shm-size 1g -p 8080:80 -v /mnt/huggingface/hub:/data ghcr.io/huggingface/text-generation-inference:0.8.2 --model-id 'tiiuae/falcon-40b-instruct' --quantize gptq
docker run --gpus all --shm-size 1g -p 8080:80 -v /mnt/huggingface/hub:/data ghcr.io/huggingface/text-generation-inference:0.8.2 --model-id 'tiiuae/falcon-40b-instruct' --quantize bitsandbytes
For gptq:
You are using a model of type RefinedWeb to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
Traceback (most recent call last):
File "/opt/conda/bin/text-generation-server", line 8, in <module>
sys.exit(app())
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 67, in serve
server.serve(model_id, revision, sharded, quantize, trust_remote_code, uds_path)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 155, in serve
asyncio.run(serve_inner(model_id, revision, sharded, quantize, trust_remote_code))
File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
return future.result()
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 124, in serve_inner
model = get_model(model_id, revision, sharded, quantize, trust_remote_code)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py", line 220, in get_model
return FlashRW(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_rw.py", line 68, in __init__
self.load_weights(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_rw.py", line 113, in load_weights
model.post_load_weights(quantize)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 732, in post_load_weights
self.transformer.post_load_weights(quantize)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py", line 629, in post_load_weights
layer.self_attention.query_key_value.prepare_weights(quantize)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/layers.py", line 53, in prepare_weights
raise NotImplementedError("`gptq` is not implemented for now")
NotImplementedError: `gptq` is not implemented for now
For bitsandbytes:
/opt/conda/lib/python3.9/site-packages/bitsandbytes/cextension.py:33: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
warn("The installed version of bitsandbytes was compiled without GPU support. "
Traceback (most recent call last):
File "/opt/conda/bin/text-generation-server", line 8, in <module>
sys.exit(app())
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 67, in serve
server.serve(model_id, revision, sharded, quantize, trust_remote_code, uds_path)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 155, in serve
asyncio.run(serve_inner(model_id, revision, sharded, quantize, trust_remote_code))
File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
return future.result()
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 124, in serve_inner
model = get_model(model_id, revision, sharded, quantize, trust_remote_code)
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py", line 227, in get_model
return RW(
File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/rw.py", line 22, in __init__
raise ValueError("quantization is not available on CPU")
ValueError: quantization is not available on CPU
Expected behavior
The model should load correctly. Alternatively, we could quantize it first and then load the quantized weights directly (I don't see any docs about this; I will try it later).
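For context, by "quantize it first" I mean something roughly like loading the model in 8-bit outside of TGI with transformers + bitsandbytes. This is only a sketch of what I have in mind, untested, and the prompt and flags are my own assumptions rather than anything from the TGI docs:

```python
# Rough sketch (untested): load falcon-40b-instruct in 8-bit directly with
# transformers + bitsandbytes, independent of text-generation-inference.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/falcon-40b-instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_8bit=True,       # bitsandbytes int8 quantization
    device_map="auto",       # spread layers across the available GPU(s)
    trust_remote_code=True,  # Falcon ships custom (RefinedWeb) modeling code
)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```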
Hello!
Can you run the second command with the --env argument? It seems that your GPU is not detected.
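As a quick sanity check (not an official TGI command), something roughly like this run inside the container should tell you whether CUDA is visible to Python at all:

```python
# Minimal check (just a suggestion, not part of TGI): does PyTorch see a GPU
# inside the container?
import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device 0:", torch.cuda.get_device_name(0))
```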
docker run --gpus all --shm-size 1g -p 8080:80 -v /mnt/huggingface/hub:/data ghcr.io/huggingface/text-generation-inference:0.8.2 --model-id 'tiiuae/falcon-40b-instruct' --quantize bitsandbytes --env
2023-06-08T02:04:38.128465Z INFO text_generation_launcher: Runtime environment:
Target: x86_64-unknown-linux-gnu
Cargo version: 1.69.0
Commit sha: e7248fe90e27c7c8e39dd4cac5874eb9f96ab182
Docker label: sha-e7248fe
nvidia-smi:
Thu Jun 8 02:04:37 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17 Driver Version: 525.105.17 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100-SXM... On | 00000000:00:07.0 Off | 0 |
| N/A 28C P0 51W / 400W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
2023-06-08T02:04:38.128479Z INFO text_generation_launcher: Args { model_id: "tiiuae/falcon-40b-instruct", revision: None, sharded: None, num_shard: None, quantize: Some(Bitsandbytes), trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_input_length: 1000, max_total_tokens: 1512, max_batch_size: None, waiting_served_ratio: 1.2, max_batch_total_tokens: 32000, max_waiting_tokens: 20, port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: false, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, env: true }
2023-06-08T02:04:38.128533Z INFO text_generation_launcher: Starting download process.
2023-06-08T02:04:43.832487Z INFO download: text_generation_launcher: Files are already present on the host. Skipping download.
2023-06-08T02:04:44.237015Z INFO text_generation_launcher: Successfully downloaded weights.
2023-06-08T02:04:44.237114Z INFO text_generation_launcher: Starting shard 0
2023-06-08T02:17:37.811248Z INFO shard-manager: text_generation_launcher: Server started at unix:///tmp/text-generation-server-0 rank=0
2023-06-08T02:17:37.813006Z INFO text_generation_launcher: Shard 0 ready in 773.575523411s
2023-06-08T02:17:37.876877Z INFO text_generation_launcher: Starting Webserver
2023-06-08T02:17:39.553950Z INFO text_generation_router: router/src/main.rs:178: Connected
Argh... it works this time :) and the speed is around 10 tokens/second.
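In case anyone wants to reproduce the throughput number, here is a rough sketch against the container's /generate endpoint; the prompt, the token count, and the assumption that all requested tokens are actually generated are mine:

```python
# Rough throughput check against the running TGI container (assumes port 8080 as above).
import time
import requests

payload = {
    "inputs": "Write a short poem about GPUs.",
    "parameters": {"max_new_tokens": 100},
}

start = time.time()
resp = requests.post("http://localhost:8080/generate", json=payload, timeout=600)
resp.raise_for_status()
elapsed = time.time() - start

print(resp.json()["generated_text"])
print(f"~{100 / elapsed:.1f} tokens/second (assuming all 100 requested tokens were generated)")
```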