text-generation-inference
Not able to install locally
System Info
2024-04-22T09:19:51.209245Z INFO text_generation_launcher: Runtime environment:
Target: x86_64-unknown-linux-gnu
Cargo version: 1.75.0
Commit sha: N/A
Docker label: N/A
nvidia-smi:
Mon Apr 22 09:19:50 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17 Driver Version: 525.105.17 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100-SXM... Off | 00000000:00:04.0 Off | 0 |
| N/A 29C P0 42W / 400W | 0MiB / 40960MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
2024-04-22T09:19:51.209446Z INFO text_generation_launcher: Args { model_id: "bigscience/bloom-560m", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: None, speculate: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_tokens: None, max_input_length: None, max_total_tokens: None, waiting_served_ratio: 1.2, max_batch_prefill_tokens: None, max_batch_total_tokens: None, max_waiting_tokens: 20, max_batch_size: None, cuda_graphs: None, hostname: "0.0.0.0", port: 3000, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: None, weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, tokenizer_config_path: None, disable_grammar_support: false, env: true, max_client_batch_size: 4 }
2024-04-22T09:19:51.209835Z INFO text_generation_launcher: Default `max_input_tokens` to 4095
2024-04-22T09:19:51.209844Z INFO text_generation_launcher: Default `max_total_tokens` to 4096
2024-04-22T09:19:51.209847Z INFO text_generation_launcher: Default `max_batch_prefill_tokens` to 4145
2024-04-22T09:19:51.209850Z INFO text_generation_launcher: Using default cuda graphs [1, 2, 4, 8, 16, 32]
2024-04-22T09:19:51.210103Z INFO download: text_generation_launcher: Starting download process.
2024-04-22T09:19:55.920267Z INFO text_generation_launcher: Files are already present on the host. Skipping download.
2024-04-22T09:19:56.615746Z INFO download: text_generation_launcher: Successfully downloaded weights.
2024-04-22T09:19:56.616115Z INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-04-22T09:20:01.251224Z ERROR text_generation_launcher: exllamav2_kernels not installed.
2024-04-22T09:20:01.286558Z WARN text_generation_launcher: We're not using custom kernels.
2024-04-22T09:20:01.329486Z WARN text_generation_launcher: Could not import Flash Attention enabled models: cannot import name 'FastLayerNorm' from 'text_generation_server.utils.layers' (/home/shwu/labs/TGI/text-generation-inference-2.0.1/server/text_generation_server/utils/layers.py)
2024-04-22T09:20:01.355485Z WARN text_generation_launcher: Could not import Mamba: cannot import name 'FastRMSNorm' from 'text_generation_server.utils.layers' (/home/shwu/labs/TGI/text-generation-inference-2.0.1/server/text_generation_server/utils/layers.py)
2024-04-22T09:20:02.122101Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:
Traceback (most recent call last):
File "/home/shwu/labs/TGI/venv/bin/text-generation-server", line 8, in <module>
sys.exit(app())
File "/home/shwu/labs/TGI/text-generation-inference-2.0.1/server/text_generation_server/cli.py", line 71, in serve
from text_generation_server import server
File "/home/shwu/labs/TGI/text-generation-inference-2.0.1/server/text_generation_server/server.py", line 16, in <module>
from text_generation_server.models.vlm_causal_lm import VlmCausalLMBatch
File "/home/shwu/labs/TGI/text-generation-inference-2.0.1/server/text_generation_server/models/vlm_causal_lm.py", line 14, in <module>
from text_generation_server.models.flash_mistral import (
File "/home/shwu/labs/TGI/text-generation-inference-2.0.1/server/text_generation_server/models/flash_mistral.py", line 18, in <module>
from text_generation_server.models.custom_modeling.flash_mistral_modeling import (
File "/home/shwu/labs/TGI/text-generation-inference-2.0.1/server/text_generation_server/models/custom_modeling/flash_mistral_modeling.py", line 30, in <module>
from text_generation_server.utils.layers import (
ImportError: cannot import name 'PositionRotaryEmbedding' from 'text_generation_server.utils.layers' (/home/shwu/labs/TGI/text-generation-inference-2.0.1/server/text_generation_server/utils/layers.py)
rank=0
2024-04-22T09:20:02.220814Z ERROR text_generation_launcher: Shard 0 failed to start
2024-04-22T09:20:02.220836Z INFO text_generation_launcher: Shutting down shards
Error: ShardCannotStart
Information
- [ ] Docker
- [x] The CLI directly
Tasks
- [X] An officially supported command
- [ ] My own modifications
Reproduction
I have a local model quantised with AutoAWQ; I even tried TheBloke's AWQ quant of Llama 2 7B from HF directly. I use the command:
# ================= with local install =================
method="awq"
model="/home/shwu/labs/TGI/models/meta-llama/Llama-2-7b-hf-$method"
# model=""
text-generation-launcher --model-id "$model" --quantize $method --huggingface-hub-cache $HUGGINGFACE_CACHE 2>&1 | tee "tgi-$method.log"
Expected behavior
The server should start.
I have all the packages installed using the `make` commands mentioned for local installation:
(venv) shwu@a100-spot-altzone-1:~/labs/TGI$ python -c "import pip._internal.operations.freeze; print('\n'.join([p for p in pip._internal.operations.freeze.freeze() if 'exllama' in p or 'vllm' in p or 'flash' in p]))" && bash generate.sh
exllamav2_kernels==0.0.0
flash_attn==2.5.6
vllm==0.4.0.post1+cu122
2024-04-22T09:22:26.077582Z INFO text_generation_launcher: Args { model_id: "/home/shwu/labs/TGI/models/meta-llama/Llama-2-7b-hf-awq", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: Some(Awq), speculate: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_tokens: None, max_input_length: None, max_total_tokens: None, waiting_served_ratio: 1.2, max_batch_prefill_tokens: None, max_batch_total_tokens: None, max_waiting_tokens: 20, max_batch_size: None, cuda_graphs: None, hostname: "0.0.0.0", port: 3000, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some(".cache/"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, tokenizer_config_path: None, disable_grammar_support: false, env: false, max_client_batch_size: 4 }
2024-04-22T09:22:26.077989Z INFO text_generation_launcher: Default `max_input_tokens` to 4095
2024-04-22T09:22:26.077998Z INFO text_generation_launcher: Default `max_total_tokens` to 4096
2024-04-22T09:22:26.078001Z INFO text_generation_launcher: Default `max_batch_prefill_tokens` to 4145
2024-04-22T09:22:26.078003Z INFO text_generation_launcher: Using default cuda graphs [1, 2, 4, 8, 16, 32]
2024-04-22T09:22:26.078233Z INFO download: text_generation_launcher: Starting download process.
2024-04-22T09:22:30.659168Z INFO text_generation_launcher: Files are already present on the host. Skipping download.
2024-04-22T09:22:31.283684Z INFO download: text_generation_launcher: Successfully downloaded weights.
2024-04-22T09:22:31.284013Z INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-04-22T09:22:36.095219Z ERROR text_generation_launcher: exllamav2_kernels not installed.
2024-04-22T09:22:36.131032Z WARN text_generation_launcher: We're not using custom kernels.
2024-04-22T09:22:36.174655Z WARN text_generation_launcher: Could not import Flash Attention enabled models: cannot import name 'FastLayerNorm' from 'text_generation_server.utils.layers' (/home/shwu/labs/TGI/text-generation-inference-2.0.1/server/text_generation_server/utils/layers.py)
2024-04-22T09:22:36.201589Z WARN text_generation_launcher: Could not import Mamba: cannot import name 'FastRMSNorm' from 'text_generation_server.utils.layers' (/home/shwu/labs/TGI/text-generation-inference-2.0.1/server/text_generation_server/utils/layers.py)
2024-04-22T09:22:36.890726Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:
Traceback (most recent call last):
File "/home/shwu/labs/TGI/venv/bin/text-generation-server", line 8, in <module>
sys.exit(app())
File "/home/shwu/labs/TGI/text-generation-inference-2.0.1/server/text_generation_server/cli.py", line 71, in serve
from text_generation_server import server
File "/home/shwu/labs/TGI/text-generation-inference-2.0.1/server/text_generation_server/server.py", line 16, in <module>
from text_generation_server.models.vlm_causal_lm import VlmCausalLMBatch
File "/home/shwu/labs/TGI/text-generation-inference-2.0.1/server/text_generation_server/models/vlm_causal_lm.py", line 14, in <module>
from text_generation_server.models.flash_mistral import (
File "/home/shwu/labs/TGI/text-generation-inference-2.0.1/server/text_generation_server/models/flash_mistral.py", line 18, in <module>
from text_generation_server.models.custom_modeling.flash_mistral_modeling import (
File "/home/shwu/labs/TGI/text-generation-inference-2.0.1/server/text_generation_server/models/custom_modeling/flash_mistral_modeling.py", line 30, in <module>
from text_generation_server.utils.layers import (
ImportError: cannot import name 'PositionRotaryEmbedding' from 'text_generation_server.utils.layers' (/home/shwu/labs/TGI/text-generation-inference-2.0.1/server/text_generation_server/utils/layers.py)
rank=0
2024-04-22T09:22:36.989104Z ERROR text_generation_launcher: Shard 0 failed to start
2024-04-22T09:22:36.989127Z INFO text_generation_launcher: Shutting down shards
Error: ShardCannotStart
You need to re-install vllm and flash-attention-v2: `cd text-generation-inference/server`, then `rm -rf vllm && make install-vllm-cuda` and `rm -rf flash-attention-v2 && make install-flash-attention-v2-cuda`.
They forgot to mention this in the release notes for local installs: https://github.com/huggingface/text-generation-inference/issues/1738. I tried this and it solved my problem.
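For clarity, here is the same rebuild sequence written out (the `make` targets come from the commands quoted above and assume a source checkout of TGI):

```
cd text-generation-inference/server
# Rebuild the vLLM kernels from scratch
rm -rf vllm
make install-vllm-cuda
# Rebuild flash-attention v2 from scratch
rm -rf flash-attention-v2
make install-flash-attention-v2-cuda
```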
I have been installing all of the extensions via those commands for two days now; I also tried using the release v2.0.1 code zip. Let me try this once more with a clean installation.
I feel you, I did exactly the same - installed/deleted about 4 times.
You can follow the steps in the Dockerfile: after compiling flash-attn with `make install-flash-attention-v2-cuda`, the script moves the compiled files into Python's site-packages folder, just like `cp -r /text-generation-inference/server/flash-attention-v2/build/lib.linux-x86_64-cpython-39/* /usr/local/lib/python3.10/site-packages/`
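For reference, a hedged sketch of that copy step, resolving the site-packages path dynamically instead of hard-coding it (the build directory name depends on your Python version and platform):

```
# Assumes the flash-attention-v2 build produced by `make install-flash-attention-v2-cuda`
cd text-generation-inference/server/flash-attention-v2
SITE_PACKAGES=$(python -c "import sysconfig; print(sysconfig.get_paths()['purelib'])")
cp -r build/lib.linux-x86_64-cpython-*/* "$SITE_PACKAGES"/
```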
I have resolved the issues using the following set of install scripts: https://github.com/nyunAI/Faster-LLM-Survey/tree/A100TGIv2.0.1/scripts
Usually, if you have the required versions of cmake, libkineto, protobuf & rust installed, you can directly run
- scripts/install-tgi.sh , then
- scripts/parallel-install-extensions.sh (this installs all the extensions in parallel - flash-attn, flash-attn-v2-cuda, vllm-cuda, exllamav2_kernels, etc.); see the sketch at the end of this comment.
Use the other scripts in the directory as required.
For other system and driver details see https://github.com/nyunAI/Faster-LLM-Survey/blob/A100TGIv2.0.1/experiment_details.txt
P.S. The maintainer can close this; leaving it open for anyone facing a similar issue.
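A hedged sketch of running those two scripts (repo and branch taken from the link above; the cmake/libkineto/protobuf/rust prerequisites are assumed to be installed already):

```
git clone -b A100TGIv2.0.1 https://github.com/nyunAI/Faster-LLM-Survey.git
cd Faster-LLM-Survey
# 1. Build and install TGI itself
bash scripts/install-tgi.sh
# 2. Build the kernel extensions (flash-attn, flash-attn-v2-cuda, vllm-cuda, exllamav2_kernels, ...) in parallel
bash scripts/parallel-install-extensions.sh
```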
When installing vllm for TGI 2.0.1, I came across:
error: triton 2.3.0 is installed but triton==2.1.0 is required by {'torch'}
make: *** [Makefile-vllm:12: install-vllm-cuda] Error 1
Is this because I am using the wrong vllm version? I didn't modify anything in the Makefile-* scripts.
Your PyTorch version might be different. I faced this issue for the same reason: my PyTorch version was higher than torch==2.1.0, so the default triton that was installed was 2.2.0 (AFAIR). Nonetheless, use a fresh virtual env (maybe conda) and install torch==2.1.0, or use install-tgi.sh.
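A hedged sketch of that workaround, assuming a fresh venv and the torch==2.1.0 pin suggested above (per the error message, torch is what requires triton==2.1.0):

```
# Fresh environment so an existing torch/triton pair cannot conflict
python -m venv venv && source venv/bin/activate
pip install torch==2.1.0        # brings in the matching triton==2.1.0
cd text-generation-inference/server
make install-vllm-cuda          # vLLM kernels now build against torch 2.1.0
```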
Build and install rotary and layer_norm from https://github.com/Dao-AILab/flash-attention/tree/main/csrc. This worked for me.
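A hedged sketch of that build, assuming the csrc/rotary and csrc/layer_norm subdirectories of the flash-attention repo (each ships its own setup.py; as far as I can tell these provide the extensions that TGI's layers.py imports):

```
git clone https://github.com/Dao-AILab/flash-attention.git
cd flash-attention
# rotary -> rotary_emb extension (backs PositionRotaryEmbedding)
pip install ./csrc/rotary
# layer_norm -> dropout_layer_norm extension (backs FastLayerNorm / FastRMSNorm)
pip install ./csrc/layer_norm
```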
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.