text-generation-inference
Not able to install locally
System Info
2024-04-22T09:19:51.209245Z INFO text_generation_launcher: Runtime environment:
Target: x86_64-unknown-linux-gnu
Cargo version: 1.75.0
Commit sha: N/A
Docker label: N/A
nvidia-smi:
Mon Apr 22 09:19:50 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17 Driver Version: 525.105.17 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100-SXM... Off | 00000000:00:04.0 Off | 0 |
| N/A 29C P0 42W / 400W | 0MiB / 40960MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
2024-04-22T09:19:51.209446Z INFO text_generation_launcher: Args { model_id: "bigscience/bloom-560m", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: None, speculate: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_tokens: None, max_input_length: None, max_total_tokens: None, waiting_served_ratio: 1.2, max_batch_prefill_tokens: None, max_batch_total_tokens: None, max_waiting_tokens: 20, max_batch_size: None, cuda_graphs: None, hostname: "0.0.0.0", port: 3000, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: None, weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, tokenizer_config_path: None, disable_grammar_support: false, env: true, max_client_batch_size: 4 }
2024-04-22T09:19:51.209835Z INFO text_generation_launcher: Default `max_input_tokens` to 4095
2024-04-22T09:19:51.209844Z INFO text_generation_launcher: Default `max_total_tokens` to 4096
2024-04-22T09:19:51.209847Z INFO text_generation_launcher: Default `max_batch_prefill_tokens` to 4145
2024-04-22T09:19:51.209850Z INFO text_generation_launcher: Using default cuda graphs [1, 2, 4, 8, 16, 32]
2024-04-22T09:19:51.210103Z INFO download: text_generation_launcher: Starting download process.
2024-04-22T09:19:55.920267Z INFO text_generation_launcher: Files are already present on the host. Skipping download.
2024-04-22T09:19:56.615746Z INFO download: text_generation_launcher: Successfully downloaded weights.
2024-04-22T09:19:56.616115Z INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-04-22T09:20:01.251224Z ERROR text_generation_launcher: exllamav2_kernels not installed.
2024-04-22T09:20:01.286558Z WARN text_generation_launcher: We're not using custom kernels.
2024-04-22T09:20:01.329486Z WARN text_generation_launcher: Could not import Flash Attention enabled models: cannot import name 'FastLayerNorm' from 'text_generation_server.utils.layers' (/home/shwu/labs/TGI/text-generation-inference-2.0.1/server/text_generation_server/utils/layers.py)
2024-04-22T09:20:01.355485Z WARN text_generation_launcher: Could not import Mamba: cannot import name 'FastRMSNorm' from 'text_generation_server.utils.layers' (/home/shwu/labs/TGI/text-generation-inference-2.0.1/server/text_generation_server/utils/layers.py)
2024-04-22T09:20:02.122101Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:
Traceback (most recent call last):
File "/home/shwu/labs/TGI/venv/bin/text-generation-server", line 8, in <module>
sys.exit(app())
File "/home/shwu/labs/TGI/text-generation-inference-2.0.1/server/text_generation_server/cli.py", line 71, in serve
from text_generation_server import server
File "/home/shwu/labs/TGI/text-generation-inference-2.0.1/server/text_generation_server/server.py", line 16, in <module>
from text_generation_server.models.vlm_causal_lm import VlmCausalLMBatch
File "/home/shwu/labs/TGI/text-generation-inference-2.0.1/server/text_generation_server/models/vlm_causal_lm.py", line 14, in <module>
from text_generation_server.models.flash_mistral import (
File "/home/shwu/labs/TGI/text-generation-inference-2.0.1/server/text_generation_server/models/flash_mistral.py", line 18, in <module>
from text_generation_server.models.custom_modeling.flash_mistral_modeling import (
File "/home/shwu/labs/TGI/text-generation-inference-2.0.1/server/text_generation_server/models/custom_modeling/flash_mistral_modeling.py", line 30, in <module>
from text_generation_server.utils.layers import (
ImportError: cannot import name 'PositionRotaryEmbedding' from 'text_generation_server.utils.layers' (/home/shwu/labs/TGI/text-generation-inference-2.0.1/server/text_generation_server/utils/layers.py)
rank=0
2024-04-22T09:20:02.220814Z ERROR text_generation_launcher: Shard 0 failed to start
2024-04-22T09:20:02.220836Z INFO text_generation_launcher: Shutting down shards
Error: ShardCannotStart
Information
- [ ] Docker
- [x] The CLI directly
Tasks
- [X] An officially supported command
- [ ] My own modifications
Reproduction
I have a local model quantised with AutoAWQ; I even tried TheBloke's AWQ quant of Llama 2 7B from HF directly. I use the command:
# ================= with local install =================
method="awq"
model="/home/shwu/labs/TGI/models/meta-llama/Llama-2-7b-hf-$method"
# model=""
text-generation-launcher --model-id "$model" --quantize $method --huggingface-hub-cache $HUGGINGFACE_CACHE 2>&1 | tee "tgi-$method.log"
Expected behavior
The server should start.
I have all the packages installed using the `make` commands mentioned for local installation:
(venv) shwu@a100-spot-altzone-1:~/labs/TGI$ python -c "import pip._internal.operations.freeze; print('\n'.join([p for p in pip._internal.operations.freeze.freeze() if 'exllama' in p or 'vllm' in p or 'flash' in p]))" && bash generate.sh
exllamav2_kernels==0.0.0
flash_attn==2.5.6
vllm==0.4.0.post1+cu122
2024-04-22T09:22:26.077582Z INFO text_generation_launcher: Args { model_id: "/home/shwu/labs/TGI/models/meta-llama/Llama-2-7b-hf-awq", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: Some(Awq), speculate: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_tokens: None, max_input_length: None, max_total_tokens: None, waiting_served_ratio: 1.2, max_batch_prefill_tokens: None, max_batch_total_tokens: None, max_waiting_tokens: 20, max_batch_size: None, cuda_graphs: None, hostname: "0.0.0.0", port: 3000, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some(".cache/"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, tokenizer_config_path: None, disable_grammar_support: false, env: false, max_client_batch_size: 4 }
2024-04-22T09:22:26.077989Z INFO text_generation_launcher: Default `max_input_tokens` to 4095
2024-04-22T09:22:26.077998Z INFO text_generation_launcher: Default `max_total_tokens` to 4096
2024-04-22T09:22:26.078001Z INFO text_generation_launcher: Default `max_batch_prefill_tokens` to 4145
2024-04-22T09:22:26.078003Z INFO text_generation_launcher: Using default cuda graphs [1, 2, 4, 8, 16, 32]
2024-04-22T09:22:26.078233Z INFO download: text_generation_launcher: Starting download process.
2024-04-22T09:22:30.659168Z INFO text_generation_launcher: Files are already present on the host. Skipping download.
2024-04-22T09:22:31.283684Z INFO download: text_generation_launcher: Successfully downloaded weights.
2024-04-22T09:22:31.284013Z INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-04-22T09:22:36.095219Z ERROR text_generation_launcher: exllamav2_kernels not installed.
2024-04-22T09:22:36.131032Z WARN text_generation_launcher: We're not using custom kernels.
2024-04-22T09:22:36.174655Z WARN text_generation_launcher: Could not import Flash Attention enabled models: cannot import name 'FastLayerNorm' from 'text_generation_server.utils.layers' (/home/shwu/labs/TGI/text-generation-inference-2.0.1/server/text_generation_server/utils/layers.py)
2024-04-22T09:22:36.201589Z WARN text_generation_launcher: Could not import Mamba: cannot import name 'FastRMSNorm' from 'text_generation_server.utils.layers' (/home/shwu/labs/TGI/text-generation-inference-2.0.1/server/text_generation_server/utils/layers.py)
2024-04-22T09:22:36.890726Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:
Traceback (most recent call last):
File "/home/shwu/labs/TGI/venv/bin/text-generation-server", line 8, in <module>
sys.exit(app())
File "/home/shwu/labs/TGI/text-generation-inference-2.0.1/server/text_generation_server/cli.py", line 71, in serve
from text_generation_server import server
File "/home/shwu/labs/TGI/text-generation-inference-2.0.1/server/text_generation_server/server.py", line 16, in <module>
from text_generation_server.models.vlm_causal_lm import VlmCausalLMBatch
File "/home/shwu/labs/TGI/text-generation-inference-2.0.1/server/text_generation_server/models/vlm_causal_lm.py", line 14, in <module>
from text_generation_server.models.flash_mistral import (
File "/home/shwu/labs/TGI/text-generation-inference-2.0.1/server/text_generation_server/models/flash_mistral.py", line 18, in <module>
from text_generation_server.models.custom_modeling.flash_mistral_modeling import (
File "/home/shwu/labs/TGI/text-generation-inference-2.0.1/server/text_generation_server/models/custom_modeling/flash_mistral_modeling.py", line 30, in <module>
from text_generation_server.utils.layers import (
ImportError: cannot import name 'PositionRotaryEmbedding' from 'text_generation_server.utils.layers' (/home/shwu/labs/TGI/text-generation-inference-2.0.1/server/text_generation_server/utils/layers.py)
rank=0
2024-04-22T09:22:36.989104Z ERROR text_generation_launcher: Shard 0 failed to start
2024-04-22T09:22:36.989127Z INFO text_generation_launcher: Shutting down shards
Error: ShardCannotStart
You need to re-install vllm and flash-attention-v2: `cd text-generation-inference/server`, then `rm -rf vllm && make install-vllm-cuda` and `rm -rf flash-attention-v2 && make install-flash-attention-v2-cuda`.
They forgot to mention this in the release notes for local installs: https://github.com/huggingface/text-generation-inference/issues/1738. I tried this and it solved my problem.
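For clarity, here is the same rebuild sequence written out (the `make` targets come from the commands quoted above and assume a source checkout of TGI):

```
cd text-generation-inference/server
# Rebuild the vLLM kernels from scratch
rm -rf vllm
make install-vllm-cuda
# Rebuild flash-attention v2 from scratch
rm -rf flash-attention-v2
make install-flash-attention-v2-cuda
```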
I have been installing all of the extensions via those commands for two days now; I also tried using the release v2.0.1 code zip. Let me try this once more with a clean installation.
I feel you, I did exactly the same - installed/deleted about 4 times.
You can follow the steps in the Dockerfile: after compiling flash-attn with `make install-flash-attention-v2-cuda`, the script moves the compiled files into Python's site-packages folder, just like `cp -r /text-generation-inference/server/flash-attention-v2/build/lib.linux-x86_64-cpython-39/* /usr/local/lib/python3.10/site-packages/`
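For reference, a hedged sketch of that copy step, resolving the site-packages path dynamically instead of hard-coding it (the build directory name depends on your Python version and platform):

```
# Assumes the flash-attention-v2 build produced by `make install-flash-attention-v2-cuda`
cd text-generation-inference/server/flash-attention-v2
SITE_PACKAGES=$(python -c "import sysconfig; print(sysconfig.get_paths()['purelib'])")
cp -r build/lib.linux-x86_64-cpython-*/* "$SITE_PACKAGES"/
```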
I have resolved the issues using the following set of install scripts: https://github.com/nyunAI/Faster-LLM-Survey/tree/A100TGIv2.0.1/scripts
Usually, if you have the required versions of cmake, libkineto, protobuf & rust installed, you can directly run
- scripts/install-tgi.sh , then
- scripts/parallel-install-extensions.sh (this installs all the extensions in parallel - flash-attn, flash-attn-v2-cuda, vllm-cuda, exllamav2_kernels, etc.); see the sketch at the end of this comment.
Use the other scripts in the directory as required.
For other system and driver details see https://github.com/nyunAI/Faster-LLM-Survey/blob/A100TGIv2.0.1/experiment_details.txt
P.S. The maintainer can close this; leaving it open for anyone facing a similar issue.
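A hedged sketch of running those two scripts (repo and branch taken from the link above; the cmake/libkineto/protobuf/rust prerequisites are assumed to be installed already):

```
git clone -b A100TGIv2.0.1 https://github.com/nyunAI/Faster-LLM-Survey.git
cd Faster-LLM-Survey
# 1. Build and install TGI itself
bash scripts/install-tgi.sh
# 2. Build the kernel extensions (flash-attn, flash-attn-v2-cuda, vllm-cuda, exllamav2_kernels, ...) in parallel
bash scripts/parallel-install-extensions.sh
```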
When installing vllm for TGI 2.0.1, I came across:
error: triton 2.3.0 is installed but triton==2.1.0 is required by {'torch'}
make: *** [Makefile-vllm:12: install-vllm-cuda] Error 1
Is this because I am using the wrong vllm version? I didn't modify anything in the Makefile-* scripts.
Your PyTorch version might be different. I faced this issue for the same reason: my PyTorch version was higher than torch==2.1.0, so the default triton that was installed was 2.2.0 (AFAIR). Nonetheless, use a fresh virtual env (maybe conda) and install torch==2.1.0, or use install-tgi.sh.
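A hedged sketch of that workaround, assuming a fresh venv and the torch==2.1.0 pin suggested above (per the error message, torch is what requires triton==2.1.0):

```
# Fresh environment so an existing torch/triton pair cannot conflict
python -m venv venv && source venv/bin/activate
pip install torch==2.1.0        # brings in the matching triton==2.1.0
cd text-generation-inference/server
make install-vllm-cuda          # vLLM kernels now build against torch 2.1.0
```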
Build and install rotary and layer_norm from https://github.com/Dao-AILab/flash-attention/tree/main/csrc. This worked for me.
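A hedged sketch of that build, assuming the csrc/rotary and csrc/layer_norm subdirectories of the flash-attention repo (each ships its own setup.py; as far as I can tell these provide the extensions that TGI's layers.py imports):

```
git clone https://github.com/Dao-AILab/flash-attention.git
cd flash-attention
# rotary -> rotary_emb extension (backs PositionRotaryEmbedding)
pip install ./csrc/rotary
# layer_norm -> dropout_layer_norm extension (backs FastLayerNorm / FastRMSNorm)
pip install ./csrc/layer_norm
```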
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.