Cannot launch with error: exllamav2_kernels not installed.

coderaBruce opened this issue 9 months ago • 4 comments

System Info

I am on PyTorch 2.2.2, CUDA 12.1, gcc 10.3.1, trying to install TGI and run inference locally. I also installed exllamav2 with pip, but launching fails with errors like:

2024-04-30T19:23:48.907883Z INFO text_generation_launcher: Default max_input_tokens to 4095
2024-04-30T19:23:48.907889Z INFO text_generation_launcher: Default max_total_tokens to 4096
2024-04-30T19:23:48.907891Z INFO text_generation_launcher: Default max_batch_prefill_tokens to 4145
2024-04-30T19:23:48.907895Z INFO text_generation_launcher: Using default cuda graphs [1, 2, 4, 8, 16, 32]
2024-04-30T19:23:48.907973Z INFO download: text_generation_launcher: Starting download process.
2024-04-30T19:23:53.010686Z INFO text_generation_launcher: Files are already present on the host. Skipping download.

2024-04-30T19:23:53.814790Z INFO download: text_generation_launcher: Successfully downloaded weights.
2024-04-30T19:23:53.815018Z INFO shard-manager: text_generation_launcher: Starting shard rank=0
2024-04-30T19:23:57.486233Z ERROR text_generation_launcher: exllamav2_kernels not installed.

2024-04-30T19:23:57.543102Z WARN text_generation_launcher: Could not import Flash Attention enabled models: cannot import name 'FastLayerNorm' from 'text_generation_server.utils.layers' (/data3/xli74/LLM/text-generation-inference/server/text_generation_server/utils/layers.py)

2024-04-30T19:23:57.543638Z WARN text_generation_launcher: Could not import Mamba: No module named 'mamba_ssm'

2024-04-30T19:23:58.021319Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:

Traceback (most recent call last):
  File "/home/xli74/.conda/envs/LLM-TGI/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/data3/xli74/LLM/text-generation-inference/server/text_generation_server/cli.py", line 71, in serve
    from text_generation_server import server
  File "/data3/xli74/LLM/text-generation-inference/server/text_generation_server/server.py", line 17, in <module>
    from text_generation_server.models.vlm_causal_lm import VlmCausalLMBatch
  File "/data3/xli74/LLM/text-generation-inference/server/text_generation_server/models/vlm_causal_lm.py", line 14, in <module>
    from text_generation_server.models.flash_mistral import (
  File "/data3/xli74/LLM/text-generation-inference/server/text_generation_server/models/flash_mistral.py", line 18, in <module>
    from text_generation_server.models.custom_modeling.flash_mistral_modeling import (
  File "/data3/xli74/LLM/text-generation-inference/server/text_generation_server/models/custom_modeling/flash_mistral_modeling.py", line 30, in <module>
    from text_generation_server.utils.layers import (
ImportError: cannot import name 'PositionRotaryEmbedding' from 'text_generation_server.utils.layers' (/data3/xli74/LLM/text-generation-inference/server/text_generation_server/utils/layers.py) rank=0

2024-04-30T19:23:58.118890Z ERROR text_generation_launcher: Shard 0 failed to start
2024-04-30T19:23:58.118914Z INFO text_generation_launcher: Shutting down shards
Error: ShardCannotStart

I tried both llama3 7B-instruct and Mistral, both with the same error. Any help would be greatly appreciated.

Information

  • [ ] Docker
  • [X] The CLI directly

Tasks

  • [X] An officially supported command
  • [ ] My own modifications

Reproduction

Following the documentation, with a conda environment: PyTorch 2.2.2, CUDA 12.1, gcc 10.3.1.

Running text-generation-launcher --model-id tiiuae/falcon-7b-instruct --port 8080 gives the error above.
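Note that the pip exllamav2 package installs its own extension (exllamav2_ext) rather than the exllamav2_kernels module TGI looks for; TGI bundles those kernels with its server sources. A minimal sketch of building them from the local checkout, assuming a server/exllamav2_kernels directory as in the repo layout of that period:

# Sketch, not an official install path: build TGI's bundled exllamav2 kernels
# into the active conda env. The directory name is assumed from the repo layout.
cd /data3/xli74/LLM/text-generation-inference/server/exllamav2_kernels
python setup.py install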

Expected behavior

Successfully launch.

coderaBruce avatar Apr 30 '24 19:04 coderaBruce

The same issue

anhou avatar May 01 '24 01:05 anhou

Build and install rotary and layer_norm from the flash-attn repository.
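A quick way to check whether those extensions are already present in the environment (the module names rotary_emb and dropout_layer_norm are assumptions, based on what flash-attention's csrc/rotary and csrc/layer_norm builds install):

# Each command should print nothing and exit 0 if the extension is importable.
python -c "import rotary_emb"          # built from flash-attention csrc/rotary (assumed name)
python -c "import dropout_layer_norm"  # built from flash-attention csrc/layer_norm (assumed name)

If either import fails, layers.py skips defining PositionRotaryEmbedding and FastLayerNorm, which would explain the ImportError above.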

Semihal avatar May 02 '24 07:05 Semihal

> Build and install rotary and layer_norm from the flash-attn repository.

Hi @Semihal, can you give the command to build that?

Kev1ntan avatar May 07 '24 01:05 Kev1ntan

> Build and install rotary and layer_norm from the flash-attn repository.
>
> Hi @Semihal, can you give the command to build that?

Clone the flash-attention repository at the same commit as in this makefile: https://github.com/huggingface/text-generation-inference/blob/main/server/Makefile-flash-att-v2#L7-L12

Then (a consolidated sketch follows this list):

  1. Change current dir to layer_norm (from root of flash-attention repo): cd csrc/layer_norm
  2. python setup.py build
  3. python setup.py install
  4. Same for rotary-emb: cd ../rotary
  5. python setup.py build
  6. python setup.py install
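Put together, the steps look roughly like this (a sketch, assuming a CUDA toolchain compatible with the installed PyTorch; substitute the commit actually pinned in the makefile linked above for the placeholder):

# Sketch: build the two CUDA extensions that TGI's layers.py needs.
git clone https://github.com/Dao-AILab/flash-attention
cd flash-attention
git checkout <commit-pinned-in-Makefile-flash-att-v2>   # placeholder, see the makefile link above
cd csrc/layer_norm && python setup.py build && python setup.py install
cd ../rotary && python setup.py build && python setup.py install

After installation, rotary_emb and dropout_layer_norm should import cleanly, and the PositionRotaryEmbedding / FastLayerNorm errors should disappear on the next launch.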

Semihal avatar May 07 '24 06:05 Semihal
