
CohereForAI/c4ai-command-r-plus-4bit deployment fails on Inference Endpoint

Open h4gen opened this issue 1 year ago • 2 comments

System Info

[screenshot of the Inference Endpoint configuration / system info]

Information

  • [ ] Docker
  • [ ] The CLI directly

Tasks

  • [X] An officially supported command
  • [ ] My own modifications

Reproduction

  1. Go to the model page for CohereForAI/c4ai-command-r-plus-4bit
  2. Deploy on Inference Endpoint
  3. Use standard settings (see picture)
  4. Wait forever
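
For reference, the same deployment can also be scripted with huggingface_hub instead of the UI. This is only a sketch of the repro; the vendor/region/instance values below are placeholders standing in for whatever the "standard settings" in the screenshot resolve to, not values I have verified:

# Sketch of the UI deployment flow via huggingface_hub (placeholder instance values).
from huggingface_hub import create_inference_endpoint

endpoint = create_inference_endpoint(
    "c4ai-command-r-plus-4bit",
    repository="CohereForAI/c4ai-command-r-plus-4bit",
    framework="pytorch",
    task="text-generation",
    accelerator="gpu",
    vendor="aws",                    # placeholder cloud provider
    region="us-east-1",              # placeholder region
    instance_type="nvidia-a100",     # placeholder; use whatever the UI defaults map to
    instance_size="x1",              # placeholder size
    type="protected",
)
endpoint.wait()  # never becomes healthy with this model ("wait forever" above)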

See this output:

2024/04/23 15:46:10 ~ {"timestamp":"2024-04-23T13:46:10.427626Z","level":"INFO","fields":{"message":"Args { model_id: \"/repository\", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: Some(Bitsandbytes), speculate: None, dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_tokens: None, max_input_length: Some(1024), max_total_tokens: Some(1512), waiting_served_ratio: 1.2, max_batch_prefill_tokens: Some(2048), max_batch_total_tokens: None, max_waiting_tokens: 20, max_batch_size: None, cuda_graphs: None, hostname: \"r-h4g3n-c4ai-command-r-plus-4bit-qts-a1e5dl6j-cfd73-rk5q9\", port: 80, shard_uds_path: \"/tmp/text-generation-server\", master_addr: \"localhost\", master_port: 29500, huggingface_hub_cache: Some(\"/data\"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: true, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, tokenizer_config_path: None, disable_grammar_support: false, env: false, max_client_batch_size: 4 }"},"target":"text_generation_launcher"}
2024/04/23 15:46:10 ~ {"timestamp":"2024-04-23T13:46:10.427704Z","level":"INFO","fields":{"message":"Model supports up to 8192 but tgi will now set its default to 4096 instead. This is to save VRAM by refusing large prompts in order to allow more users on the same hardware. You can increase that size using `--max-batch-prefill-tokens=8242 --max-total-tokens=8192 --max-input-tokens=8191`."},"target":"text_generation_launcher"}
2024/04/23 15:46:10 ~ {"timestamp":"2024-04-23T13:46:10.427711Z","level":"INFO","fields":{"message":"Bitsandbytes doesn't work with cuda graphs, deactivating them"},"target":"text_generation_launcher"}
2024/04/23 15:46:10 ~ {"timestamp":"2024-04-23T13:46:10.427793Z","level":"INFO","fields":{"message":"Starting download process."},"target":"text_generation_launcher","span":{"name":"download"},"spans":[{"name":"download"}]}
2024/04/23 15:46:14 ~ {"timestamp":"2024-04-23T13:46:14.728238Z","level":"INFO","fields":{"message":"Files are already present on the host. Skipping download.\n"},"target":"text_generation_launcher"}
2024/04/23 15:46:15 ~ {"timestamp":"2024-04-23T13:46:15.432829Z","level":"INFO","fields":{"message":"Successfully downloaded weights."},"target":"text_generation_launcher","span":{"name":"download"},"spans":[{"name":"download"}]}
2024/04/23 15:46:15 ~ {"timestamp":"2024-04-23T13:46:15.433034Z","level":"INFO","fields":{"message":"Starting shard"},"target":"text_generation_launcher","span":{"rank":0,"name":"shard-manager"},"spans":[{"rank":0,"name":"shard-manager"}]}
2024/04/23 15:46:22 ~ {"timestamp":"2024-04-23T13:46:22.644366Z","level":"ERROR","fields":{"message":"Error when initializing model\nTraceback (most recent call last):\n File \"/opt/conda/bin/text-generation-server\", line 8, in <module>\n sys.exit(app())\n File \"/opt/conda/lib/python3.10/site-packages/typer/main.py\", line 311, in __call__\n return get_command(self)(*args, **kwargs)\n File \"/opt/conda/lib/python3.10/site-packages/click/core.py\", line 1157, in __call__\n return self.main(*args, **kwargs)\n File \"/opt/conda/lib/python3.10/site-packages/typer/core.py\", line 778, in main\n return _main(\n File \"/opt/conda/lib/python3.10/site-packages/typer/core.py\", line 216, in _main\n rv = self.invoke(ctx)\n File \"/opt/conda/lib/python3.10/site-packages/click/core.py\", line 1688, in invoke\n return _process_result(sub_ctx.command.invoke(sub_ctx))\n File \"/opt/conda/lib/python3.10/site-packages/click/core.py\", line 1434, in invoke\n return ctx.invoke(self.callback, **ctx.params)\n File \"/opt/conda/lib/python3.10/site-packages/click/core.py\", line 783, in invoke\n return __callback(*args, **kwargs)\n File \"/opt/conda/lib/python3.10/site-packages/typer/main.py\", line 683, in wrapper\n return callback(**use_params) # type: ignore\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py\", line 90, in serve\n server.serve(\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py\", line 240, in serve\n asyncio.run(\n File \"/opt/conda/lib/python3.10/asyncio/runners.py\", line 44, in run\n return loop.run_until_complete(main)\n File \"/opt/conda/lib/python3.10/asyncio/base_events.py\", line 636, in run_until_complete\n self.run_forever()\n File \"/opt/conda/lib/python3.10/asyncio/base_events.py\", line 603, in run_forever\n self._run_once()\n File \"/opt/conda/lib/python3.10/asyncio/base_events.py\", line 1909, in _run_once\n handle._run()\n File \"/opt/conda/lib/python3.10/asyncio/events.py\", line 80, in _run\n self._context.run(self._callback, *self._args)\n> File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py\", line 201, in serve_inner\n model = get_model(\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py\", line 375, in get_model\n return FlashCohere(\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_cohere.py\", line 61, in __init__\n model = FlashCohereForCausalLM(config, weights)\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_cohere_modeling.py\", line 482, in __init__\n self.model = FlashCohereModel(config, weights)\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_cohere_modeling.py\", line 420, in __init__\n [\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_cohere_modeling.py\", line 421, in <listcomp>\n FlashCohereLayer(\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_cohere_modeling.py\", line 360, in __init__\n self.self_attn = FlashCohereAttention(\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_cohere_modeling.py\", line 217, in __init__\n self.query_key_value = load_attention(config, prefix, weights)\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_cohere_modeling.py\", line 140, in load_attention\n return 
_load_gqa(config, prefix, weights)\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_cohere_modeling.py\", line 167, in _load_gqa\n assert list(weight.shape) == [\nAssertionError: [88080384, 1] != [14336, 12288]\n"},"target":"text_generation_launcher"}
2024/04/23 15:46:24 ~ {"timestamp":"2024-04-23T13:46:24.041667Z","level":"ERROR","fields":{"message":"Shard complete standard error output:\n\nSpecial tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.\nTraceback (most recent call last):\n\n File \"/opt/conda/bin/text-generation-server\", line 8, in <module>\n sys.exit(app())\n\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py\", line 90, in serve\n server.serve(\n\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py\", line 240, in serve\n asyncio.run(\n\n File \"/opt/conda/lib/python3.10/asyncio/runners.py\", line 44, in run\n return loop.run_until_complete(main)\n\n File \"/opt/conda/lib/python3.10/asyncio/base_events.py\", line 649, in run_until_complete\n return future.result()\n\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py\", line 201, in serve_inner\n model = get_model(\n\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py\", line 375, in get_model\n return FlashCohere(\n\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_cohere.py\", line 61, in __init__\n model = FlashCohereForCausalLM(config, weights)\n\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_cohere_modeling.py\", line 482, in __init__\n self.model = FlashCohereModel(config, weights)\n\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_cohere_modeling.py\", line 420, in __init__\n [\n\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_cohere_modeling.py\", line 421, in <listcomp>\n FlashCohereLayer(\n\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_cohere_modeling.py\", line 360, in __init__\n self.self_attn = FlashCohereAttention(\n\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_cohere_modeling.py\", line 217, in __init__\n self.query_key_value = load_attention(config, prefix, weights)\n\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_cohere_modeling.py\", line 140, in load_attention\n return _load_gqa(config, prefix, weights)\n\n File \"/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_cohere_modeling.py\", line 167, in _load_gqa\n assert list(weight.shape) == [\n\nAssertionError: [88080384, 1] != [14336, 12288]\n"},"target":"text_generation_launcher","span":{"rank":0,"name":"shard-manager"},"spans":[{"rank":0,"name":"shard-manager"}]}
2024/04/23 15:46:24 ~ {"timestamp":"2024-04-23T13:46:24.140326Z","level":"ERROR","fields":{"message":"Shard 0 failed to start"},"target":"text_generation_launcher"}
2024/04/23 15:46:24 ~ {"timestamp":"2024-04-23T13:46:24.140367Z","level":"INFO","fields":{"message":"Shutting down shards"},"target":"text_generation_launcher"}
2024/04/23 15:46:24 ~ Error: ShardCannotStart

It looks like the flash attention mechanism is not working properly.
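
For what it's worth, the shapes in the AssertionError are at least consistent with the checkpoint's weights arriving pre-packed in 4-bit rather than with a flash-attention failure: _load_gqa expects the fused QKV weight as a [14336, 12288] matrix, and packing that many values at two 4-bit weights per byte gives exactly the 88080384 rows the loader actually sees. A quick arithmetic check, assuming only the numbers printed in the log above:

# Arithmetic on the shapes reported in the AssertionError (illustrative, not TGI code).
expected_rows, expected_cols = 14336, 12288   # shape _load_gqa asserts for
n_values = expected_rows * expected_cols      # 176,160,768 weight values
packed_4bit_bytes = n_values // 2             # two 4-bit values per byte
print(packed_4bit_bytes)                      # 88080384 -> matches the loaded [88080384, 1] tensor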

Expected behavior

The endpoint starts up and serves the model normally.

Note: I have not been able to get any Cohere model running on any hardware configuration.

h4gen · Apr 23 '24 13:04

Fails the same way for me on a GCP VM with

docker run --gpus all --shm-size 1g -p 8888:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:2.0 --model-id $model --speculate 3 --num-shard 2

File "/opt/conda/bin/text-generation-server", line 8, in sys.exit(app())

File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 90, in serve server.serve(

File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 240, in serve asyncio.run(

File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run return loop.run_until_complete(main)

File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete return future.result()

File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 201, in serve_inner model = get_model(

File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/init.py", line 375, in get_model return FlashCohere(

File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/flash_cohere.py", line 61, in init model = FlashCohereForCausalLM(config, weights)

File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_cohere_modeling.py", line 482, in init self.model = FlashCohereModel(config, weights)

File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_cohere_modeling.py", line 420, in init [

File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_cohere_modeling.py", line 421, in FlashCohereLayer(

File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_cohere_modeling.py", line 360, in init self.self_attn = FlashCohereAttention(

File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_cohere_modeling.py", line 217, in init self.query_key_value = load_attention(config, prefix, weights)

File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_cohere_modeling.py", line 140, in load_attention return _load_gqa(config, prefix, weights)

File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/custom_modeling/flash_cohere_modeling.py", line 167, in _load_gqa assert list(weight.shape) == [

AssertionError: [44040192, 1] != [8192, 12288] rank=1
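
As a sanity check on the checkpoint itself (illustrative snippet, not part of TGI), the repo's config can be inspected to confirm it ships bitsandbytes 4-bit quantization metadata, which would fit the packed 1-D weight tensors the loader is complaining about:

# Assumes a recent transformers release with Cohere support; not TGI code.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("CohereForAI/c4ai-command-r-plus-4bit")
print(getattr(cfg, "quantization_config", None))
# Expected to show bitsandbytes 4-bit settings (e.g. load_in_4bit=True),
# i.e. the safetensors already contain pre-quantized, packed weights.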

xstraven · Apr 23 '24 16:04

Same issue:

2024-05-03T17:11:35.945462Z INFO text_generation_launcher: Unknown quantization method bitsandbytes

backroom-coder · May 03 '24 17:05

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] · Jun 03 '24 01:06