text-generation-inference
Using a model of type RefinedWeb to instantiate a model of type .
System Info
Running huggingface/text-generation-inference:0.8.2 on a kubernetes cluster.
2023-06-13T15:28:49.039767Z INFO text_generation_launcher: Runtime environment:
Target: x86_64-unknown-linux-gnu
Cargo version: 1.69.0
Commit sha: e7248fe90e27c7c8e39dd4cac5874eb9f96ab182
Docker label: sha-e7248fe
nvidia-smi:
Tue Jun 13 15:28:48 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17 Driver Version: 525.105.17 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100-PCI... Off | 00000000:3B:00.0 Off | 0 |
| N/A 36C P0 37W / 250W | 0MiB / 40960MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-PCI... Off | 00000000:D8:00.0 Off | 0 |
| N/A 39C P0 37W / 250W | 0MiB / 40960MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
2023-06-13T15:28:49.039864Z INFO text_generation_launcher: Args { model_id: "/data/models--tiiuae--falcon-40b/snapshots/2ac60b04625e6694fb6143c00b9f93a01c7a000f/", revision: None, sharded: None, num_shard: None, quantize: Some(Bitsandbytes), trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_input_length: 1000, max_total_tokens: 1512, max_batch_size: None, waiting_served_ratio: 1.2, max_batch_total_tokens: 32000, max_waiting_tokens: 20, port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: true, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, env: true }
Information
- [x] Docker
- [ ] The CLI directly
Tasks
- [x] An officially supported command
- [ ] My own modifications
Reproduction
Steps to reproduce:
- Run `HF_HUB_ENABLE_HF_TRANSFER=1 text-generation-server download-weights tiiuae/falcon-40b` locally
- Move the downloaded cache to a tightly sealed kubernetes cluster (onto a PVC)
- Move the contents of the tiiuae/falcon-40b repository, aside from the weights, to that PVC as well
- Contents of the folder: (screenshot omitted)
- Create kubernetes resources for running the image, and mount the volume mentioned above at `/data`. I set the following environment variables (an equivalent launcher invocation is sketched after the list):
- name: MODEL_ID
value: >-
/data/models--tiiuae--falcon-40b/snapshots/2ac60b04625e6694fb6143c00b9f93a01c7a000f/
- name: QUANTIZE
value: bitsandbytes
- name: DISABLE_CUSTOM_KERNELS
value: 'true'
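These environment variables map directly onto launcher flags (the mapping is visible in the Args line logged below). For reference, a sketch of the equivalent direct invocation, assuming the same paths:

```bash
# Sketch of the launcher call the env vars above translate to
# (flag names match the Args log line below; path as in MODEL_ID):
text-generation-launcher \
  --model-id /data/models--tiiuae--falcon-40b/snapshots/2ac60b04625e6694fb6143c00b9f93a01c7a000f/ \
  --quantize bitsandbytes \
  --disable-custom-kernels \
  --port 80
```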
After this, I get the following logs from the pod:
{"timestamp":"2023-06-13T14:39:19.320908Z","level":"INFO","fields":{"message":"Args { model_id: \"/data/models--tiiuae--falcon-40b/snapshots/2ac60b04625e6694fb6143c00b9f93a01c7a000f/\", revision: None, sharded: None, num_shard: None, quantize: Some(Bitsandbytes), trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_input_length: 1000, max_total_tokens: 1512, max_batch_size: None, waiting_served_ratio: 1.2, max_batch_total_tokens: 32000, max_waiting_tokens: 20, port: 80, shard_uds_path: \"/tmp/text-generation-server\", master_addr: \"localhost\", master_port: 29500, huggingface_hub_cache: Some(\"/data\"), weights_cache_override: None, disable_custom_kernels: true, json_output: true, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, env: false }"},"target":"text_generation_launcher"}
{"timestamp":"2023-06-13T14:39:19.320981Z","level":"INFO","fields":{"message":"Sharding model on 2 processes"},"target":"text_generation_launcher"}
{"timestamp":"2023-06-13T14:39:19.321270Z","level":"INFO","fields":{"message":"Starting download process."},"target":"text_generation_launcher"}
{"timestamp":"2023-06-13T14:39:21.469258Z","level":"INFO","fields":{"message":"Files are already present on the host. Skipping download.\n"},"target":"text_generation_launcher","span":{"name":"download"},"spans":[{"name":"download"}]}
{"timestamp":"2023-06-13T14:39:22.125552Z","level":"INFO","fields":{"message":"Successfully downloaded weights."},"target":"text_generation_launcher"}
{"timestamp":"2023-06-13T14:39:22.126276Z","level":"INFO","fields":{"message":"Starting shard 1"},"target":"text_generation_launcher"}
{"timestamp":"2023-06-13T14:39:22.126295Z","level":"INFO","fields":{"message":"Starting shard 0"},"target":"text_generation_launcher"}
{"timestamp":"2023-06-13T14:39:32.145306Z","level":"INFO","fields":{"message":"Waiting for shard 0 to be ready..."},"target":"text_generation_launcher"}
{"timestamp":"2023-06-13T14:39:32.153372Z","level":"INFO","fields":{"message":"Waiting for shard 1 to be ready..."},"target":"text_generation_launcher"}
{"timestamp":"2023-06-13T14:39:42.168074Z","level":"INFO","fields":{"message":"Waiting for shard 1 to be ready..."},"target":"text_generation_launcher"}
{"timestamp":"2023-06-13T14:39:42.169130Z","level":"INFO","fields":{"message":"Waiting for shard 0 to be ready..."},"target":"text_generation_launcher"}
{"timestamp":"2023-06-13T14:39:52.193505Z","level":"INFO","fields":{"message":"Waiting for shard 0 to be ready..."},"target":"text_generation_launcher"}
{"timestamp":"2023-06-13T14:39:52.217776Z","level":"INFO","fields":{"message":"Waiting for shard 1 to be ready..."},"target":"text_generation_launcher"}
{"timestamp":"2023-06-13T14:40:02.266260Z","level":"INFO","fields":{"message":"Waiting for shard 0 to be ready..."},"target":"text_generation_launcher"}
{"timestamp":"2023-06-13T14:40:02.276230Z","level":"INFO","fields":{"message":"Waiting for shard 1 to be ready..."},"target":"text_generation_launcher"}
{"timestamp":"2023-06-13T14:40:12.326939Z","level":"INFO","fields":{"message":"Waiting for shard 1 to be ready..."},"target":"text_generation_launcher"}
{"timestamp":"2023-06-13T14:40:12.349062Z","level":"INFO","fields":{"message":"Waiting for shard 0 to be ready..."},"target":"text_generation_launcher"}
{"timestamp":"2023-06-13T14:40:22.337891Z","level":"INFO","fields":{"message":"Waiting for shard 1 to be ready..."},"target":"text_generation_launcher"}
{"timestamp":"2023-06-13T14:40:32.348584Z","level":"INFO","fields":{"message":"Waiting for shard 1 to be ready..."},"target":"text_generation_launcher"}
{"timestamp":"2023-06-13T14:40:42.359697Z","level":"INFO","fields":{"message":"Waiting for shard 1 to be ready..."},"target":"text_generation_launcher"}
{"timestamp":"2023-06-13T14:40:52.370871Z","level":"INFO","fields":{"message":"Waiting for shard 1 to be ready..."},"target":"text_generation_launcher"}
{"timestamp":"2023-06-13T14:41:02.381893Z","level":"INFO","fields":{"message":"Waiting for shard 1 to be ready..."},"target":"text_generation_launcher"}
{"timestamp":"2023-06-13T14:41:12.392017Z","level":"INFO","fields":{"message":"Waiting for shard 1 to be ready..."},"target":"text_generation_launcher"}
{"timestamp":"2023-06-13T14:41:22.126054Z","level":"ERROR","fields":{"message":"Shard 0 failed to start:\nYou are using a model of type RefinedWeb to instantiate a model of type . This is not supported for all configurations of models and can yield errors.\n"},"target":"text_generation_launcher"}
{"timestamp":"2023-06-13T14:41:22.126114Z","level":"INFO","fields":{"message":"Shutting down shards"},"target":"text_generation_launcher"}
{"timestamp":"2023-06-13T14:41:22.331289Z","level":"INFO","fields":{"message":"Shard 1 terminated"},"target":"text_generation_launcher"}
Error: ShardCannotStart
Expected behavior
I expect the pod to start successfully.
@OlivierDehaene #448
I followed the exact approach with Falcon-7B, and it works fine on a single GPU.
Hi, just wanted to follow up on this because I believe I'm experiencing a similar issue: same error, running inference on 2x A100 80GB on RunPod, following this tutorial: https://www.youtube.com/watch?v=FhY8rx_X97k
I got this error trying to run tiiuae/falcon-40b-instruct on two A100 40GB GPUs. I ran it with these options:
singularity run --nv -B $volume:/data ./text-generation-inference_0.8.sif --model-id $model --sharded $sharded --port 8080 &
(Singularity is similar to Docker, but we don't have Docker installed.)
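For anyone running Docker instead, the rough equivalent of that Singularity line would be something like the sketch below (`--shm-size` added because NCCL wants shared memory when sharding across GPUs; `$volume`, `$model`, and `$sharded` as above):

```bash
# Rough Docker equivalent of the Singularity command above (a sketch):
docker run --gpus all --shm-size 1g -p 8080:8080 -v $volume:/data \
  ghcr.io/huggingface/text-generation-inference:0.8 \
  --model-id $model --sharded $sharded --port 8080
```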
It turns out it needed more GPUs: when I increased from two to four A100 40GB cards, it started fine and the test curl command returned a result. It might work with 3 GPUs; I haven't tested that yet.
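A back-of-envelope estimate (my arithmetic, not from this thread) makes that plausible:

```bash
# falcon-40b is ~40e9 parameters at ~2 bytes each in fp16, i.e. roughly
# 75 GiB of weights alone -- essentially the whole capacity of 2x A100
# 40GB, with nothing left for the KV cache, activations, and CUDA
# overhead. 4x A100 40GB (~160 GB total) leaves comfortable headroom.
python3 -c "print(40e9 * 2 / 2**30, 'GiB of fp16 weights')"
```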
I'm not using Kubernetes. I'm on RHEL 8.
I was getting a similar issue; then I rolled back the Docker image to an older version and the model started working.
Image where its working: ghcr.io/huggingface/text-generation-inference@sha256:f4e09f01c1dd38bc2e9c9a66e9de1c2e3dc9912c2781440f7ac1eb70f6b1479e
Model: tiiuae/falcon-7b-instruct NUM_SHARD: 1
No quantization. Hardware: 1xA100 20Gi
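If someone wants to try the same rollback, pulling by digest pins that exact image:

```bash
# Pin the reported known-good image by digest (digest from the comment above):
docker pull ghcr.io/huggingface/text-generation-inference@sha256:f4e09f01c1dd38bc2e9c9a66e9de1c2e3dc9912c2781440f7ac1eb70f6b1479e
```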
I have a similar error since v1.1.0 with one A100 80GB GPU when I start TGI with the following environment variables:
- name: MODEL_ID
value: tiiuae/falcon-7b-instruct
- name: QUANTIZE
value: eetq
If I set the quantization to bitsandbytes it works fine. This also happens with the larger tiiuae/falcon-40b-instruct.
Error log:
{"timestamp":"2023-09-28T15:26:38.040585Z","level":"ERROR","fields":{"message":"Shard complete
standard error output:\n\nYou are using a model of type RefinedWebModel to instantiate a model
of type . This is not supported for all configurations of models and can yield errors.\nTraceback [...]
@pdeubel, this should only be a warning. Can you provide the whole stacktrace?
Ah yes, sorry, I actually did not look at the whole stacktrace; it seems eetq is not installed. I run TGI on Kubernetes, i.e. I am using your Docker image. Perhaps something is missing regarding the installation of eetq in the image?
Whole stacktrace:
{"timestamp":"2023-09-28T15:26:24.924474Z","level":"INFO","fields":{"message":"Args { model_id: \"tiiuae/falcon-7b-instruct\", revision: None, validation_workers: 2, sharded: None, num_shard: None, quantize: Some(Eetq), dtype: None, trust_remote_code: false, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_top_n_tokens: 5, max_input_length: 1024, max_total_tokens: 2048, waiting_served_ratio: 1.2, max_batch_prefill_tokens: 4096, max_batch_total_tokens: None, max_waiting_tokens: 20, hostname: "...", port: 80, shard_uds_path: \"/tmp/text-generation-server\", master_addr: \"localhost\", master_port: 29500, huggingface_hub_cache: Some(\"/data\"), weights_cache_override: None, disable_custom_kernels: false, cuda_memory_fraction: 1.0, rope_scaling: None, rope_factor: None, json_output: true, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, ngrok: false, ngrok_authtoken: None, ngrok_edge: None, env: false }"},"target":"text_generation_launcher"}
{"timestamp":"2023-09-28T15:26:24.924584Z","level":"INFO","fields":{"message":"Starting download process."},"target":"text_generation_launcher","span":{"name":"download"},"spans":[{"name":"download"}]}
{"timestamp":"2023-09-28T15:26:27.656490Z","level":"INFO","fields":{"message":"Files are already present on the host. Skipping download.\n"},"target":"text_generation_launcher"}
{"timestamp":"2023-09-28T15:26:28.028217Z","level":"INFO","fields":{"message":"Successfully downloaded weights."},"target":"text_generation_launcher","span":{"name":"download"},"spans":[{"name":"download"}]}
{"timestamp":"2023-09-28T15:26:28.028503Z","level":"INFO","fields":{"message":"Starting shard"},"target":"text_generation_launcher","span":{"rank":0,"name":"shard-manager"},"spans":[{"rank":0,"name":"shard-manager"}]}
{"timestamp":"2023-09-28T15:26:37.383475Z","level":"ERROR","fields":{"message":"Error when initializing model\nTraceback (most recent call last):\n File \"/opt/conda/bin/text-generation-server\", line 8, in <module>\n sys.exit(app())\n File \"/opt/conda/lib/python3.9/site-packages/typer/main.py\", line 311, in __call__\n return get_command(self)(*args, **kwargs)\n File \"/opt/conda/lib/python3.9/site-packages/click/core.py\", line 1157, in __call__\n return self.main(*args, **kwargs)\n File \"/opt/conda/lib/python3.9/site-packages/typer/core.py\", line 778, in main\n return _main(\n File \"/opt/conda/lib/python3.9/site-packages/typer/core.py\", line 216, in _main\n rv = self.invoke(ctx)\n File \"/opt/conda/lib/python3.9/site-packages/click/core.py\", line 1688, in invoke\n return _process_result(sub_ctx.command.invoke(sub_ctx))\n File \"/opt/conda/lib/python3.9/site-packages/click/core.py\", line 1434, in invoke\n return ctx.invoke(self.callback, **ctx.params)\n File \"/opt/conda/lib/python3.9/site-packages/click/core.py\", line 783, in invoke\n return __callback(*args, **kwargs)\n File \"/opt/conda/lib/python3.9/site-packages/typer/main.py\", line 683, in wrapper\n return callback(**use_params) # type: ignore\n File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py\", line 83, in serve\n server.serve(\n File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py\", line 207, in serve\n asyncio.run(\n File \"/opt/conda/lib/python3.9/asyncio/runners.py\", line 44, in run\n return loop.run_until_complete(main)\n File \"/opt/conda/lib/python3.9/asyncio/base_events.py\", line 634, in run_until_complete\n self.run_forever()\n File \"/opt/conda/lib/python3.9/asyncio/base_events.py\", line 601, in run_forever\n self._run_once()\n File \"/opt/conda/lib/python3.9/asyncio/base_events.py\", line 1905, in _run_once\n handle._run()\n File \"/opt/conda/lib/python3.9/asyncio/events.py\", line 80, in _run\n self._context.run(self._callback, *self._args)\n> File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py\", line 159, in serve_inner\n model = get_model(\n File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py\", line 234, in get_model\n return FlashRWSharded(\n File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_rw.py\", line 67, in __init__\n model = FlashRWForCausalLM(config, weights)\n File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py\", line 625, in __init__\n self.transformer = FlashRWModel(config, weights)\n File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py\", line 567, in __init__\n [\n File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py\", line 568, in <listcomp>\n FlashRWLayer(layer_id, config, weights)\n File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py\", line 396, in __init__\n self.self_attention = FlashRWAttention(\n File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py\", line 147, in __init__\n self.query_key_value = TensorParallelColumnLinear.load(\n File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/layers.py\", line 436, in load\n return cls.load_multi(config, [prefix], weights, bias, dim=0)\n File 
\"/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/layers.py\", line 449, in load_multi\n linear = get_linear(weight, bias, config.quantize)\n File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/layers.py\", line 280, in get_linear\n raise ImportError(\nImportError: Please install EETQ from https://github.com/NetEase-FuXi/EETQ\n"},"target":"text_generation_launcher"}
{"timestamp":"2023-09-28T15:26:38.040585Z","level":"ERROR","fields":{"message":"Shard complete standard error output:\n\nYou are using a model of type RefinedWebModel to instantiate a model of type . This is not supported for all configurations of models and can yield errors.\nTraceback (most recent call last):\n\n File \"/opt/conda/bin/text-generation-server\", line 8, in <module>\n sys.exit(app())\n\n File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py\", line 83, in serve\n server.serve(\n\n File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py\", line 207, in serve\n asyncio.run(\n\n File \"/opt/conda/lib/python3.9/asyncio/runners.py\", line 44, in run\n return loop.run_until_complete(main)\n\n File \"/opt/conda/lib/python3.9/asyncio/base_events.py\", line 647, in run_until_complete\n return future.result()\n\n File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py\", line 159, in serve_inner\n model = get_model(\n\n File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py\", line 234, in get_model\n return FlashRWSharded(\n\n File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/models/flash_rw.py\", line 67, in __init__\n model = FlashRWForCausalLM(config, weights)\n\n File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py\", line 625, in __init__\n self.transformer = FlashRWModel(config, weights)\n\n File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py\", line 567, in __init__\n [\n\n File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py\", line 568, in <listcomp>\n FlashRWLayer(layer_id, config, weights)\n\n File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py\", line 396, in __init__\n self.self_attention = FlashRWAttention(\n\n File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/models/custom_modeling/flash_rw_modeling.py\", line 147, in __init__\n self.query_key_value = TensorParallelColumnLinear.load(\n\n File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/layers.py\", line 436, in load\n return cls.load_multi(config, [prefix], weights, bias, dim=0)\n\n File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/layers.py\", line 449, in load_multi\n linear = get_linear(weight, bias, config.quantize)\n\n File \"/opt/conda/lib/python3.9/site-packages/text_generation_server/utils/layers.py\", line 280, in get_linear\n raise ImportError(\n\nImportError: Please install EETQ from https://github.com/NetEase-FuXi/EETQ\n"},"target":"text_generation_launcher","span":{"rank":0,"name":"shard-manager"},"spans":[{"rank":0,"name":"shard-manager"}]}
Error: ShardCannotStart
{"timestamp":"2023-09-28T15:26:38.137802Z","level":"ERROR","fields":{"message":"Shard 0 failed to start"},"target":"text_generation_launcher"}
{"timestamp":"2023-09-28T15:26:38.137840Z","level":"INFO","fields":{"message":"Shutting down shards"},"target":"text_generation_launcher"}
EETQ is missing from the Docker image, my bad on this: https://github.com/huggingface/text-generation-inference/pull/1081
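Until that fix ships, a quick probe can confirm whether a given image carries the eetq package at all (a sketch; the tag here is an example, substitute the one you deploy):

```bash
# Check whether the eetq Python package is importable inside a TGI image:
docker run --rm --entrypoint python3 \
  ghcr.io/huggingface/text-generation-inference:1.1.0 \
  -c "import eetq; print(eetq.__file__)"
```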