
NotImplementedError: Sharded RefinedWeb requires Flash Attention CUDA kernels to be installed. Falcon-40b-instruct

Open shshnk158 opened this issue 1 year ago • 5 comments

System Info

```
2023-06-15T16:56:34.095240Z  INFO text_generation_launcher: Runtime environment:
Target: x86_64-unknown-linux-gnu
Cargo version: 1.69.0
Commit sha: e7248fe90e27c7c8e39dd4cac5874eb9f96ab182
Docker label: sha-e7248fe
nvidia-smi:
Thu Jun 15 16:56:34 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.105.01   Driver Version: 515.105.01   CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
|  0%   28C    P8    31W / 350W |      5MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  Off  | 00000000:05:00.0 Off |                  N/A |
|  0%   29C    P8    29W / 350W |      5MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce ...  Off  | 00000000:85:00.0 Off |                  N/A |
|  0%   32C    P8    39W / 350W |      5MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA GeForce ...  Off  | 00000000:89:00.0 Off |                  N/A |
|  0%   31C    P8    35W / 350W |      5MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
```

```
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
2023-06-15T16:56:34.095280Z  INFO text_generation_launcher: Args { model_id: "tiiuae/falcon-40b-instruct", revision: None, sharded: Some(true), num_shard: Some(4), quantize: Some(Bitsandbytes), trust_remote_code: true, max_concurrent_requests: 128, max_best_of: 2, max_stop_sequences: 4, max_input_length: 1000, max_total_tokens: 1512, max_batch_size: None, waiting_served_ratio: 1.2, max_batch_total_tokens: 32000, max_waiting_tokens: 20, port: 80, shard_uds_path: "/tmp/text-generation-server", master_addr: "localhost", master_port: 29500, huggingface_hub_cache: Some("/data"), weights_cache_override: None, disable_custom_kernels: true, json_output: false, otlp_endpoint: None, cors_allow_origin: [], watermark_gamma: None, watermark_delta: None, env: true }
2023-06-15T16:56:34.095307Z  INFO text_generation_launcher: Sharding model on 4 processes
2023-06-15T16:56:34.095499Z  INFO text_generation_launcher: Starting download process.
2023-06-15T16:56:37.249246Z  INFO download: text_generation_launcher: Files are already present on the host. Skipping download.
```
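(Aside for anyone scanning these logs: the `Args { … }` line is Rust debug output packed onto one line. A small sketch of pulling the numeric fields out of it with a regex; `args_line` below is an abbreviated copy of the log line above, not the full thing.)

```python
import re

# Abbreviated copy of the launcher's Args debug line from the log above
args_line = ('Args { model_id: "tiiuae/falcon-40b-instruct", num_shard: Some(4), '
             'max_input_length: 1000, max_total_tokens: 1512, max_batch_total_tokens: 32000 }')

# Capture `key: value` pairs where the value is a bare integer or Some(integer)
fields = {k: int(v) for k, v in re.findall(r"(\w+): (?:Some\()?(\d+)\)?", args_line)}

print(fields["max_input_length"], fields["max_total_tokens"])  # → 1000 1512
```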

```
2023-06-15T16:56:37.800552Z  INFO text_generation_launcher: Successfully downloaded weights.
2023-06-15T16:56:37.800598Z  WARN text_generation_launcher: trust_remote_code is set. Trusting that model tiiuae/falcon-40b-instruct do not contain malicious code.
2023-06-15T16:56:37.800609Z  WARN text_generation_launcher: Explicitly passing a revision is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
2023-06-15T16:56:37.800830Z  INFO text_generation_launcher: Starting shard 0
2023-06-15T16:56:37.800944Z  INFO text_generation_launcher: Starting shard 1
2023-06-15T16:56:37.801361Z  INFO text_generation_launcher: Starting shard 2
2023-06-15T16:56:37.801377Z  INFO text_generation_launcher: Starting shard 3
2023-06-15T16:56:41.497407Z ERROR shard-manager: text_generation_launcher: Error when initializing model
Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 778, in main
    return _main(
  File "/opt/conda/lib/python3.9/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.9/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 67, in serve
    server.serve(model_id, revision, sharded, quantize, trust_remote_code, uds_path)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 155, in serve
    asyncio.run(serve_inner(model_id, revision, sharded, quantize, trust_remote_code))
  File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 634, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 601, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 1905, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.9/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 124, in serve_inner
    model = get_model(model_id, revision, sharded, quantize, trust_remote_code)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py", line 215, in get_model
    raise NotImplementedError(
NotImplementedError: Sharded RefinedWeb requires Flash Attention CUDA kernels to be installed. Use the official Docker image (ghcr.io/huggingface/text-generation-inference:latest) or install flash attention with `cd server && make install install-flash-attention` rank=0
```

(The same traceback is repeated verbatim for rank=1, rank=2, and rank=3 at 16:56:41.854643Z, 16:56:41.857280Z, and 16:56:41.871194Z; omitted here.)

```
2023-06-15T16:56:42.105200Z ERROR text_generation_launcher: Shard 0 failed to start:
/opt/conda/lib/python3.9/site-packages/torch/cuda/__init__.py:107: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW (Triggered internally at /opt/conda/conda-bld/pytorch_1678402412426/work/c10/cuda/CUDAFunctions.cpp:109.)
  return torch._C._cuda_getDeviceCount() > 0
/opt/conda/lib/python3.9/site-packages/bitsandbytes/cextension.py:33: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
  warn("The installed version of bitsandbytes was compiled without GPU support. "
The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/cli.py", line 67, in serve
    server.serve(model_id, revision, sharded, quantize, trust_remote_code, uds_path)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 155, in serve
    asyncio.run(serve_inner(model_id, revision, sharded, quantize, trust_remote_code))
  File "/opt/conda/lib/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
    return future.result()
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/server.py", line 124, in serve_inner
    model = get_model(model_id, revision, sharded, quantize, trust_remote_code)
  File "/opt/conda/lib/python3.9/site-packages/text_generation_server/models/__init__.py", line 215, in get_model
    raise NotImplementedError(
NotImplementedError: Sharded RefinedWeb requires Flash Attention CUDA kernels to be installed. Use the official Docker image (ghcr.io/huggingface/text-generation-inference:latest) or install flash attention with `cd server && make install install-flash-attention`

2023-06-15T16:56:42.105238Z  INFO text_generation_launcher: Shutting down shards
2023-06-15T16:56:42.139795Z  INFO text_generation_launcher: Shard 1 terminated
2023-06-15T16:56:42.139858Z  INFO text_generation_launcher: Shard 2 terminated
2023-06-15T16:56:42.140221Z  INFO text_generation_launcher: Shard 3 terminated
```
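For context, the `NotImplementedError` is raised by a guard in `text_generation_server/models/__init__.py` before any weights are loaded. The sketch below is a simplified, hypothetical reduction of that guard, not TGI's exact code: sharded RefinedWeb/Falcon loading is refused unless the Flash Attention CUDA kernels imported cleanly.

```python
# Hypothetical simplification of TGI's get_model guard: in the real server,
# FLASH_ATTENTION is True only when the flash-attention CUDA kernels import
# successfully at startup (which requires a working CUDA runtime).
FLASH_ATTENTION = False

def get_model(model_id: str, sharded: bool) -> str:
    if sharded and not FLASH_ATTENTION:
        raise NotImplementedError(
            "Sharded RefinedWeb requires Flash Attention CUDA kernels to be installed."
        )
    return f"loaded {model_id}"
```

This also explains why the error appears even when flash attention is installed: if CUDA itself fails to initialize (as in the Error 804 warning above), the kernel import fails and the guard trips.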

I have tried the latest Docker image as well; same issue.

Information

  • [X] Docker
  • [ ] The CLI directly

Tasks

  • [X] An officially supported command
  • [ ] My own modifications

Reproduction

```shell
docker run --gpus all --shm-size 1g -p 8080:80 -v /data/cache/hub:/data ghcr.io/huggingface/text-generation-inference:0.8 --model-id tiiuae/falcon-40b-instruct --sharded true --num-shard 4 --env --trust-remote-code --quantize bitsandbytes --disable-custom-kernels
```
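A rough back-of-envelope check that this model can fit on the 4×24 GiB cards shown in nvidia-smi (a sketch only; it counts weights alone and ignores the KV cache, activations, and bitsandbytes overhead):

```python
# Back-of-envelope VRAM estimate for sharded Falcon-40B with 8-bit weights.
params = 40e9            # approximate Falcon-40B parameter count
bytes_per_param = 1.0    # 8-bit weights via --quantize bitsandbytes
num_shards = 4           # --num-shard 4

weights_gib_per_gpu = params * bytes_per_param / num_shards / 2**30
print(f"~{weights_gib_per_gpu:.1f} GiB of weights per 24 GiB GPU")
```

So the weights alone take roughly 9-10 GiB per GPU, which is why the 8-bit sharded setup is plausible on these cards once the driver issue is fixed.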

Expected behavior

TGI starts and serves Falcon-40b-instruct.

shshnk158 · Jun 15 '23 17:06

If I run on one shard I'm getting this error:

```
2023-06-15T17:59:57.832269Z  INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-06-15T18:00:08.813178Z  INFO text_generation_launcher: Waiting for shard 0 to be ready...
2023-06-15T18:00:12.813387Z ERROR text_generation_launcher: Shard 0 failed to start:
/opt/conda/lib/python3.9/site-packages/torch/cuda/__init__.py:107: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW (Triggered internally at /opt/conda/conda-bld/pytorch_1678402412426/work/c10/cuda/CUDAFunctions.cpp:109.)
  return torch._C._cuda_getDeviceCount() > 0
/opt/conda/lib/python3.9/site-packages/bitsandbytes/cextension.py:33: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
  warn("The installed version of bitsandbytes was compiled without GPU support. "
The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
```

shshnk158 · Jun 15 '23 18:06

I am getting the same error for falcon-7b

Command:

```shell
docker run --gpus all --shm-size 1g -p 9010:80 -v $PWD/data:/data ghcr.io/huggingface/text-generation-inference:latest --model-id tiiuae/falcon-7b --num-shard 2
```

Error: NotImplementedError: Sharded RefinedWeb requires Flash Attention CUDA kernels to be installed. Use the official Docker image (ghcr.io/huggingface/text-generation-inference:latest) or install flash attention with `cd server && make install install-flash-attention` rank=1

Expected behavior: TGI starts and serves Falcon-7b.

collarad · Jun 16 '23 02:06

Same error here, and when I try to build flash attention it literally takes days...

jgcb00 · Jun 16 '23 08:06

In your stacktrace:

UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW

Is `torch.cuda.is_available()` working outside of the container? It seems that there is something wrong with your drivers.

OlivierDehaene · Jun 16 '23 14:06

Hey @OlivierDehaene

```
Python 3.10.8 (main, Nov 24 2022, 14:13:03) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
True
```

It's True; torch is able to recognize CUDA.
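(That interactive check can be wrapped in a small guarded helper, useful for running the same diagnostic on the host and inside the container. `cuda_status` is a hypothetical name, not part of torch or TGI; it degrades gracefully where torch is missing.)

```python
def cuda_status() -> str:
    """Report whether torch sees a usable CUDA device, without crashing
    in environments where torch itself is not installed."""
    try:
        import torch
    except ImportError:
        return "torch not installed"
    return "available" if torch.cuda.is_available() else "unavailable"

print(cuda_status())
```

Comparing the output on the host (`available` here) with the output inside the container is a quick way to localize a driver/toolkit mismatch like the Error 804 above.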

shshnk158 · Jun 16 '23 15:06

Resolved the issue by uninstalling the drivers and installing them again.

Installed the NVIDIA Container Toolkit: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#setting-up-nvidia-container-toolkit

Configured path variables in ~/.bashrc:

```shell
echo 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64' >> ~/.bashrc
echo 'export CUDA_HOME=/usr/local/cuda/' >> ~/.bashrc
echo 'export PATH=$PATH:$CUDA_HOME/bin' >> ~/.bashrc
```

and rebooted the system. It works.

Falcon-40b-instruct (8-bit quantized) takes about 170 ms per token.
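For reference, converting that per-token latency into throughput is simple arithmetic:

```python
# 170 ms per token, as reported above
latency_s = 0.170
tokens_per_second = 1 / latency_s
print(f"{tokens_per_second:.1f} tokens/s")  # → 5.9 tokens/s
```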

shshnk158 · Jun 20 '23 02:06