Can't run Mistral quantized on T4
System Info
```
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla T4                       Off |   00000001:00:00.0 Off |                  Off |
| N/A   28C    P0             24W /   70W |       0MiB /  16384MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
```
```shell
docker run --runtime nvidia --gpus all --ipc=host -p 8080:80 \
    -v $PWD/data:/data \
    ghcr.io/predibase/lorax:latest \
    --model-id mistralai/Mistral-7B-v0.1 \
    --quantize bitsandbytes-nf4
```
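For reference: FlashAttention v2 only ships kernels for Ampere (compute capability 8.0) and newer, while the T4 is Turing (7.5). A quick way to double-check the capability from inside the container (a minimal sketch; it assumes nothing beyond PyTorch, which the image already bundles):

```python
import torch

# A Tesla T4 reports (7, 5); FlashAttention v2 requires (8, 0) or newer.
major, minor = torch.cuda.get_device_capability(0)
print(f"Compute capability: {major}.{minor}")
print("Flash Attention v2 supported:", (major, minor) >= (8, 0))
```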
Information
- [X] Docker
- [ ] The CLI directly
Tasks
- [X] An officially supported command
- [ ] My own modifications
Reproduction
I'm simply trying to run mistralai/Mistral-7B-v0.1 with 4-bit quantization (bitsandbytes-nf4) on my T4! However, it errors out with "Mistral model requires flash attn v2":

```
2024-04-16T14:50:32.809986Z INFO download: lorax_launcher: Successfully downloaded weights.
2024-04-16T14:50:32.810173Z INFO shard-manager: lorax_launcher: Starting shard rank=0
2024-04-16T14:50:38.132469Z WARN lorax_launcher: flash_attn.py:48 Unable to use Flash Attention V2: GPU with CUDA capability 7 5 is not supported for Flash Attention V2
2024-04-16T14:50:38.267395Z ERROR lorax_launcher: server.py:271 Error when initializing model
Traceback (most recent call last):
File "/opt/conda/bin/lorax-server", line 8, in <module>
sys.exit(app())
File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 311, in __call__
return get_command(self)(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 778, in main
return _main(
File "/opt/conda/lib/python3.10/site-packages/typer/core.py", line 216, in _main
rv = self.invoke(ctx)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/typer/main.py", line 683, in wrapper
return callback(**use_params) # type: ignore
File "/opt/conda/lib/python3.10/site-packages/lorax_server/cli.py", line 89, in serve
server.serve(
File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 321, in serve
asyncio.run(
File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
self.run_forever()
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
self._run_once()
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
handle._run()
File "/opt/conda/lib/python3.10/asyncio/events.py", line 80, in _run
self._context.run(self._callback, *self._args)
> File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 267, in serve_inner
model = get_model(
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/__init__.py", line 179, in get_model
from lorax_server.models.flash_mistral import FlashMistral
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/flash_mistral.py", line 10, in <module>
from lorax_server.models.custom_modeling.flash_mistral_modeling import (
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/custom_modeling/flash_mistral_modeling.py", line 49, in <module>
raise ImportError("Mistral model requires flash attn v2")
ImportError: Mistral model requires flash attn v2
2024-04-16T14:50:39.215837Z ERROR shard-manager: lorax_launcher: Shard complete standard error output:
Traceback (most recent call last):
File "/opt/conda/bin/lorax-server", line 8, in <module>
sys.exit(app())
File "/opt/conda/lib/python3.10/site-packages/lorax_server/cli.py", line 89, in serve
server.serve(
File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 321, in serve
asyncio.run(
File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
return loop.run_until_complete(main)
File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
return future.result()
File "/opt/conda/lib/python3.10/site-packages/lorax_server/server.py", line 267, in serve_inner
model = get_model(
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/__init__.py", line 179, in get_model
from lorax_server.models.flash_mistral import FlashMistral
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/flash_mistral.py", line 10, in <module>
from lorax_server.models.custom_modeling.flash_mistral_modeling import (
File "/opt/conda/lib/python3.10/site-packages/lorax_server/models/custom_modeling/flash_mistral_modeling.py", line 49, in <module>
raise ImportError("Mistral model requires flash attn v2")
ImportError: Mistral model requires flash attn v2
rank=0
2024-04-16T14:50:39.314620Z ERROR lorax_launcher: Shard 0 failed to start
2024-04-16T14:50:39.314636Z INFO lorax_launcher: Shutting down shards
```
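For what it's worth, this doesn't look related to quantization: per the traceback, flash_mistral_modeling.py imports the flash-attn v2 kernels at module load time and raises if they are unavailable, and FlashAttention v2 has no kernels for Turing (sm75). A minimal sketch of what that guard seems to do, reconstructed from the log above rather than the actual lorax source (flash_attn_2_cuda is the CUDA extension shipped by the flash-attn v2 package):

```python
import torch

try:
    # flash-attn v2 ships its CUDA extension as flash_attn_2_cuda; the import
    # fails if the package is missing, and pre-Ampere GPUs are rejected too.
    import flash_attn_2_cuda  # noqa: F401
    HAS_FLASH_ATTN_V2 = torch.cuda.get_device_capability(0) >= (8, 0)
except ImportError:
    HAS_FLASH_ATTN_V2 = False

if not HAS_FLASH_ATTN_V2:
    # Corresponds to the line the traceback points at
    # (flash_mistral_modeling.py:49).
    raise ImportError("Mistral model requires flash attn v2")
```

If that reading is right, Mistral on a T4 would need either a FlashAttention v1 / non-flash fallback in lorax or an Ampere-or-newer GPU.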
Expected behavior
The model should load and run!