make install-server does not support the Apple macOS Metal framework
System Info
- Please either remove the brew/macOS information from the README altogether, so it doesn't confuse users,
- OR add support for the Apple MPS framework to the ./server folder so the make command can install on macOS.
Information
- [ ] Docker
- [X] The CLI directly
Tasks
- [X] An officially supported command
- [ ] My own modifications
Reproduction
# on macOS M-series:
make install-server
This tries to install flash-attention, which requires NVIDIA CUDA.
Expected behavior
make install-server should be optimized for the Apple Metal framework on macOS.
It would be great to see this shipped via Homebrew on macOS.
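As a rough illustration, the install step could gate the CUDA-only dependencies behind a platform check. A minimal sketch in shell, where the install-flash-attention target name is an assumption for illustration, not taken from the repo's actual Makefile:

# Hypothetical guard for the server install step; "install-flash-attention"
# is an illustrative target name, not confirmed from the repo's Makefile.
if [ "$(uname -s)" = "Darwin" ]; then
  echo "macOS detected: skipping flash-attention (requires NVIDIA CUDA)"
  # a Metal/MPS-backed attention implementation could be installed here instead
else
  make install-flash-attention
fi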
@ankane, maybe you can work the same magic for Hugging Face's text-generation-inference (TGI) that you did for TEI?
Here's a formula that works with the main branch (brew install text-generation-inference --head):
class TextGenerationInference < Formula
  include Language::Python::Virtualenv

  desc "Large Language Model Text Generation Inference"
  homepage "https://hf.co/docs/text-generation-inference"
  url "https://github.com/huggingface/text-generation-inference/archive/refs/tags/v3.1.0.tar.gz"
  sha256 "26b3844e03b089678901c67d639c1a97effdc8bd6e3a361bba709b1695edc573"
  license "Apache-2.0"
  head "https://github.com/huggingface/text-generation-inference.git", branch: "main"

  depends_on "cmake" => :build
  depends_on "rust" => :build
  depends_on "uv" => :build
  depends_on "protobuf"
  depends_on "python@3.13"

  def install
    system "cargo", "install", *std_cargo_args(path: "backends/v3")
    system "cargo", "install", *std_cargo_args(path: "launcher")

    # prevent error with outlines installation due to location of uv cache
    rm "Cargo.toml"

    venv = virtualenv_create(libexec, "python3.13", system_site_packages: false)
    ENV["VIRTUAL_ENV"] = venv.root
    uv = Formula["uv"].opt_bin/"uv"
    cd "server" do
      system uv, "run", "--active", "--extra", "gen", "--", "make", "gen-server-raw"
      system uv, "pip", "install", ".[accelerate,compressed-tensors,quantize,peft,outlines]"
    end

    bin.install_symlink libexec/"bin/text-generation-server"
  end

  test do
    port = free_port
    fork do
      exec bin/"text-generation-launcher", "-p", port.to_s
    end

    data = "{\"inputs\":\"What is Deep Learning?\",\"parameters\":{\"max_new_tokens\":1}}"
    header = "Content-Type: application/json"
    retries = "--retry 10 --retry-connrefused"
    assert_match "generated_text", shell_output("curl -s 127.0.0.1:#{port}/generate_stream -X POST -d '#{data}' -H '#{header}' #{retries}")
  end
end
There's a linkage error (that doesn't seem to affect the server) and the test downloads a lot of data (> 1 GB), but these can probably be addressed.
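For reference, a quick manual smoke test after a successful install could look like this (the model ID is just an example; any small model should do):

# start the launcher on a fixed port, then query the /generate endpoint
text-generation-launcher --model-id HuggingFaceTB/SmolLM2-135M-Instruct --port 8080 &
curl 127.0.0.1:8080/generate -X POST \
  -H 'Content-Type: application/json' \
  -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}'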
After brew install text-generation-inference --head, text-generation-launcher is also missing. FYI, I'm on macOS 15 with an M1 (Apple Silicon). @ankane
/opt/anaconda3/bin/text-generation-server --help
2025-02-10 10:24:55.396 | INFO | text_generation_server.utils.import_utils:<module>:80 - Detected system cpu
Traceback (most recent call last):
  File "/opt/anaconda3/bin/text-generation-server", line 5, in <module>
    from text_generation_server.cli import app
  File "/Volumes/OWCExpress1M2/Users/dberezenko/git/text-generation-inference/server/text_generation_server/cli.py", line 10, in <module>
    from text_generation_server.utils.adapter import parse_lora_adapters
  File "/Volumes/OWCExpress1M2/Users/dberezenko/git/text-generation-inference/server/text_generation_server/utils/__init__.py", line 13, in <module>
    from text_generation_server.utils.tokens import (
  File "/Volumes/OWCExpress1M2/Users/dberezenko/git/text-generation-inference/server/text_generation_server/utils/tokens.py", line 5, in <module>
    from text_generation_server.pb import generate_pb2
  File "/Volumes/OWCExpress1M2/Users/dberezenko/git/text-generation-inference/server/text_generation_server/pb/generate_pb2.py", line 12, in <module>
    _runtime_version.ValidateProtobufRuntimeVersion(
  File "/opt/anaconda3/lib/python3.12/site-packages/google/protobuf/runtime_version.py", line 106, in ValidateProtobufRuntimeVersion
    _ReportVersionError(
  File "/opt/anaconda3/lib/python3.12/site-packages/google/protobuf/runtime_version.py", line 47, in _ReportVersionError
    raise VersionError(msg)
google.protobuf.runtime_version.VersionError: Detected incompatible Protobuf Gencode/Runtime versions when loading generate.proto: gencode 5.29.0 runtime 5.28.3. Runtime version cannot be older than the linked gencode version. See Protobuf version guarantees at https://protobuf.dev/support/cross-version-runtime-guarantee.
It looks like you're running a different installation of TGI (/opt/anaconda3/bin/text-generation-server isn't the Homebrew install).
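If you do want to fix the Anaconda install in place, the error message itself points at the remedy: the protobuf runtime must be at least as new as the gencode version. A sketch (the version pin comes from the traceback above; the gen-server target is an assumption about the repo's server Makefile):

# runtime (5.28.3) is older than the gencode (5.29.0); upgrade the runtime
pip install --upgrade 'protobuf>=5.29.0'
# or regenerate the stubs against the installed runtime
cd server && make gen-server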
brew install text-generation-inference --head
Warning: No available formula with the name "text-generation-inference". Did you mean text-embeddings-inference?
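That warning is expected: the formula above was never merged into homebrew-core, so brew can't resolve it by name. One way to try it anyway is to save the formula to a local file and install from that path, for example:

# save the formula above as text-generation-inference.rb, then:
brew install --HEAD ./text-generation-inference.rb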
@ankane
Would it be of any help that LM Studio has implemented MLX? There's also Anemll, an ANE library that works with MLX (MIT-licensed), and FastMLX (Apache-2.0-licensed).
FYI, there's a Metal Flash Attention implementation available here.
fwiw, I submitted 3.1.1 to Homebrew, but couldn't get it fully working: https://github.com/Homebrew/homebrew-core/pull/209731
Someone is welcome to resubmit if they can figure it out.
A PoC of Metal Flash Attention with Python, C, and Rust bindings for non-MLX models on Apple Silicon:
https://github.com/bghira/universal-metal-flash-attention