make install-server does not support the Apple macOS Metal framework
System Info
- Please either remove the brew/macOS information from the README altogether, so it doesn't confuse users,
- OR add support for the Apple MPS framework to the ./server folder so the make command can install on macOS.
Information
- [ ] Docker
- [X] The CLI directly
Tasks
- [X] An officially supported command
- [ ] My own modifications
Reproduction
# on macOS M-series:
make install-server
This tries to install flash-attention, which requires NVIDIA CUDA.
Expected behavior
make install-server should be optimized for the Apple Metal framework on macOS.
It would be great to see this shipped via Homebrew on macOS.
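As a rough illustration, the install step could gate the CUDA-only dependencies behind a platform check. A minimal sketch in shell, where the install-flash-attention target name is an assumption for illustration, not taken from the repo's actual Makefile:

# Hypothetical guard for the server install step; "install-flash-attention"
# is an illustrative target name, not confirmed from the repo's Makefile.
if [ "$(uname -s)" = "Darwin" ]; then
  echo "macOS detected: skipping flash-attention (requires NVIDIA CUDA)"
  # a Metal/MPS-backed attention implementation could be installed here instead
else
  make install-flash-attention
fi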
@ankane, maybe you can work the same magic for Hugging Face's text-generation-inference (TGI) that you did for TEI?
Here's a formula that works with the main branch (brew install text-generation-inference --head):
class TextGenerationInference < Formula
  include Language::Python::Virtualenv

  desc "Large Language Model Text Generation Inference"
  homepage "https://hf.co/docs/text-generation-inference"
  url "https://github.com/huggingface/text-generation-inference/archive/refs/tags/v3.1.0.tar.gz"
  sha256 "26b3844e03b089678901c67d639c1a97effdc8bd6e3a361bba709b1695edc573"
  license "Apache-2.0"
  head "https://github.com/huggingface/text-generation-inference.git", branch: "main"

  depends_on "cmake" => :build
  depends_on "rust" => :build
  depends_on "uv" => :build
  depends_on "protobuf"
  depends_on "python@3.13"

  def install
    system "cargo", "install", *std_cargo_args(path: "backends/v3")
    system "cargo", "install", *std_cargo_args(path: "launcher")

    # prevent error with outlines installation due to location of uv cache
    rm "Cargo.toml"

    venv = virtualenv_create(libexec, "python3.13", system_site_packages: false)
    ENV["VIRTUAL_ENV"] = venv.root
    uv = Formula["uv"].opt_bin/"uv"
    cd "server" do
      system uv, "run", "--active", "--extra", "gen", "--", "make", "gen-server-raw"
      system uv, "pip", "install", ".[accelerate,compressed-tensors,quantize,peft,outlines]"
    end

    bin.install_symlink libexec/"bin/text-generation-server"
  end

  test do
    port = free_port
    fork do
      exec bin/"text-generation-launcher", "-p", port.to_s
    end

    data = "{\"inputs\":\"What is Deep Learning?\",\"parameters\":{\"max_new_tokens\":1}}"
    header = "Content-Type: application/json"
    retries = "--retry 10 --retry-connrefused"
    assert_match "generated_text", shell_output("curl -s 127.0.0.1:#{port}/generate_stream -X POST -d '#{data}' -H '#{header}' #{retries}")
  end
end
There's a linkage error (that doesn't seem to affect the server) and the test downloads a lot of data (> 1 GB), but these can probably be addressed.
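For reference, a quick manual smoke test after a successful install could look like this (the model ID is just an example; any small model should do):

# start the launcher on a fixed port, then query the /generate endpoint
text-generation-launcher --model-id HuggingFaceTB/SmolLM2-135M-Instruct --port 8080 &
curl 127.0.0.1:8080/generate -X POST \
  -H 'Content-Type: application/json' \
  -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":20}}'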
After brew install text-generation-inference --head, text-generation-launcher is also missing. FYI, I'm on macOS 15 with an M1 (Apple Silicon). @ankane
/opt/anaconda3/bin/text-generation-server --help
2025-02-10 10:24:55.396 | INFO | text_generation_server.utils.import_utils:<module>:80 - Detected system cpu
Traceback (most recent call last):
  File "/opt/anaconda3/bin/text-generation-server", line 5, in <module>
    from text_generation_server.cli import app
  File "/Volumes/OWCExpress1M2/Users/dberezenko/git/text-generation-inference/server/text_generation_server/cli.py", line 10, in <module>
    from text_generation_server.utils.adapter import parse_lora_adapters
  File "/Volumes/OWCExpress1M2/Users/dberezenko/git/text-generation-inference/server/text_generation_server/utils/__init__.py", line 13, in <module>
    from text_generation_server.utils.tokens import (
  File "/Volumes/OWCExpress1M2/Users/dberezenko/git/text-generation-inference/server/text_generation_server/utils/tokens.py", line 5, in <module>
    from text_generation_server.pb import generate_pb2
  File "/Volumes/OWCExpress1M2/Users/dberezenko/git/text-generation-inference/server/text_generation_server/pb/generate_pb2.py", line 12, in <module>
    _runtime_version.ValidateProtobufRuntimeVersion(
  File "/opt/anaconda3/lib/python3.12/site-packages/google/protobuf/runtime_version.py", line 106, in ValidateProtobufRuntimeVersion
    _ReportVersionError(
  File "/opt/anaconda3/lib/python3.12/site-packages/google/protobuf/runtime_version.py", line 47, in _ReportVersionError
    raise VersionError(msg)
google.protobuf.runtime_version.VersionError: Detected incompatible Protobuf Gencode/Runtime versions when loading generate.proto: gencode 5.29.0 runtime 5.28.3. Runtime version cannot be older than the linked gencode version. See Protobuf version guarantees at https://protobuf.dev/support/cross-version-runtime-guarantee.
It looks like you're running a different installation of TGI (/opt/anaconda3/bin/text-generation-server isn't the Homebrew install).
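If you do want to fix the Anaconda install in place, the error message itself points at the remedy: the protobuf runtime must be at least as new as the gencode version. A sketch (the version pin comes from the traceback above; the gen-server target is an assumption about the repo's server Makefile):

# runtime (5.28.3) is older than the gencode (5.29.0); upgrade the runtime
pip install --upgrade 'protobuf>=5.29.0'
# or regenerate the stubs against the installed runtime
cd server && make gen-server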
brew install text-generation-inference --head
Warning: No available formula with the name "text-generation-inference". Did you mean text-embeddings-inference?
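That warning is expected: the formula above was never merged into homebrew-core, so brew can't resolve it by name. One way to try it anyway is to save the formula to a local file and install from that path, for example:

# save the formula above as text-generation-inference.rb, then:
brew install --HEAD ./text-generation-inference.rb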
@ankane
Would it be of any help that LM Studio has implemented MLX? There's also Anemll, an ANE library that works with MLX (MIT-licensed), and FastMLX (Apache-2.0-licensed).
FYI, there's a Metal Flash Attention implementation available here.
fwiw, I submitted 3.1.1 to Homebrew, but couldn't get it fully working: https://github.com/Homebrew/homebrew-core/pull/209731
Someone is welcome to resubmit if they can figure it out.
A PoC of Metal Flash Attention with Python, C, and Rust bindings for non-MLX models on Apple Silicon:
https://github.com/bghira/universal-metal-flash-attention