llama-stack-apps
Step-by-step instructions to install and run the Llama Stack on Linux and Mac
I managed to make the Llama Stack server and client work with Ollama on both EC2 (with 24GB GPU) and Mac (tested on 2021 M1 and 2019 2.4GHz i9 MBP, both with 32GB memory). Steps are below:
- Open one Terminal, go to your work directory, then:
git clone https://github.com/meta-llama/llama-agentic-system
cd llama-agentic-system
conda create -n llama-stack python=3.10
conda activate llama-stack
pip install -r requirements.txt
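As a quick sanity check that the environment is ready, here is a minimal Python snippet (standard library only; it assumes the llama CLI used in the later steps was installed by the requirements above):
import shutil
import sys

# Confirm the Python version and that the llama CLI is on the PATH of the
# active llama-stack conda environment; None means the install did not finish.
print("python:", sys.version.split()[0])
print("llama CLI:", shutil.which("llama"))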
- If you're on Linux, run:
curl -fsSL https://ollama.com/install.sh | sh
Otherwise, download the Ollama zip for Mac here, unzip it, move Ollama.app to the Applications folder, and double-click it to launch.
- On the same Terminal, run:
ollama pull llama3.1:8b-instruct-fp16
to download the Llama 3.1 8B model and then run:
ollama run llama3.1:8b-instruct-fp16
to confirm it works by entering a question and checking Llama's answer.
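If you prefer to confirm the model programmatically rather than through the interactive prompt, here is a minimal sketch that calls Ollama's REST API directly (it assumes Ollama is listening on its default port 11434, the same URL the Llama Stack configuration uses below; the example question is just an illustration):
import json
import urllib.request

# Ask the locally pulled model a question through Ollama's /api/generate endpoint.
payload = {
    "model": "llama3.1:8b-instruct-fp16",
    "prompt": "In one sentence, what is the capital of Switzerland?",
    "stream": False,
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])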
- Now run the command below to install Llama Stack's Ollama distribution:
llama distribution install --spec local-ollama --name ollama
You should see the prompts below; hit Enter to accept the default settings, except answer n to the two questions about llama_guard_shield and prompt_guard_shield:
Successfully setup distribution environment. Configuring...
Configuring API surface: inference
Enter value for url (default: http://localhost:11434):
Configuring API surface: safety
Do you want to configure llama_guard_shield? (y/n): n
Do you want to configure prompt_guard_shield? (y/n): n
Configuring API surface: agentic_system
YAML configuration has been written to /Users/<your_name>/.llama/distributions/ollama/config.yaml
Distribution ollama (with spec local-ollama) has been installed successfully!
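If you are curious about what was configured, you can print the generated file; this sketch just dumps it as text, so it makes no assumptions about the YAML's keys (the path is the one reported in the output above):
from pathlib import Path

# Print the config written by the llama distribution install step.
config_path = Path.home() / ".llama" / "distributions" / "ollama" / "config.yaml"
print(config_path.read_text())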
- Launch the ollama distribution by running:
llama distribution start --name ollama --port 5000
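Once it starts, you should see the Uvicorn log line shown further below in this Terminal. If you want to check from code that the server is actually listening on port 5000 before running the examples, here is a minimal sketch (just a TCP connect, no assumptions about the server's endpoints):
import socket

# Try to open a TCP connection to the Llama Stack server started above.
try:
    with socket.create_connection(("localhost", 5000), timeout=3):
        print("server is listening on port 5000")
except OSError as e:
    print("not reachable yet:", e)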
- Finally, on another Terminal, go to the llama-agentic-system folder, then:
conda activate ollama
and either (on Mac)
python examples/scripts/vacation.py localhost 5000 --disable_safety
or (on Linux)
python examples/scripts/vacation.py [::] 5000 --disable_safety
You should see output starting with the following (Note: if you run the script right after launching the distribution in the previous step, especially on a slower machine such as the 2019 Mac with the 2.4GHz i9, you may see "httpcore.ReadTimeout" because the Llama model is still being loaded; waiting a moment and retrying a few times should work):
User> I am planning a trip to Switzerland, what are the top 3 places to visit?
StepType.inference> Switzerland is a beautiful country with a rich history, stunning landscapes, and vibrant culture. Here are three top places to visit in Switzerland:
- Jungfraujoch: Also known as the "Top of Europe," Jungfraujoch is the highest train station in Europe, located at an altitude of 3,454 meters (11,332 feet) above sea level. It offers breathtaking views of the surrounding mountains and glaciers, including the iconic Eiger, Mönch, and Jungfrau peaks.
and on the first Terminal that runs llama distribution start --name ollama --port 5000, you should see:
INFO: Uvicorn running on http://[::]:5000 (Press CTRL+C to quit)
Environment: ipython
Tools: brave_search, wolfram_alpha, photogen
Cutting Knowledge Date: December 2023
Today Date: 09 August 2024
INFO: ::1:50987 - "POST /agentic_system/create HTTP/1.1" 200 OK
INFO: ::1:50988 - "POST /agentic_system/session/create HTTP/1.1" 200 OK
INFO: ::1:50989 - "POST /agentic_system/turn/create HTTP/1.1" 200 OK
role='user' content='I am planning a trip to Switzerland, what are the top 3 places to visit?'
Pulling model: llama3.1:8b-instruct-fp16
Assistant: Switzerland is a beautiful country with a rich history, stunning landscapes, and vibrant culture. Here are three top places to visit in Switzerland:
- Jungfraujoch: Also known as the "Top of Europe," Jungfraujoch is a mountain peak located in the Bernese Alps. It's the highest train station in Europe, offering breathtaking views of the surrounding mountains, glaciers, and valleys. You can take a ride on the Jungfrau Railway, which takes you to the summit, where you can enjoy stunning vistas, visit the Ice Palace, and even ski or snowboard in the winter.
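As noted above, the first run can fail with "httpcore.ReadTimeout" while Ollama is still loading the model. If you'd rather not retry by hand, here is a small sketch that simply re-runs the example script a few times; the script path and arguments are the ones from this walkthrough (run it from the llama-agentic-system folder, and replace localhost with [::] on Linux, as above):
import subprocess
import sys
import time

# Re-run the example a few times, giving Ollama time to finish loading the model
# between attempts; the first successful run ends the loop.
cmd = [sys.executable, "examples/scripts/vacation.py", "localhost", "5000", "--disable_safety"]
for attempt in range(1, 6):
    print(f"attempt {attempt}...")
    if subprocess.run(cmd).returncode == 0:
        break
    time.sleep(20)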
Bonus: To see tool calling in action (see here and here for more info), try the hello.py example, which asks Llama "Which players played in the winning team of the NBA western conference semifinals of 2024, please use tools" (a question whose answer needs a web search tool), followed by the prompt "Hello". On Mac, run (replace localhost with [::] on Linux):
python examples/scripts/hello.py localhost 5000 --disable_safety
And you should see output that includes "BuiltinTool.brave_search", as below (if you see "httpcore.ReadTimeout", retrying should work):
User> Hello
StepType.inference> Hello! How can I assist you today?
User> Which players played in the winning team of the NBA western conference semifinals of 2024, please use tools
StepType.inference> brave_search.call(query="NBA Western Conference Semifinals 2024 winning team players")
StepType.tool_execution> Tool:BuiltinTool.brave_search Args:{'query': 'NBA Western Conference Semifinals 2024 winning team players'}
StepType.tool_execution> Tool:BuiltinTool.brave_search Response:{"query": null, "top_k": []}
StepType.shield_call> No Violation
StepType.inference> I need to search for information about the 2024 NBA Western Conference Semifinals.
If you delete "please use tools" from the prompt in hello.py, not wanting to beg, you'll likely see this output instead:
I'm not able to provide real-time information. However, I can suggest some possible sources where you may be able to find the information you are looking for.
By setting an appropriate system prompt, or switching to a larger Llama 3.1 model (details coming soon), you'll see that you don't have to be so polite that Llama is comfortable but you aren't.
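For example, as a quick side experiment against Ollama directly (not the Llama Stack agentic flow used above), you can compare the same question with and without a system prompt; the endpoint and field names below are Ollama's chat API, and the system prompt wording is only an illustration, not the one used by hello.py:
import json
import urllib.request

# Send a chat request to Ollama's /api/chat endpoint and return the reply text.
def chat(messages):
    payload = {"model": "llama3.1:8b-instruct-fp16", "messages": messages, "stream": False}
    req = urllib.request.Request(
        "http://localhost:11434/api/chat",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["message"]["content"]

question = {"role": "user", "content": "Which players played in the winning team of the NBA western conference semifinals of 2024?"}
# Without a system prompt, then with one nudging the model toward using a search tool.
print(chat([question]))
print(chat([{"role": "system", "content": "You have a web search tool; when a question needs current facts, say which search you would run."}, question]))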