DeepResearch
# Add llama.cpp local inference support for Mac/local users

## Summary
Add support for running DeepResearch 100% locally using llama.cpp with Metal (Apple Silicon) or CUDA acceleration. Zero API costs, full privacy.
## Why This?
The main inference path requires vLLM with 8x A100 GPUs. This PR adds an alternative for:
- Mac users (M1/M2/M3/M4 with Metal acceleration)
- Local/privacy-focused users
- Developers who want to experiment without GPU server access
- Anyone who wants free, offline research capabilities
## New Files
| File | Description |
|---|---|
| `inference/interactive_llamacpp.py` | ReAct agent CLI that connects to a llama.cpp server |
| `scripts/start_llama_server.sh` | Server startup script with optimized Metal settings |
| `requirements-local.txt` | Minimal deps: `requests`, `duckduckgo-search`, `python-dotenv` |
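The CLI is built around a ReAct loop: the model emits a tool call, the CLI executes it, and the result is fed back into the conversation. A minimal sketch of the parsing step, assuming the model wraps calls in Qwen-style `<tool_call>` JSON tags (the tag format and the `parse_tool_call` helper are illustrative, not the file's actual code):

```python
import json
import re

# Matches a JSON object wrapped in <tool_call>...</tool_call> tags (assumed format).
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*\})\s*</tool_call>", re.DOTALL)

def parse_tool_call(model_output: str):
    """Extract the first JSON tool call from model output, or None."""
    match = TOOL_CALL_RE.search(model_output)
    if match is None:
        return None  # no tool call: treat the output as a final answer
    try:
        return json.loads(match.group(1))
    except json.JSONDecodeError:
        return None  # malformed JSON counts as a failed tool call
```

When `parse_tool_call` returns `None` for well-formed output, the agent stops and presents the text as the answer; a `None` from malformed JSON feeds into the loop-detection counter described under Features.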
## Features
- Free web search: Uses DuckDuckGo (no API key required)
- Page visiting: Uses Jina Reader (optional API key for better results)
- Loop detection: Prevents infinite tool call cycles (3 consecutive errors → force answer)
- 32K context: Long research sessions supported
- Rate limit handling: Exponential backoff retry for DuckDuckGo
- URL validation: Validates URLs before attempting to visit
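The rate-limit and loop-detection behaviors above can be sketched as two small helpers (the names and thresholds here are illustrative; the actual module may structure this differently):

```python
import time

def retry_with_backoff(fn, max_retries=4, base_delay=1.0):
    """Call fn(); on failure wait 1s, 2s, 4s, ... before retrying."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the error
            time.sleep(base_delay * (2 ** attempt))

class LoopDetector:
    """Force a final answer after N consecutive failed tool calls."""
    def __init__(self, limit=3):
        self.limit = limit
        self.consecutive_errors = 0

    def record(self, ok: bool) -> bool:
        """Record a tool-call outcome; return True when the agent should stop."""
        self.consecutive_errors = 0 if ok else self.consecutive_errors + 1
        return self.consecutive_errors >= self.limit
```

A successful tool call resets the counter, so only an unbroken run of three failures triggers the forced answer.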
## Requirements
- llama.cpp built with Metal (`-DLLAMA_METAL=ON`) or CUDA support
- A GGUF model from bartowski
- 32GB+ RAM for Q4_K_M quantization (~18GB model)
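The ~18GB figure follows from the quantization bit-width. A back-of-envelope check, assuming a ~30B-parameter model and Q4_K_M's typical effective rate of roughly 4.85 bits per weight (both figures are approximate):

```python
# Rough weight-memory estimate for a Q4_K_M GGUF (illustrative numbers).
params = 30.5e9          # ~30B parameters (approximate)
bits_per_weight = 4.85   # typical effective rate for Q4_K_M (approximate)

model_gb = params * bits_per_weight / 8 / 1e9
print(f"~{model_gb:.1f} GB weights")  # roughly 18 GB
```

The KV cache for a 32K context plus OS overhead sits on top of the weights, which is why 32GB of RAM is recommended rather than 24GB.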
## Quick Start

```bash
# Install minimal dependencies
pip install -r requirements-local.txt

# Terminal 1: start the server
./scripts/start_llama_server.sh

# Terminal 2: run research queries
python inference/interactive_llamacpp.py
```
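Under the hood, the CLI talks to llama.cpp's OpenAI-compatible `/v1/chat/completions` endpoint. A minimal request sketch (the port and sampling parameters here are assumptions; check `start_llama_server.sh` for the actual values):

```python
import requests

def build_chat_request(messages, base_url="http://127.0.0.1:8080"):
    """Build the URL and JSON payload for a llama.cpp chat completion."""
    payload = {
        "messages": messages,       # OpenAI-style [{"role": ..., "content": ...}]
        "temperature": 0.6,         # assumed default, not the script's actual value
        "max_tokens": 2048,
    }
    return f"{base_url}/v1/chat/completions", payload

def chat(messages):
    """POST the request and return the assistant's reply text."""
    url, payload = build_chat_request(messages)
    resp = requests.post(url, json=payload, timeout=600)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
```

Because the endpoint follows the OpenAI wire format, the same client code works against either the local server or a hosted API by changing `base_url`.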
## Testing
Tested on Apple M1 Max with 32GB RAM:
- Model loads in ~30-60 seconds
- Inference runs at ~10-15 tokens/sec
- Tool calls (search, visit) work correctly
- Loop detection prevents runaway tool calls
## Related
This is a cleaner alternative to PR #220 (MLX support), which had issues with chat template handling. llama.cpp is more mature and widely used.