
Add llama.cpp local inference support for Mac/local users

chindris-mihai-alexandru opened this issue 4 months ago · 0 comments

Summary

Add support for running DeepResearch 100% locally using llama.cpp with Metal (Apple Silicon) or CUDA acceleration. Zero API costs, full privacy.

Why This?

The main inference path requires vLLM with 8x A100 GPUs. This PR adds an alternative for:

  • Mac users (M1/M2/M3/M4 with Metal acceleration)
  • Local/privacy-focused users
  • Developers who want to experiment without GPU server access
  • Anyone who wants free, offline research capabilities

New Files

| File | Description |
| --- | --- |
| `inference/interactive_llamacpp.py` | ReAct agent CLI that connects to the llama.cpp server |
| `scripts/start_llama_server.sh` | Server startup script with optimized Metal settings |
| `requirements-local.txt` | Minimal deps: `requests`, `duckduckgo-search`, `python-dotenv` |

Features

  • Free web search: Uses DuckDuckGo (no API key required)
  • Page visiting: Uses Jina Reader (optional API key for better results)
  • Loop detection: Prevents infinite tool call cycles (3 consecutive errors → force answer)
  • 32K context: Supports long research sessions
  • Rate limit handling: Exponential backoff retries for DuckDuckGo
  • URL validation: Validates URLs before attempting to visit them
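
The loop-detection rule above can be sketched as a small counter: reset on every successful tool call, and force a final answer once three consecutive calls fail. The class and method names here are illustrative, not the repo's actual code.

```python
# Sketch of the loop-detection feature: 3 consecutive tool errors -> force answer.
# Names and threshold default are assumptions for illustration.
class LoopDetector:
    def __init__(self, max_consecutive_errors=3):
        self.max_errors = max_consecutive_errors
        self.consecutive_errors = 0

    def record(self, tool_succeeded):
        """Update the counter after each tool call; success resets it."""
        if tool_succeeded:
            self.consecutive_errors = 0
        else:
            self.consecutive_errors += 1

    def should_force_answer(self):
        """True once the agent should stop calling tools and answer."""
        return self.consecutive_errors >= self.max_errors
```

Resetting on success is the important design choice: it only trips on an unbroken run of failures, so a long session with occasional errors keeps going.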

Requirements

  • llama.cpp built with Metal (`-DLLAMA_METAL=ON`) or CUDA support
  • GGUF model from bartowski
  • 32GB+ RAM for Q4_K_M quantization (~18GB model)
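
For context, a startup script like `scripts/start_llama_server.sh` could wrap llama.cpp's `llama-server` roughly as below. The model path, port, and layer count are assumptions, and the command is printed rather than executed here so the flags are easy to inspect before launching.

```shell
#!/bin/sh
# Illustrative sketch of a llama-server launch -- the flags are standard
# llama-server options, but the model path and port are assumptions.
MODEL_PATH="${MODEL_PATH:-models/model-Q4_K_M.gguf}"  # hypothetical default path
CTX=32768   # 32K context window, matching the feature list above
NGL=99      # offload all layers to the GPU (Metal or CUDA)
CMD="llama-server -m $MODEL_PATH -c $CTX -ngl $NGL --host 127.0.0.1 --port 8080"
echo "$CMD"   # inspect, then run the printed command to start the server
```

With `-ngl 99` all layers are offloaded to the GPU, which is where the Metal speedup on Apple Silicon comes from.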

Quick Start

```shell
# Install minimal dependencies
pip install -r requirements-local.txt

# Terminal 1: Start the server
./scripts/start_llama_server.sh

# Terminal 2: Run research queries
python inference/interactive_llamacpp.py
```
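
Once the server is up, the CLI's round trip is essentially a POST to `llama-server`'s OpenAI-compatible chat endpoint. A minimal sketch, assuming the default port and using only the stdlib for self-containment (`requirements-local.txt` uses `requests` instead); the URL and sampling parameters are assumptions, not the repo's values.

```python
# Sketch of one client turn against llama-server's OpenAI-compatible API.
# Port, max_tokens, and temperature are illustrative assumptions.
import json
import urllib.request

SERVER_URL = "http://127.0.0.1:8080/v1/chat/completions"  # assumed default port

def build_request(messages, max_tokens=1024, temperature=0.6):
    """Assemble the JSON payload for an OpenAI-compatible chat completion."""
    return {
        "messages": messages,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "stream": False,  # simplest case; a real CLI may stream tokens
    }

def ask(question):
    """Send one user turn to the local server and return the reply text."""
    payload = build_request([{"role": "user", "content": question}])
    req = urllib.request.Request(
        SERVER_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=600) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

The real agent loop would wrap `ask` with the ReAct prompt, tool parsing, and the loop detection described above.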

Testing

Tested on Apple M1 Max with 32GB RAM:

  • Model loads in ~30-60 seconds
  • Inference runs at ~10-15 tokens/sec
  • Tool calls (search, visit) work correctly
  • Loop detection prevents runaway tool calls

Related

This is a cleaner alternative to PR #220 (MLX support), which had issues with chat template handling. llama.cpp is more mature and widely used.