DeepResearch
# Add llama.cpp local inference support for Mac/local users

## Summary
Add support for running DeepResearch 100% locally using llama.cpp with Metal (Apple Silicon) or CUDA acceleration. Zero API costs, full privacy.
## Why This?
The main inference path requires vLLM with 8x A100 GPUs. This PR adds an alternative for:
- Mac users (M1/M2/M3/M4 with Metal acceleration)
- Local/privacy-focused users
- Developers who want to experiment without GPU server access
- Anyone who wants free, offline research capabilities
## New Files
| File | Description |
|---|---|
| `inference/interactive_llamacpp.py` | ReAct agent CLI that connects to a llama.cpp server |
| `scripts/start_llama_server.sh` | Server startup script with optimized Metal settings |
| `requirements-local.txt` | Minimal deps: `requests`, `duckduckgo-search`, `python-dotenv` |
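The CLI is built around a ReAct loop: the model emits a tool call, the CLI executes it, and the result is fed back into the conversation. A minimal sketch of the parsing step, assuming the model wraps calls in Qwen-style `<tool_call>` JSON tags (the tag format and the `parse_tool_call` helper are illustrative, not the file's actual code):

```python
import json
import re

# Matches a JSON object wrapped in <tool_call>...</tool_call> tags (assumed format).
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*\})\s*</tool_call>", re.DOTALL)

def parse_tool_call(model_output: str):
    """Extract the first JSON tool call from model output, or None."""
    match = TOOL_CALL_RE.search(model_output)
    if match is None:
        return None  # no tool call: treat the output as a final answer
    try:
        return json.loads(match.group(1))
    except json.JSONDecodeError:
        return None  # malformed JSON counts as a failed tool call
```

When `parse_tool_call` returns `None` for well-formed output, the agent stops and presents the text as the answer; a `None` from malformed JSON feeds into the loop-detection counter described under Features.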
## Features
- Free web search: Uses DuckDuckGo (no API key required)
- Page visiting: Uses Jina Reader (optional API key for better results)
- Loop detection: Prevents infinite tool call cycles (3 consecutive errors → force answer)
- 32K context: Long research sessions supported
- Rate limit handling: Exponential backoff retry for DuckDuckGo
- URL validation: Validates URLs before attempting to visit
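The rate-limit and loop-detection behaviors above can be sketched as two small helpers (the names and thresholds here are illustrative; the actual module may structure this differently):

```python
import time

def retry_with_backoff(fn, max_retries=4, base_delay=1.0):
    """Call fn(); on failure wait 1s, 2s, 4s, ... before retrying."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the error
            time.sleep(base_delay * (2 ** attempt))

class LoopDetector:
    """Force a final answer after N consecutive failed tool calls."""
    def __init__(self, limit=3):
        self.limit = limit
        self.consecutive_errors = 0

    def record(self, ok: bool) -> bool:
        """Record a tool-call outcome; return True when the agent should stop."""
        self.consecutive_errors = 0 if ok else self.consecutive_errors + 1
        return self.consecutive_errors >= self.limit
```

A successful tool call resets the counter, so only an unbroken run of three failures triggers the forced answer.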
## Requirements
- llama.cpp built with Metal (`-DLLAMA_METAL=ON`) or CUDA support
- A GGUF model from bartowski
- 32GB+ RAM for Q4_K_M quantization (~18GB model)
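The ~18GB figure follows from the quantization bit-width. A back-of-envelope check, assuming a ~30B-parameter model and Q4_K_M's typical effective rate of roughly 4.85 bits per weight (both figures are approximate):

```python
# Rough weight-memory estimate for a Q4_K_M GGUF (illustrative numbers).
params = 30.5e9          # ~30B parameters (approximate)
bits_per_weight = 4.85   # typical effective rate for Q4_K_M (approximate)

model_gb = params * bits_per_weight / 8 / 1e9
print(f"~{model_gb:.1f} GB weights")  # roughly 18 GB
```

The KV cache for a 32K context plus OS overhead sits on top of the weights, which is why 32GB of RAM is recommended rather than 24GB.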
## Quick Start

```bash
# Install minimal dependencies
pip install -r requirements-local.txt

# Terminal 1: start the server
./scripts/start_llama_server.sh

# Terminal 2: run research queries
python inference/interactive_llamacpp.py
```
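Under the hood, the CLI talks to llama.cpp's OpenAI-compatible `/v1/chat/completions` endpoint. A minimal request sketch (the port and sampling parameters here are assumptions; check `start_llama_server.sh` for the actual values):

```python
import requests

def build_chat_request(messages, base_url="http://127.0.0.1:8080"):
    """Build the URL and JSON payload for a llama.cpp chat completion."""
    payload = {
        "messages": messages,       # OpenAI-style [{"role": ..., "content": ...}]
        "temperature": 0.6,         # assumed default, not the script's actual value
        "max_tokens": 2048,
    }
    return f"{base_url}/v1/chat/completions", payload

def chat(messages):
    """POST the request and return the assistant's reply text."""
    url, payload = build_chat_request(messages)
    resp = requests.post(url, json=payload, timeout=600)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
```

Because the endpoint follows the OpenAI wire format, the same client code works against either the local server or a hosted API by changing `base_url`.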
## Testing
Tested on Apple M1 Max with 32GB RAM:
- Model loads in ~30-60 seconds
- Inference runs at ~10-15 tokens/sec
- Tool calls (search, visit) work correctly
- Loop detection prevents runaway tool calls
## Related
This is a cleaner alternative to PR #220 (MLX support), which had issues with chat template handling. llama.cpp is more mature and widely used.