qwen2.5:32b running locally on a decent Mac or mini PC now beats GPT-3.5 on most developer tasks. It’s free, offline, and every token stays on your hardware. Self-hosting your AI stack stopped being a nerd flex about six months ago. Now it’s just good engineering.
Here’s the stack, in the order you should actually deploy it.
Why Self-Host in 2026?
Three reasons that have gotten more compelling, not less:
Cost. If you’re running agents at any volume — a few hundred calls a day — API costs add up fast. A $400 mini PC running Qwen 2.5 or Llama 3.3 pays for itself in a few weeks against Sonnet-tier API pricing.
Privacy. Code, documents, emails, internal data — none of it leaves your machine. For anything client-facing or sensitive, this isn’t optional.
Reliability. Your homelab doesn’t have Anthropic’s uptime, but it also doesn’t have their outages. For workflows where you control the hardware, self-hosted models have zero rate limits and no 429s.
The tradeoff is real: local models aren’t as capable as Claude Opus or GPT-4o for complex reasoning tasks. The sweet spot is using local models for high-volume, lower-stakes work and the Claude/OpenAI API for tasks where frontier intelligence actually matters.
The Stack (in order)
1. Ollama — Local Model Engine
Ollama is the foundation. It handles model downloading, quantization, and serving — you get a local REST API that’s OpenAI-compatible out of the box.
macOS / Linux:
curl -fsSL https://ollama.com/install.sh | sh
Windows: Download the installer from ollama.com/download.
Pull and run a model:
ollama run qwen2.5:32b # Best reasoning per dollar right now
ollama run llama3.2:3b # Fast, low RAM (good for classification)
ollama run phi4 # Microsoft's compact reasoning model
ollama run gemma3:27b # Google's latest open model
Ollama exposes a local API at http://localhost:11434, with an OpenAI-compatible endpoint at http://localhost:11434/v1. Anything built for OpenAI's SDK works with Ollama after a one-line base-URL change.
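A minimal stdlib-only sketch of that compatibility, assuming Ollama is serving qwen2.5:32b locally (the prompt and helper names are illustrative):

```python
import json
import urllib.request

OLLAMA_V1 = "http://localhost:11434/v1/chat/completions"

def chat_payload(prompt: str, model: str = "qwen2.5:32b") -> dict:
    """Build an OpenAI-style chat completion body for Ollama's /v1 endpoint."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }

def ask(prompt: str) -> str:
    """POST to the local endpoint. Requires Ollama running; nothing leaves localhost."""
    req = urllib.request.Request(
        OLLAMA_V1,
        data=json.dumps(chat_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]

# ask("Explain mmap in one sentence.")  # uncomment with Ollama running
```

The same request body works against OpenAI's API; only the base URL differs.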
Hardware minimums:
- 7B models: 8GB RAM (runs on M-series Macs, most modern laptops)
- 32B models: 32GB RAM (Mac Studio, mini PCs with upgraded RAM)
- 70B+ models: 64GB+ (dedicated server or multi-GPU setup)
For most developers: a Mac mini M4 Pro with 48GB unified memory runs 32B models comfortably and costs ~$1,400. ROI against API costs is under 2 months at any serious usage volume.
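The payback claim is simple arithmetic. A hedged sketch with plausible but made-up numbers (1,000 calls/day at ~2k tokens each, $15 per million tokens as a stand-in for Sonnet-tier output pricing):

```python
def payback_months(hardware_cost: float, calls_per_day: int,
                   tokens_per_call: int, usd_per_million_tokens: float) -> float:
    """Months until the hardware costs less than the same volume via API."""
    monthly_tokens = calls_per_day * tokens_per_call * 30
    monthly_api_cost = monthly_tokens / 1_000_000 * usd_per_million_tokens
    return hardware_cost / monthly_api_cost

# Illustrative only: the ~$1,400 Mac mini against 1,000 calls/day.
print(round(payback_months(1400, 1000, 2000, 15.0), 1))  # → 1.6
```

Plug in your own volume and current API pricing; the break-even point moves fast with call volume.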
2. Open WebUI — The Interface
Open WebUI turns Ollama into something your whole team can use. ChatGPT-style interface, document uploads, RAG, tool calling, multi-user auth, conversation history. All local.
docker run -d -p 3000:8080 \
--add-host=host.docker.internal:host-gateway \
-v open-webui:/app/backend/data \
--name open-webui \
ghcr.io/open-webui/open-webui:main
Browse to http://localhost:3000. On first load, create an admin account.
Open WebUI auto-detects Ollama if they’re on the same machine. You can also connect it to OpenAI, Anthropic, or any OpenAI-compatible endpoint — meaning you can switch between local and cloud models from the same interface.
What sets it apart from other frontends: it supports tool calling for models that handle it (Qwen 2.5, Llama 3.x, Mistral), document collections for RAG, and image generation if you have ComfyUI or Automatic1111 running. It’s not a toy — production teams run it.
3. n8n — The Automation Brain
n8n is where local AI stops being a chat interface and becomes a workflow engine. Think Zapier but self-hosted, with LLM nodes built in.
docker run -d -p 5678:5678 \
-v n8n_data:/home/node/.n8n \
--name n8n \
n8nio/n8n
Browse to http://localhost:5678.
The AI Agent node in n8n connects to your Ollama endpoint directly. Practical workflows you can build in an afternoon:
- Email triage: Email arrives → Ollama classifies priority → routes to the right folder or sends a draft reply
- Content pipeline: RSS feed → local LLM summarizes → formats → posts to CMS
- Code review hook: PR webhook → local model reviews for obvious bugs → comments on GitHub
- Data extraction: Upload PDFs → LLM extracts structured data → writes to Postgres
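The LLM step in the email-triage workflow above reduces to a constrained prompt plus defensive parsing of the model's reply. A sketch (the prompt wording and the fallback-to-normal behavior are my assumptions, not anything n8n prescribes):

```python
PRIORITIES = ("urgent", "normal", "low")

def triage_prompt(subject: str, body: str) -> str:
    """Ask a local model to classify an email; constrain output to one word."""
    return (
        "Classify this email's priority as exactly one of: urgent, normal, low.\n"
        f"Subject: {subject}\nBody: {body}\n"
        "Answer with one word only."
    )

def parse_priority(model_output: str) -> str:
    """Normalize the model's reply; fall back to 'normal' on anything unexpected."""
    word = model_output.strip().lower().rstrip(".")
    return word if word in PRIORITIES else "normal"

print(parse_priority("Urgent."))  # → urgent
```

Small local models drift from instructions more than frontier models do, so the defensive parse is what keeps the workflow from mis-routing on a chatty reply.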
For connecting n8n to money-making agent workflows, the passive income infrastructure post covers the production layer in detail.
4. LiteLLM — The Unified Proxy
Once you have Ollama plus Claude API plus maybe OpenAI, you don’t want every tool pointing at a different endpoint with different auth schemes. LiteLLM solves this: one OpenAI-compatible endpoint, all your models behind it.
pip install litellm
# litellm_config.yaml
model_list:
  - model_name: local-qwen
    litellm_params:
      model: ollama/qwen2.5:32b
      api_base: http://localhost:11434
  - model_name: claude-sonnet
    litellm_params:
      model: claude-sonnet-4-5
      api_key: sk-ant-...
  - model_name: gpt-4o
    litellm_params:
      model: gpt-4o
      api_key: sk-...
litellm --config litellm_config.yaml --port 8000
Now every tool in your stack points at http://localhost:8000. Swap models without changing application code. Add cost tracking and rate limiting across all providers from one place.
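Through the proxy, switching backends is a string change. A stdlib sketch assuming the config above is loaded, using its aliases local-qwen and claude-sonnet:

```python
import json
import urllib.request

PROXY = "http://localhost:8000/v1/chat/completions"

def proxy_request(model_alias: str, prompt: str) -> urllib.request.Request:
    """Build a chat request for the LiteLLM proxy. model_alias is a name from
    litellm_config.yaml, so moving a workload between Ollama and Anthropic
    is a one-string change in the caller."""
    body = json.dumps({
        "model": model_alias,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return urllib.request.Request(
        PROXY, data=body, headers={"Content-Type": "application/json"}
    )

# With litellm running on port 8000:
# urllib.request.urlopen(proxy_request("local-qwen", "Summarize this diff"))
# urllib.request.urlopen(proxy_request("claude-sonnet", "Review this design"))
```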
5. Dify — Agent Builder
Dify sits between “too simple to be useful” and “too complex to learn.” It’s an open-source LLM application builder with a visual workflow editor, knowledge base management, and multiple deployment targets.
git clone https://github.com/langgenius/dify.git
cd dify/docker
docker compose up -d
Browse to http://localhost. Connect it to your LiteLLM proxy and it can use any model in your stack.
Where Dify earns its place: building internal tools and client-facing agents that need a proper UI. The knowledge base (RAG) is more polished than Flowise, the API export is clean, and you can deploy agents as web widgets or standalone apps. For developers building agentic products to sell, Dify is a faster path to a demo than coding the UI from scratch.
Hardware Reference
| Setup | Cost | Best for |
|---|---|---|
| M2/M3 MacBook Pro 16GB | Already own it | 7B models, casual use |
| Mac mini M4 Pro 48GB | ~$1,400 | 32B models, small team |
| Beelink/Minisforum mini PC + 64GB RAM | ~$500-800 | Linux stack, Docker-native |
| Used workstation (RTX 3090 24GB) | ~$600-900 | GPU inference, faster than CPU |
| Raspberry Pi 5 8GB | ~$80 | Lightweight models only (3B max) |
The mini PC path is the most cost-efficient if you’re already comfortable with Linux. Apple Silicon runs local inference faster than comparable Intel/AMD machines without a discrete GPU, thanks to unified memory bandwidth, but NVIDIA GPUs still win on pure tokens/sec for larger models.
The Privacy Stack in Production
For client work or anything with sensitive data, this setup means zero data leaves your network:
Client Request
↓
Traefik (HTTPS, reverse proxy)
↓
Your application
↓
LiteLLM (model routing, auth)
↓
Ollama (local inference)
Add Traefik in front with Let’s Encrypt for TLS:
docker run -d \
-p 80:80 -p 443:443 \
-v /var/run/docker.sock:/var/run/docker.sock \
-v traefik_data:/etc/traefik \
traefik:v3.0 \
--providers.docker=true \
--entrypoints.web.address=:80 \
--entrypoints.websecure.address=:443 \
--certificatesresolvers.letsencrypt.acme.email=you@example.com \
--certificatesresolvers.letsencrypt.acme.storage=/etc/traefik/acme.json \
--certificatesresolvers.letsencrypt.acme.httpchallenge=true \
--certificatesresolvers.letsencrypt.acme.httpchallenge.entrypoint=web
Everything behind HTTPS, no external API calls, logs stay on your machine.
When to Use Cloud vs Local
Self-hosted doesn’t mean always local. The practical rule:
Use local models for:
- High-volume classification or tagging (thousands of items/day)
- Sensitive data processing
- RAG over your own documents
- Anywhere you want zero marginal cost per token
Use the Claude API for:
- Complex multi-step reasoning where quality is measurable
- Tasks that need current knowledge or web search
- User-facing features where response quality matters a lot
- Anything where local hardware can’t handle the load
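The split above can be condensed into a routing function. The thresholds and model aliases here are illustrative (matched to a LiteLLM-style config), not a recommendation:

```python
def pick_model(task: str, sensitive: bool, volume_per_day: int) -> str:
    """Route a task per the rules above: sensitive or bulk work stays local,
    complex reasoning goes to the frontier API."""
    if sensitive:
        return "local-qwen"      # data never leaves the machine
    if volume_per_day > 1000:
        return "local-qwen"      # zero marginal cost per token
    if task in ("reasoning", "research", "user-facing"):
        return "claude-sonnet"   # quality matters more than cost
    return "local-qwen"

print(pick_model("classification", sensitive=False, volume_per_day=5000))  # → local-qwen
print(pick_model("reasoning", sensitive=False, volume_per_day=50))         # → claude-sonnet
```

In practice this lives behind the LiteLLM proxy, so the rest of the stack never knows which backend answered.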
The Claude API tutorial covers the cloud side. The stack above covers the local side. Most serious setups run both.
Getting Started: The 60-Minute Path
- Install Ollama, pull qwen2.5:7b (fits in 8GB RAM): 10 min
- Stand up Open WebUI with Docker: 10 min
- Chat with a local model, verify it works: 5 min
- Install n8n, create one workflow (email or webhook trigger): 30 min
- Pull a 32B model overnight if your hardware supports it: passive
That’s the baseline. Everything else — LiteLLM, Dify, the full Docker Compose stack — layers on once you know the foundation works.