A unified guide combining the best practices from multiple AI sources. Local LLMs, MCP integrations, multi-agent orchestration, and proactive automation — all on your hardware.
A production-ready AI stack built on local hardware, paid subscriptions, and open-source orchestration.
A layered, hybrid AI system where a local 30B model serves as the always-on brain, routing tasks between free local inference and paid cloud APIs. MCP servers act as the universal nervous system connecting everything together.
The architecture is divided into five operational layers: local inference, MCP integration, agent orchestration, coding environment, and omni-channel interfaces.
| Service | What It Gives You | Role in Stack |
|---|---|---|
| Claude Pro | Claude API, Claude Code | Deep reasoning, coding, architecture |
| GPT Pro | GPT-4 access, Codex, API credits | Research, web browsing, coding fallback |
| Gemini Pro | Gemini API, long context | Document analysis, research |
| LM Studio | Local model hosting, OpenAI-compatible API | Always-on brain, routing, free inference |
The building blocks you need to understand before everything else clicks.
Tokens are chunks of text that AI models process. They're not whole words — they're building blocks that can be as short as a single character or as long as a word.
Context window is how many tokens the model can process at once — its working memory. A 32K context window means ~24,000 words of combined input + output.
More context = more VRAM consumed by the KV cache. Start at 32K and increase only if needed.
Auto-regressive generation requires LLMs to compute Keys and Values for each token to maintain context. Instead of recomputing everything each time, the KV cache stores these values for reuse.
The problem: KV cache grows linearly with context length and can consume gigabytes of VRAM. If it exceeds GPU memory, the system either crashes (OOM) or swaps to disk, degrading from dozens of tokens/sec to fractions.
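A back-of-envelope sizing formula makes the growth concrete. The layer and head counts below are illustrative for a 30B-class model with grouped-query attention, not exact specs for any model in this guide:

```python
# KV cache size = 2 (K and V) x layers x KV heads x head dim
# x context length x bytes per value (2 for FP16).
def kv_cache_bytes(layers, kv_heads, head_dim, context, bytes_per_value=2):
    return 2 * layers * kv_heads * head_dim * context * bytes_per_value

# Hypothetical 30B-class model: 48 layers, 8 KV heads, head dim 128.
gb = kv_cache_bytes(48, 8, 128, 32_768) / 1024**3
print(f"~{gb:.1f} GB of KV cache at 32K context")  # doubles if context doubles
```

This linear growth is why the tables below recommend starting at 32K context and raising it only when a task demands more.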
Quantization is compression for AI models. It reduces the precision of model weights to make them smaller and faster, at the cost of some quality.
| Quant | Quality | Size (30B) | When to Use |
|---|---|---|---|
| Q8 | Highest | ~31 GB | Massive RAM available |
| Q5_K_M | Great | ~21 GB | Model barely fits GPU |
| Q4_K_M | Sweet spot | ~18 GB | Best balance for most |
| Q3_K_M | Noticeable loss | ~14 GB | VRAM is very tight |
| MXFP4 | Native | ~13 GB | Purpose-built (GPT-OSS) |
MoE is a paradigm shift: instead of activating all parameters for every token (dense), MoE selectively activates specialized subsets. A 30B model might only use 3B parameters at any moment.
GPT-OSS 20B: 21B total, 3.6B active • Qwen3-30B-A3B: 30.5B total, 3.3B active
MoE models also excel at context switching — vital for agents that pivot between code, accounting data, and conversation.
MCP is the open standard that eliminates custom API wrappers. It standardizes how AI models connect to external data sources, tools, and workflows — like a universal "USB-C" for AI.
Architecture: Host (LM Studio / Claude Code) → Client (translates intent to JSON-RPC) → Server (connects to actual data).
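The JSON-RPC layer is easy to see concretely. Below is the shape of the message an MCP client sends when the model invokes a tool; `tools/call` is the MCP method name, while the `read_file` tool and its path are illustrative examples, not taken from a specific server:

```python
import json

# JSON-RPC 2.0 request an MCP client sends to a server to invoke a tool.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "read_file",                                  # hypothetical tool
        "arguments": {"path": "/home/your-user/projects/notes.md"},
    },
}
print(json.dumps(request, indent=2))
```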
Three primitives: tools (actions the model can invoke), resources (data the model can read), and prompts (reusable templates the server exposes).
Choosing the right models for assistant, coding, and agentic workflows.
| Model | Architecture | Context | Strengths | Use For |
|---|---|---|---|---|
| GPT-OSS 20B (daily driver) | MoE 21B / 3.6B active | 128K | Native function calling, web browsing, structured outputs, configurable reasoning | Agent brain, routing, daily chat |
| Qwen3-Coder-30B-A3B | MoE 30.5B / 3.3B active | 256K | Exceptional repo-level code understanding, reliable JSON tool calls | Local coding specialist |
| GLM-4.7-Flash | MoE 30B / 3B active | 128K | UI generation, tool execution, SWE-bench performance | Alternative coding model |
| Qwen3-30B-A3B | MoE 30.5B / 3.3B active | 131K (YaRN) | Balanced assistant + planning + light coding + tool calling | General purpose (MacBook) |
| Nemotron-3-Nano | MoE 30B / 3.5B active | 1M | Extreme throughput, massive context window for logs/histories | Bulk data ingestion |
GPT-OSS 20B @ MXFP4 — ~13.7GB/16GB VRAM. 999 GPU layers (fits entirely). ~42 t/s at 32K context.
Qwen3-Coder-30B @ Q4_K_M — ~18GB, partial CPU offload. GPU offload ~80%. ~12-15 t/s.
Qwen3-30B-A3B @ Q4_K_M — ~18GB/32GB unified. MLX engine. ~15-20 t/s.
Your local AI runtime — always-on, zero-cost inference backbone.
http://localhost:1234/v1 — this is your OpenAI-compatible endpoint.

| Setting | Value | Why |
|---|---|---|
| Context Length | 32,768 (push to 60K) | Balances memory and capability |
| GPU Layers | 999 (all on GPU) | GPT-OSS fits in 16GB |
| VRAM Usage | ~13.7 GB | Leaves 2.3GB for KV cache |
| Reasoning Effort | Medium (default) | Low for routing, High for complex tasks |
| Temperature | 0.3-0.5 (agents) / 0.7 (code) | Lower = more deterministic |
```shell
# Download & load your daily driver
lms get openai/gpt-oss-20b
lms load openai/gpt-oss-20b

# Download the coding model
lms get lmstudio-community/Qwen3-Coder-30B-A3B-Instruct-GGUF

# Server management
lms server start
lms status
lms server stop
```
The nervous system that connects your AI to the real world.
{
"mcpServers": {
"filesystem": {
"command": "npx",
"args": ["-y", "@modelcontextprotocol/server-filesystem", "/home/your-user/projects"]
},
"brave-search": {
"command": "npx",
"args": ["-y", "@anthropic/mcp-server-brave-search"],
"env": {
"BRAVE_API_KEY": "YOUR_BRAVE_API_KEY"
}
},
"github": {
"command": "npx",
"args": ["-y", "@modelcontextprotocol/server-github"],
"env": {
"GITHUB_PERSONAL_ACCESS_TOKEN": "YOUR_GITHUB_TOKEN"
}
},
"sqlite": {
"command": "npx",
"args": ["-y", "@modelcontextprotocol/server-sqlite", "/path/to/your/database.db"]
},
"google-calendar": {
"command": "npx",
"args": ["-y", "@anthropic/mcp-server-google-calendar"]
},
"gmail": {
"command": "npx",
"args": ["-y", "@anthropic/mcp-server-gmail"]
},
"discord": {
"command": "npx",
"args": ["-y", "discord-mcp"],
"env": {
"DISCORD_WEBHOOK_URL": "YOUR_DISCORD_WEBHOOK_URL"
}
},
"imessage": {
"command": "npx",
"args": ["-y", "imessage-mcp"]
}
}
}
If connected to 20 MCP servers, tool definitions alone can consume 15,000+ tokens before a single query is processed. This drains context and kills local model performance.
Use code execution for dynamic tool discovery. Give the agent a single Python sandbox tool, let it query available MCP schemas at runtime. Reduces token overhead by ~98%.
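The pattern can be sketched in a few lines. The agent gets a single sandbox tool, and a helper (hypothetical name, not a real MCP API) returns schemas only for the server it is currently interested in:

```python
# Toy registry standing in for live MCP servers; in practice the sandboxed
# code would query each server's tool list at runtime instead.
TOOL_REGISTRY = {
    "filesystem": {"read_file": {"path": "string"}},
    "github": {"create_issue": {"repo": "string", "title": "string"}},
}

def list_mcp_schemas(server: str) -> dict:
    """Return tool schemas for one server, only when the agent asks."""
    return TOOL_REGISTRY.get(server, {})

# Instead of 20 servers' schemas sitting in the prompt, the agent's code
# pulls exactly the one it needs:
print(list_mcp_schemas("filesystem"))
```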
Specialized agents, each running on the best model for its task.
n8n for visual workflows, LangGraph for code-driven multi-agent systems. No OpenClaw needed.
Enterprise-grade workflow orchestrator, self-hosted via Docker. Visual node-based interface for building complex, asynchronous AI pipelines.
Python-based framework modeling agent behavior as a stateful graph. Nodes = actions, edges = logical flow. Supports cyclical reasoning and self-correction.
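The graph idea is easier to see without the library. A plain-Python sketch of the same pattern (function and node names are illustrative): each node mutates shared state and names the next node, and the check/revise edge forms the self-correction cycle:

```python
# Nodes are functions over shared state; each returns the next node's name.
def draft(state):
    state["text"] = "drft"           # deliberately flawed first draft
    return "check"

def check(state):
    return "done" if "draft" in state["text"] else "revise"

def revise(state):
    state["text"] = "draft"          # corrected on the second pass
    return "check"

def run(start="draft"):
    nodes = {"draft": draft, "check": check, "revise": revise}
    state, node = {}, start
    while node != "done":            # cycle until the check node says stop
        node = nodes[node](state)
    return state

print(run())  # loops draft -> check -> revise -> check -> done
```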
| Feature | n8n | LangGraph |
|---|---|---|
| Interface | Visual / Node-based | Code / Python |
| Best For | Linear/branching API workflows | Cyclical reasoning, self-correction |
| Scheduling | Built-in cron triggers | External (cron/systemd) |
| Scaling | Queue Mode (Redis) | Custom scaling |
| AI Integration | Native LangChain nodes | Native LLM integration |
| Deployment | Docker self-hosted | Python script/service |
| Learning Curve | Low (visual) | Medium (Python) |
A proactive assistant that delivers a personalized briefing to Discord every morning.
None of the AI services have built-in scheduling. Cron jobs live on your machine and call the AI when it's time. The cron job is just a scheduler. The Python script is the glue between your schedule, data sources, and the AI.
```shell
# Open crontab editor
crontab -e

# Add this line (runs daily at 7:00 AM):
0 7 * * * /usr/bin/python3 /home/manuel/scripts/morning_briefing.py
```
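A minimal sketch of what `morning_briefing.py` could look like, using only the standard library. The endpoint and model assume the LM Studio setup above; the webhook placeholder matches the Discord MCP config, and the prompt contents are invented:

```python
import json
import urllib.request

LMSTUDIO = "http://localhost:1234/v1/chat/completions"
WEBHOOK = "YOUR_DISCORD_WEBHOOK_URL"  # same placeholder as the MCP config

def build_prompt(events, headlines):
    return ("Write a short, friendly morning briefing.\n"
            f"Calendar: {', '.join(events)}\n"
            f"Headlines: {', '.join(headlines)}")

def ask_local(prompt):
    # LM Studio speaks the OpenAI chat-completions format.
    body = json.dumps({
        "model": "openai/gpt-oss-20b",
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    req = urllib.request.Request(
        LMSTUDIO, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]

def post_to_discord(text):
    body = json.dumps({"content": text}).encode()
    req = urllib.request.Request(
        WEBHOOK, data=body, headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req)

# Cron's entry point: post_to_discord(ask_local(build_prompt(events, news)))
print(build_prompt(["9:00 standup"], ["new local model released"]))
```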
Claude Code, Aider, and Cline with local+cloud hybrid architecture.
Command-line coding agent in your terminal. Attach MCPs (filesystem, GitHub, database) so it can browse your codebase, run tests, and push code. Use for complex refactoring and architecture decisions.
Async coding agent that works in the background. Queue up work items and let Codex build while you do other things. It creates PRs when done. Ideal for batch tasks like "add error handling to all endpoints."
Use the local LM Studio model for lightweight, token-heavy exploration tasks: finding files, checking git history, grepping for functions, aggregating context. Saves API credits by keeping the noise local.
Once the local agent has compiled the refined context, pass it to a frontier model (Claude Sonnet, GPT-4o) for final synthesis. Cloud tokens are spent exclusively on high-level reasoning.
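This split can be made explicit in a small router. Both sides speak the OpenAI chat format, so the same client code works against either endpoint; the cloud URL and task labels below are placeholders:

```python
LOCAL = "http://localhost:1234/v1"    # LM Studio, free
CLOUD = "https://api.example.com/v1"  # frontier model, paid (placeholder URL)

# Token-heavy exploration stays local; only final synthesis spends credits.
EXPLORATION = {"find-files", "grep", "git-history", "aggregate-context"}

def pick_endpoint(task_kind: str) -> str:
    return LOCAL if task_kind in EXPLORATION else CLOUD

print(pick_endpoint("grep"))        # routed to the local model
print(pick_endpoint("synthesize"))  # routed to the cloud model
```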
```shell
# Route Claude Code to your local LM Studio
export ANTHROPIC_BASE_URL=http://localhost:1234
export ANTHROPIC_AUTH_TOKEN=lmstudio

# Now Claude Code uses your local model for free
# Qwen3-Coder or GLM-4.7 can explore, read files,
# and execute terminal commands through the agent
```
How every piece connects into one unified ecosystem.
Permissions, secrets, audits — because always-on agents need safety by design.
Every MCP server gets the minimum permissions it needs. Filesystem access is scoped to allowlisted directories only. No blanket access.
High-risk MCPs (iMessage, financial data) run in isolated environments. Never give one agent access to everything.
High-impact actions (posting publicly, sending client messages, moving money) always require human confirmation first.
A hallucination could send inappropriate messages to professional contacts. Enforce Strict Draft-Only Mode: agent formulates responses, stages them in Discord/Notion for human approval before sending.
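One way to enforce the gate in code, sketched with illustrative names: the agent can only stage drafts, and `send()` hard-fails on anything a human has not explicitly approved:

```python
DRAFTS = {}

def stage(draft_id, recipient, text):
    """Agent-side: can only create drafts, never deliver them."""
    DRAFTS[draft_id] = {"to": recipient, "text": text, "approved": False}

def approve(draft_id):
    """Human-side: e.g. wired to a Discord reaction or Notion checkbox."""
    DRAFTS[draft_id]["approved"] = True

def send(draft_id):
    if not DRAFTS[draft_id]["approved"]:
        return False           # hard stop: no approval, no delivery
    # ...hand off to the messaging MCP here...
    return True

stage("d1", "client@example.com", "Following up on the invoice.")
assert send("d1") is False     # blocked until a human approves
approve("d1")
assert send("d1") is True
```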
Financial pipelines must keep the LLM out of arithmetic. Use local models for semantic extraction only. All math through deterministic Python/JavaScript. Full audit trail on every transaction.
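A minimal sketch of that separation; the extraction stub stands in for a local-model call, and the figures are made up:

```python
from decimal import Decimal

def llm_extract(text):
    # Stand-in for the local model's semantic extraction step: it returns
    # structured strings and never performs arithmetic itself.
    return {"net": "1250.00", "vat_rate": "0.19"}

def compute_invoice(fields):
    # Deterministic math only: exact decimal arithmetic, auditable inputs.
    net = Decimal(fields["net"])
    vat = (net * Decimal(fields["vat_rate"])).quantize(Decimal("0.01"))
    return {"net": net, "vat": vat, "gross": net + vat}

result = compute_invoice(llm_extract("Invoice: 1250.00 EUR net, 19% VAT"))
print(result["gross"])  # exact Decimal arithmetic, not model output
```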
Commands, settings, and API snippets you'll use every day.
```shell
lms get openai/gpt-oss-20b
lms load openai/gpt-oss-20b
lms unload
lms status
lms server start
lms server stop
```

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:1234/v1",
    api_key="not-needed",
)

result = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(result.choices[0].message.content)
```
| Setting | What It Does | Recommended |
|---|---|---|
| Context Length | Working memory size | 32,768 tokens (start here) |
| GPU Layers | How much on GPU | 999 for GPT-OSS, 60-70 for Qwen3-Coder |
| Temperature | Creativity level | 0.3-0.5 agents, 0.7 code, 0.8+ creative |
| Top P | Word choice diversity | 0.8 Qwen, 1.0 GPT-OSS |
| Top K | Candidate word limit | 20 Qwen, 0 (disabled) GPT-OSS |
| Repetition Penalty | Prevents loops | 1.05 Qwen, 1.0 GPT-OSS |
- http://localhost:1234/v1
- https://lmstudio.ai
- https://github.com/modelcontextprotocol/servers
- https://huggingface.co/openai/gpt-oss-20b
- https://docs.n8n.io/hosting/
- https://docs.langchain.com/oss/python/langgraph/overview