Self-Hosted AI Stack — Feb 2026

The Ultimate AI Stack

A unified guide combining the best practices from multiple AI sources. Local LLMs, MCP integrations, multi-agent orchestration, and proactive automation — all on your hardware.

LM Studio MCP Servers Agent Swarm Claude Code n8n / LangGraph
01 — Overview

The Master Plan

A production-ready AI stack built on local hardware, paid subscriptions, and open-source orchestration.

What We're Building

A layered, hybrid AI system where a local 30B model serves as the always-on brain, routing tasks between free local inference and paid cloud APIs. MCP servers act as the universal nervous system connecting everything together.

The architecture is divided into five operational layers: local inference, MCP integration, agent orchestration, coding environment, and omni-channel interfaces.

Why This Architecture

  • ✓ 70-80% of tasks run locally for free
  • ✓ Cloud APIs only used for tasks requiring frontier intelligence
  • ✓ Your data stays private on your hardware
  • ✓ No vendor lock-in — MCP standardizes everything
  • ✓ Proactive agents, not just reactive chatbots
  • ✓ Fully self-hosted orchestration (no OpenClaw costs)

Your Hardware

💻 Desktop (Primary)

GPU: NVIDIA RTX 5080
VRAM: 16 GB
OS: WSL (Ubuntu) on Windows
Role: Primary AI server & agent host

💻 MacBook (Mobile)

Chip: Apple M1 Pro
Memory: 32 GB Unified
OS: macOS
Role: Mobile assistant & secondary

Your Subscriptions

Service | What It Gives You | Role in Stack
Claude Pro | Claude models, Claude Code | Deep reasoning, coding, architecture
GPT Pro | GPT models, Codex, web browsing | Research, web browsing, fallback
Gemini Pro | Gemini models, long context | Document analysis, research
LM Studio | Local model hosting, OpenAI-compatible API | Always-on brain, routing, free inference
⚠ Subscription vs. API Billing Consumer subscriptions (Claude Pro, ChatGPT Plus, Gemini Advanced) do NOT include API credits. API billing is always separate. Your local LM Studio handles the bulk of work for free, and cloud APIs are only triggered selectively.

02 — Foundations

Key Concepts Explained

The building blocks you need to understand before everything else clicks.

🎲 Tokens — The Fundamental Unit

Tokens are chunks of text that AI models process. They're not whole words — they're building blocks that can be as short as a single character or as long as a word.

Rule of thumb: 1 token ≈ ¾ of a word. So 1,000 tokens ≈ 750 words.

Context window is how many tokens the model can process at once — its working memory. A 32K context window means ~24,000 words of combined input + output.

More context = more VRAM consumed by the KV cache. Start at 32K and increase only if needed.

📈 KV Cache — The Hidden Bottleneck

Auto-regressive generation requires LLMs to compute Keys and Values for each token to maintain context. Instead of recomputing everything each time, the KV cache stores these values for reuse.

The problem: KV cache grows linearly with context length and can consume gigabytes of VRAM. If it exceeds GPU memory, the system either crashes (OOM) or swaps to disk, degrading from dozens of tokens/sec to fractions.

Practical fix: Even if a model supports 256K tokens, constrain context_length in LM Studio to 32,768 or 65,536 to stay stable on 16GB-32GB hardware.
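For intuition, the KV-cache footprint can be estimated directly from a model's shape. The layer/head/dimension numbers below are illustrative assumptions for a 30B-class model, not an exact spec:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_value=2):
    """Two cached tensors (K and V) per layer, each sized
    [n_kv_heads, context_len, head_dim], stored in fp16 (2 bytes each)."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value

# Illustrative 30B-class shape (assumed numbers, not an exact model spec)
gb = kv_cache_bytes(n_layers=48, n_kv_heads=8, head_dim=128, context_len=32_768) / 1024**3
print(f"~{gb:.1f} GB of VRAM for the KV cache at 32K context")  # ~6.0 GB
```

Double the context and the cache doubles too, which is exactly why capping `context_length` is the first stability lever on 16GB cards.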
📦 Quantization — Making 30B Fit

Quantization is compression for AI models. It reduces the precision of model weights to make them smaller and faster, at the cost of some quality.

Quant | Quality | Size (30B) | When to Use
Q8 | Highest | ~31 GB | Massive RAM available
Q5_K_M | Great | ~21 GB | Model barely fits GPU
Q4_K_M | Sweet spot | ~18 GB | Best balance for most setups
Q3_K_M | Noticeable loss | ~14 GB | VRAM is very tight
MXFP4 | Native | ~13 GB | Purpose-built (GPT-OSS)
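The sizes above follow from a simple rule: parameters times bits per weight. A quick sketch; the effective bits-per-weight figures are rough estimates chosen to match the table, since K-quants mix precisions internally:

```python
def quant_size_gb(params_billion, bits_per_weight):
    """Approximate model size: parameters x bits per weight, converted to GB.
    K-quants mix precisions, so effective bits/weight exceed the name's number."""
    return params_billion * bits_per_weight / 8

# Effective bits/weight are rough estimates, not exact spec values
for name, bits in [("Q8", 8.2), ("Q5_K_M", 5.5), ("Q4_K_M", 4.8), ("Q3_K_M", 3.7)]:
    print(f"{name}: ~{quant_size_gb(30.5, bits):.0f} GB for a 30.5B model")
```

The same rule works in reverse: divide your free VRAM by the parameter count to see the highest quant you can afford.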
⚡ Mixture of Experts (MoE) — Speed Hack

MoE is a paradigm shift: instead of activating all parameters for every token (dense), MoE selectively activates specialized subsets. A 30B model might only use 3B parameters at any moment.

Why it matters: You get the intelligence of a large model with the speed and VRAM usage of a small one. Inactive weights sit in slower system RAM while active experts reside in GPU VRAM.

GPT-OSS 20B: 21B total, 3.6B active • Qwen3-30B-A3B: 30.5B total, 3.3B active

MoE models also excel at context switching — vital for agents that pivot between code, accounting data, and conversation.
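The routing itself is simple: a small gating network scores every expert, keeps the top k, and renormalizes their weights. A toy sketch of top-2 gating over 8 experts (the logits are illustrative, pure Python):

```python
import math

def top_k_gate(logits, k=2):
    """Keep the k highest-scoring experts and softmax-normalize their weights;
    only those experts run for this token (sparse activation)."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    exps = [math.exp(logits[i]) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

# A router scoring 8 experts; only 2 are activated for this token
router_logits = [0.1, 2.0, -1.0, 0.5, 1.5, -0.3, 0.0, 0.2]
print(top_k_gate(router_logits))  # experts 1 and 4 carry all the weight
```

Only the selected experts' weights need to be resident in fast VRAM for that token, which is the source of the "30B intelligence at 3B cost" behavior.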

🔌 MCP (Model Context Protocol) — The Universal Adapter

MCP is the open standard that eliminates custom API wrappers. It standardizes how AI models connect to external data sources, tools, and workflows — like a universal "USB-C" for AI.

Architecture: Host (LM Studio / Claude Code) → Client (translates intent to JSON-RPC) → Server (connects to actual data).

Three primitives:

  • Resources: Read-only context (files, schemas, docs)
  • Tools: Executable functions (send email, query DB, create event)
  • Prompts: Pre-defined instruction templates for consistent behavior
Without MCPs: AI can only chat in text.
With MCPs: AI reads emails, posts to Discord, checks calendars, edits files, and takes real-world actions.
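On the wire, client and server exchange JSON-RPC 2.0 messages. A sketch of the two requests behind every tool use, `tools/list` and `tools/call` (the `read_file` tool name and its arguments here are illustrative):

```python
import json

def jsonrpc(method, params=None, id_=1):
    """Build a JSON-RPC 2.0 message of the kind an MCP client sends a server."""
    msg = {"jsonrpc": "2.0", "id": id_, "method": method}
    if params is not None:
        msg["params"] = params
    return json.dumps(msg)

# Discover the server's tools, then invoke one of them
print(jsonrpc("tools/list"))
print(jsonrpc("tools/call", {"name": "read_file", "arguments": {"path": "notes.md"}}, id_=2))
```

This is why "anything speaks to anything" holds: every host, client, and server agrees on this one message shape.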

03 — Model Selection

The 30B Model Roster

Choosing the right models for assistant, coding, and agentic workflows.

Model | Architecture | Context | Strengths | Use For
GPT-OSS 20B (Daily Driver) | MoE 21B / 3.6B active | 128K | Native function calling, web browsing, structured outputs, configurable reasoning | Agent brain, routing, daily chat
Qwen3-Coder-30B-A3B | MoE 30.5B / 3.3B active | 256K | Exceptional repo-level code understanding, reliable JSON tool calls | Local coding specialist
GLM-4.7-Flash | MoE 30B / 3B active | 128K | UI generation, tool execution, strong SWE-bench performance | Alternative coding model
Qwen3-30B-A3B | MoE 30.5B / 3.3B active | 131K (YaRN) | Balanced assistant + planning + light coding + tool calling | General purpose (MacBook)
Nemotron-3-Nano | MoE 30B / 3.5B active | 1M | Extreme throughput, massive context window for logs/histories | Bulk data ingestion
💡 Multi-Model Strategy You swap between models in LM Studio. GPT-OSS 20B is your default loaded model on the desktop. Switch to Qwen3-Coder for coding sessions. Run Qwen3-30B on the MacBook for mobile use. The "A3B" MoE pattern (30B total, ~3B active) is the sweet spot for fast, capable local inference.
💻

Desktop Config

GPT-OSS 20B @ MXFP4 — ~13.7GB/16GB VRAM. 999 GPU layers (fits entirely). ~42 t/s at 32K context.

🔨

Coding Config

Qwen3-Coder-30B @ Q4_K_M — ~18GB, partial CPU offload. GPU offload ~80%. ~12-15 t/s.

📱

MacBook Config

Qwen3-30B-A3B @ Q4_K_M — ~18GB/32GB unified. MLX engine. ~15-20 t/s.


04 — Setup

LM Studio Setup

Your local AI runtime — always-on, zero-cost inference backbone.

STEP 1
Download & Install LM Studio
Get the latest version from lmstudio.ai. It supports Windows, macOS, and Linux.
STEP 2
Download GPT-OSS 20B
Open LM Studio, search "openai/gpt-oss-20b" and download the ~13GB MXFP4 version.
STEP 3
Load the Model
Click the model in the sidebar and hit "Load." Verify ~42 t/s in the status bar.
STEP 4
Enable Local API Server
Go to the Developer tab, toggle the local API server ON. It runs at http://localhost:1234/v1 — this is your OpenAI-compatible endpoint.
STEP 5
Verify Everything Works
Open a chat in LM Studio and test. Then test the API endpoint from a script or curl.
🔐 Why localhost:1234 Matters This local API is the backbone of your entire system. Anything that can talk to the OpenAI API can now talk to your local model for free. MCPs, agents, scripts, your Discord bot — they all connect here. One URL, unlimited free inference.
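A minimal standard-library check of that endpoint (assumes the server is running with GPT-OSS loaded, matching Step 2):

```python
import json
import urllib.request

BASE = "http://localhost:1234/v1"

def chat_payload(prompt: str, model: str = "openai/gpt-oss-20b") -> dict:
    """Build an OpenAI-style chat request body for the local server."""
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}

def ask(prompt: str) -> str:
    """POST to the local LM Studio endpoint and return the reply text."""
    req = urllib.request.Request(
        f"{BASE}/chat/completions",
        data=json.dumps(chat_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Uncomment with the server running:
# print(ask("Reply with exactly: OK"))
```

No API key, no SDK: any HTTP client that can send this payload gets free inference.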

Recommended Settings

Setting | Value | Why
Context Length | 32,768 (push to 60K) | Balances memory and capability
GPU Layers | 999 (all on GPU) | GPT-OSS fits in 16 GB
VRAM Usage | ~13.7 GB | Leaves ~2.3 GB for KV cache
Reasoning Effort | Medium (default) | Low for routing, High for complex tasks
Temperature | 0.3-0.5 (agents) / 0.7 (code) | Lower = more deterministic
CLI
# Download & load your daily driver
lms get openai/gpt-oss-20b
lms load openai/gpt-oss-20b

# Download the coding model
lms get lmstudio-community/Qwen3-Coder-30B-A3B-Instruct-GGUF

# Server management
lms server start
lms status
lms server stop

05 — Integration

MCP Server Setup

The nervous system that connects your AI to the real world.

📁
Filesystem
Read/write files on your machine. Agent access to project files and logs.
💬
iMessage
Read and send iMessages. Auto-reply to texts with AI-generated responses.
🎮
Discord
Send/read messages via webhooks. Morning briefings and command interface.
🐙
GitHub
Interact with repos, PRs, issues. Code agent workflows and automation.
🌐
Browser / Web
Search and scrape the web. Research agent with Brave Search or Firecrawl.
📅
Google Calendar
Read/write calendar events. Scheduling and daily briefings.
Gmail
Read and draft emails. Email summaries and auto-drafts.
🗃
SQLite / Database
Query databases. Accounting tools data access and web app backends.
JSON — claude_desktop_config.json (full config — copy & paste ready)
{
  "mcpServers": {
    "filesystem": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-filesystem", "/home/your-user/projects"]
    },
    "brave-search": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-brave-search"],
      "env": {
        "BRAVE_API_KEY": "YOUR_BRAVE_API_KEY"
      }
    },
    "github": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-github"],
      "env": {
        "GITHUB_PERSONAL_ACCESS_TOKEN": "YOUR_GITHUB_TOKEN"
      }
    },
    "sqlite": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-sqlite", "/path/to/your/database.db"]
    },
    "google-calendar": {
      "command": "npx",
      "args": ["-y", "@anthropic/mcp-server-google-calendar"]
    },
    "gmail": {
      "command": "npx",
      "args": ["-y", "@anthropic/mcp-server-gmail"]
    },
    "discord": {
      "command": "npx",
      "args": ["-y", "discord-mcp"],
      "env": {
        "DISCORD_WEBHOOK_URL": "YOUR_DISCORD_WEBHOOK_URL"
      }
    },
    "imessage": {
      "command": "npx",
      "args": ["-y", "imessage-mcp"]
    }
  }
}
💡 One Toolbox, Multiple Brains MCP is designed so multiple clients can connect to the same servers. Configure once, use from LM Studio, Claude Code, and any custom agent. The flow: You ask a question → LM Studio calls the MCP server → MCP returns data → Model formats the response.

Token Consumption at Scale

The Problem

If connected to 20 MCP servers, tool definitions alone can consume 15,000+ tokens before a single query is processed. This drains context and kills local model performance.

The Solution

Use code execution for dynamic tool discovery: give the agent a single Python sandbox tool and let it query available MCP schemas at runtime. This cuts token overhead by ~98%.
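A sketch of the difference, using a hypothetical registry standing in for 20 connected servers with 5 tools each:

```python
import json

# Hypothetical registry standing in for 20 MCP servers x 5 tools each
TOOLS = {
    f"server{i}.tool{j}": {"description": "stub", "inputSchema": {"type": "object"}}
    for i in range(20) for j in range(5)
}

def upfront_prompt() -> str:
    """Naive approach: paste every tool schema into the system prompt."""
    return json.dumps(TOOLS)

def discovery_prompt() -> str:
    """Code-execution approach: one sandbox tool; everything else found at runtime."""
    return "You have one tool: run_python(code). Call list_tools() to discover the rest."

def list_tools() -> list:
    """What the sandboxed agent calls when it actually needs a tool."""
    return sorted(TOOLS)

def describe_tool(name: str) -> dict:
    """Fetch one schema on demand instead of shipping all 100 up front."""
    return TOOLS[name]

print(len(upfront_prompt()), "chars up front vs", len(discovery_prompt()), "with discovery")
```

The context cost becomes proportional to the tools a task actually uses, not to the size of your toolbox.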


06 — Agents

Agent Architecture

Specialized agents, each running on the best model for its task.

🔬 Router
GPT-OSS 20B (Local)
Triages all incoming tasks, decides local vs. cloud, selects the right agent and tools.
Free
🧠 Deep Thinker
Claude Pro API
Complex reasoning, architecture decisions, multi-step analysis requiring frontier intelligence.
Subscription
💻 Coder
Claude Code + Codex
Build and ship code. Complex refactoring, testing, PR creation, and async batch tasks.
Subscription
🔎 Researcher
Gemini / GPT Pro
Web search, document analysis, summarization, long-context research tasks.
Subscription
💬 Comms
GPT-OSS 20B (Local)
iMessage auto-replies, email drafts, communication triage and response.
Free
🌅 Briefing Bot
GPT-OSS 20B (Local)
Daily morning briefing aggregation, formatting, and Discord delivery.
Free
📡 Social Media
GPT-OSS 20B (Local)
Post scheduling, content creation, trend monitoring, audience engagement.
Free
💰 Smart Routing Saves Money The Router agent running on your free local model handles 70-80% of all tasks. Only genuinely complex work gets forwarded to paid APIs. Your Claude/GPT/Gemini subscriptions last much longer.
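A toy version of that routing decision (the keyword lists are illustrative; in practice the local model itself classifies each task):

```python
LOCAL = "GPT-OSS 20B (local, free)"
CLOUD = "Claude / GPT / Gemini (paid API)"
CODE = "Claude Code / Codex"

def route(task: str) -> str:
    """Toy keyword router. A real router would ask the local model to classify
    the task; these keyword sets are illustrative stand-ins for that decision."""
    words = set(task.lower().split())
    if words & {"refactor", "bug", "implement", "pr"}:
        return CODE
    if words & {"architecture", "strategy", "legal"}:
        return CLOUD
    return LOCAL  # default: keep it free

print(route("summarize my inbox"))              # stays local
print(route("refactor the auth module"))        # goes to the coding agents
print(route("design the system architecture"))  # escalates to a frontier model
```

Note the default branch: anything not clearly complex stays on the free local model, which is where the 70-80% savings come from.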


07 — Orchestration

Self-Hosted Orchestration

n8n for visual workflows, LangGraph for code-driven multi-agent systems. No OpenClaw needed.

🛠 n8n — Visual Automation

Enterprise-grade workflow orchestrator, self-hosted via Docker. Visual node-based interface for building complex, asynchronous AI pipelines.

  • ✓ Native LangChain integration for RAG pipelines
  • ✓ Memory buffers for conversation history
  • ✓ Queue Mode (Redis + PostgreSQL) for scaling
  • ✓ Schedule triggers for cron-style automation
  • ✓ Connect to LM Studio API directly

Best For

Morning briefing pipelines
Social media scheduling
Email triage workflows
Accounting data pipelines
Multi-API orchestration

🛠 LangGraph — Code-Driven Agents

Python-based framework modeling agent behavior as a stateful graph. Nodes = actions, edges = logical flow. Supports cyclical reasoning and self-correction.

  • ✓ Reflection pattern for iterative improvement
  • ✓ Multi-agent collaboration loops
  • ✓ State management across complex workflows
  • ✓ Integrate with LM Studio for local inference
  • ✓ Ideal for self-correcting coding pipelines

Example: Self-Correcting Coder

Coder Agent writes update
Testing Agent runs code
Error? Loop back to Coder
Pass? Deploy
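That loop can be sketched in plain Python; stub agents stand in for the real Coder and Testing nodes, and LangGraph expresses the same cycle as two nodes plus a conditional edge:

```python
def self_correcting_loop(write_code, run_tests, max_rounds=3):
    """Coder writes, tester runs; failures feed back into the next attempt.
    LangGraph models the same cycle as two nodes plus a conditional edge."""
    feedback = None
    for attempt in range(1, max_rounds + 1):
        code = write_code(feedback)
        ok, feedback = run_tests(code)
        if ok:
            return code, attempt
    raise RuntimeError(f"still failing after {max_rounds} rounds: {feedback}")

# Stub agents: the first attempt fails, the second passes
attempts = iter(["broken", "fixed"])
code, rounds = self_correcting_loop(
    write_code=lambda feedback: next(attempts),
    run_tests=lambda c: (c == "fixed", None if c == "fixed" else "SyntaxError"),
)
print(code, "after", rounds, "round(s)")  # fixed after 2 round(s)
```

The `max_rounds` cap matters: without it, a model that keeps producing the same failure would loop forever.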
Feature | n8n | LangGraph
Interface | Visual / node-based | Code / Python
Best For | Linear/branching API workflows | Cyclical reasoning, self-correction
Scheduling | Built-in cron triggers | External (cron/systemd)
Scaling | Queue Mode (Redis) | Custom scaling
AI Integration | Native LangChain nodes | Native LLM integration
Deployment | Docker self-hosted | Python script/service
Learning Curve | Low (visual) | Medium (Python)

08 — Proactive

Morning Briefing System

A proactive assistant that delivers a personalized briefing to Discord every morning.

Architecture Flow

⏰ WSL Cron Job (7:00 AM Daily)
📄 Python Script — Data Aggregation
Gmail Calendar GitHub News System Health Accounting DB
🧠 LM Studio API (localhost:1234) — GPT-OSS 20B formats briefing
🎮 Discord Webhook → #morning-briefing channel

How Cron Jobs Work

None of the AI services have built-in scheduling. Cron jobs live on your machine and call the AI when it's time. The cron job is just a scheduler. The Python script is the glue between your schedule, data sources, and the AI.

Bash
# Open crontab editor
crontab -e

# Add this line (runs daily at 7:00 AM):
0 7 * * * /usr/bin/python3 /home/manuel/scripts/morning_briefing.py
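A sketch of what the briefing script's glue could look like: aggregate the sections, build one prompt for the local model, and post the result to the webhook. The section data and function names are illustrative, and the LM Studio call itself is omitted (it is the standard OpenAI-compatible request):

```python
import json
import urllib.request

def build_briefing_prompt(sections: dict) -> str:
    """Collapse the aggregated data into a single prompt for the local model."""
    body = "\n\n".join(f"## {name}\n{data}" for name, data in sections.items())
    return ("Write a concise morning briefing for Discord, grouped by section, "
            "flagging anything urgent first.\n\n" + body)

def post_to_discord(webhook_url: str, text: str) -> None:
    """Discord webhooks accept a JSON body with a 'content' field (2000-char cap)."""
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps({"content": text[:2000]}).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)

# Illustrative aggregated data; real collectors would hit Gmail, Calendar, GitHub...
prompt = build_briefing_prompt({
    "Urgent Communications": "2 flagged emails from clients",
    "Daily Itinerary": "09:30 standup, 14:00 dentist",
})
# briefing = <send prompt to http://localhost:1234/v1>
# post_to_discord(WEBHOOK_URL, briefing)
```

Cron supplies the "when", this script supplies the "what", and the local model supplies the prose.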

Briefing Categories

🚨 Urgent Communications — High-priority emails & messages
📅 Daily Itinerary — Calendar events for the day
📋 Task Priorities — GitHub PRs, issues, deadlines
📈 System Health — Server status, Docker containers
📰 News & Trends — Relevant tech and market updates
💬 Interactive Discord Bot Beyond the morning briefing, set up a Discord bot backed by your local model. Message it commands like "summarize my emails" or "what PRs need review." The interface is bidirectional — reply to the briefing and the agent takes action.

09 — Development

Coding Environment & Hybrid Routing

Claude Code, Aider, and Cline with local+cloud hybrid architecture.

🔨

Claude Code

Command-line coding agent in your terminal. Attach MCPs (filesystem, GitHub, database) so it can browse your codebase, run tests, and push code. Use for complex refactoring and architecture decisions.

🚀

Codex

Async coding agent that works in the background. Queue up work items and let Codex build while you do other things. It creates PRs when done. Ideal for batch tasks like "add error handling to all endpoints."

The Hybrid Routing Architecture

Phase 1: Local Discovery

Use the local LM Studio model for lightweight, token-heavy exploration tasks: finding files, checking git history, grepping for functions, aggregating context. Saves API credits by keeping the noise local.

Phase 2: Cloud Synthesis

Once the local agent has compiled the refined context, pass it to a frontier model (Claude Sonnet, GPT-4o) for final synthesis. Cloud tokens are spent exclusively on high-level reasoning.
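The two phases reduce to a simple contract: local explores, cloud synthesizes. A sketch with stub models standing in for the real endpoints:

```python
def hybrid_answer(question, local_explore, cloud_synthesize):
    """Phase 1 burns cheap local tokens gathering context; Phase 2 sends only
    the distilled context to the paid frontier model in one focused call."""
    context = local_explore(question)            # grep, git log, file reads...
    return cloud_synthesize(question, context)   # one expensive, focused call

# Stub models show the flow; real ones would hit localhost:1234 and a cloud API
answer = hybrid_answer(
    "Why does login fail?",
    local_explore=lambda q: "auth.py raises KeyError on missing 'token'",
    cloud_synthesize=lambda q, ctx: f"Likely cause, per local findings: {ctx}",
)
print(answer)
```

The cost asymmetry is the point: exploration may touch thousands of tokens of noise, but only the distilled summary crosses the paid boundary.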

Bash — Connect Claude Code to Local LM Studio
# Route Claude Code to your local LM Studio
export ANTHROPIC_BASE_URL=http://localhost:1234
export ANTHROPIC_AUTH_TOKEN=lmstudio

# Now Claude Code uses your local model for free
# Qwen3-Coder or GLM-4.7 can explore, read files,
# and execute terminal commands through the agent
💡 Dream Workflow: Discord to Codex Pipeline Describe a feature in Discord → Router agent picks it up → Routes to Codex → Codex builds async → GitHub PR notification when done. Hands-free development.

10 — Architecture

Full System Architecture

How every piece connects into one unified ecosystem.

👤 YOU
Discord iMessage Terminal Web Browser
🧠 ORCHESTRATOR — GPT-OSS 20B via LM Studio API
Simple Tasks
Local — FREE
Complex Tasks
Claude / GPT / Gemini
Code Tasks
Claude Code / Codex
Scheduled Tasks
Cron & n8n
🔌 MCP Layer
Discord iMessage GitHub Files Database Calendar Gmail Browser

11 — Security

Security & Guardrails

Permissions, secrets, audits — because always-on agents need safety by design.

🔒

Least Privilege

Every MCP server gets the minimum permissions it needs. Filesystem access is scoped to allowlisted directories only. No blanket access.

🛡

Compartmentalize

High-risk MCPs (iMessage, financial data) run in isolated environments. Never give one agent access to everything.

👥

Human Approval

High-impact actions (posting publicly, sending client messages, moving money) always require human confirmation first.

⚠ Critical Warnings Never install MCP servers from untrusted sources. Keep servers updated (real CVEs exist, like the filesystem path validation bypass). For business data (accounting), local-only processing isn't just a preference — it's a compliance requirement.

iMessage Safety

A hallucination could send inappropriate messages to professional contacts. Enforce Strict Draft-Only Mode: agent formulates responses, stages them in Discord/Notion for human approval before sending.

Accounting Data

Financial pipelines must keep the LLM out of arithmetic. Use local models for semantic extraction only. All math through deterministic Python/JavaScript. Full audit trail on every transaction.
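A minimal sketch of that separation: the model's (stubbed) extraction output feeds exact Decimal arithmetic, so no number ever comes out of the LLM itself:

```python
from decimal import Decimal

def total_invoice(line_items):
    """All arithmetic happens in exact Decimal; the LLM only extracts the
    fields as strings and never performs the addition itself."""
    return sum(Decimal(item["amount"]) for item in line_items)

# Pretend the local model extracted these from an invoice (illustrative data)
extracted = [
    {"desc": "Consulting", "amount": "1200.50"},
    {"desc": "Hosting", "amount": "49.99"},
]
print(total_invoice(extracted))  # 1250.49
```

Decimal (not float) is deliberate: binary floats cannot represent most cent values exactly, which is unacceptable in an audit trail.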


12 — Reference

Quick Reference

Commands, settings, and API snippets you'll use every day.

Download model: lms get openai/gpt-oss-20b
Load model: lms load openai/gpt-oss-20b
Unload model: lms unload
Check status: lms status
Start API server: lms server start
Stop API server: lms server stop
Python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:1234/v1",
    api_key="not-needed"
)

result = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(result.choices[0].message.content)
Setting | What It Does | Recommended
Context Length | Working memory size | 32,768 tokens (start here)
GPU Layers | How much runs on GPU | 999 for GPT-OSS, 60-70 for Qwen3-Coder
Temperature | Creativity level | 0.3-0.5 agents, 0.7 code, 0.8+ creative
Top P | Word-choice diversity | 0.8 Qwen, 1.0 GPT-OSS
Top K | Candidate word limit | 20 Qwen, 0 (disabled) GPT-OSS
Repetition Penalty | Prevents loops | 1.05 Qwen, 1.0 GPT-OSS
Local API: http://localhost:1234/v1
LM Studio: https://lmstudio.ai
MCP Servers: https://github.com/modelcontextprotocol/servers
GPT-OSS Docs: https://huggingface.co/openai/gpt-oss-20b
n8n Self-Host: https://docs.n8n.io/hosting/
LangGraph: https://docs.langchain.com/oss/python/langgraph/overview