A unified guide combining the best practices from multiple AI sources. Local LLMs, MCP integrations, multi-agent orchestration, and proactive automation — all on your hardware.
A production-ready AI stack built on local hardware, paid subscriptions, and open-source orchestration.
A layered, hybrid AI system where a local 30B model serves as the always-on brain, routing tasks between free local inference and paid cloud APIs. MCP servers act as the universal nervous system connecting everything together.
The architecture is divided into five operational layers: local inference, MCP integration, agent orchestration, coding environment, and omni-channel interfaces.
| Service | What It Gives You | Role in Stack |
|---|---|---|
| Claude Pro | Claude API, Claude Code | Deep reasoning, coding, architecture |
| GPT Pro | GPT-4 access, Codex, API credits | Research, web browsing, coding fallback |
| Gemini Pro | Gemini API, long context | Document analysis, research |
| LM Studio | Local model hosting, OpenAI-compatible API | Always-on brain, routing, free inference |
The building blocks you need to understand before everything else clicks.
Tokens are chunks of text that AI models process. They're not whole words — they're building blocks that can be as short as a single character or as long as a word.
Context window is how many tokens the model can process at once — its working memory. A 32K context window means ~24,000 words of combined input + output.
More context = more VRAM consumed by the KV cache. Start at 32K and increase only if needed.
Auto-regressive generation requires LLMs to compute Keys and Values for each token to maintain context. Instead of recomputing everything each time, the KV cache stores these values for reuse.
The problem: KV cache grows linearly with context length and can consume gigabytes of VRAM. If it exceeds GPU memory, the system either crashes (OOM) or swaps to disk, degrading from dozens of tokens/sec to fractions.
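A back-of-envelope sizing formula makes the growth concrete. The layer and head counts below are illustrative for a 30B-class model with grouped-query attention, not exact specs for any model in this guide:

```python
# KV cache size = 2 (K and V) x layers x KV heads x head dim
# x context length x bytes per value (2 for FP16).
def kv_cache_bytes(layers, kv_heads, head_dim, context, bytes_per_value=2):
    return 2 * layers * kv_heads * head_dim * context * bytes_per_value

# Hypothetical 30B-class model: 48 layers, 8 KV heads, head dim 128.
gb = kv_cache_bytes(48, 8, 128, 32_768) / 1024**3
print(f"~{gb:.1f} GB of KV cache at 32K context")  # doubles if context doubles
```

This linear growth is why the tables below recommend starting at 32K context and raising it only when a task demands more.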
Quantization is compression for AI models. It reduces the precision of model weights to make them smaller and faster, at the cost of some quality.
| Quant | Quality | Size (30B) | When to Use |
|---|---|---|---|
| Q8 | Highest | ~31 GB | Massive RAM available |
| Q5_K_M | Great | ~21 GB | Model barely fits GPU |
| Q4_K_M | Sweet spot | ~18 GB | Best balance for most |
| Q3_K_M | Noticeable loss | ~14 GB | VRAM is very tight |
| MXFP4 | Native | ~13 GB | Purpose-built (GPT-OSS) |
MoE is a paradigm shift: instead of activating all parameters for every token (dense), MoE selectively activates specialized subsets. A 30B model might only use 3B parameters at any moment.
GPT-OSS 20B: 21B total, 3.6B active • Qwen3-30B-A3B: 30.5B total, 3.3B active
MoE models also excel at context switching — vital for agents that pivot between code, accounting data, and conversation.
MCP is the open standard that eliminates custom API wrappers. It standardizes how AI models connect to external data sources, tools, and workflows — like a universal "USB-C" for AI.
Architecture: Host (LM Studio / Claude Code) → Client (translates intent to JSON-RPC) → Server (connects to actual data).
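The JSON-RPC layer is easy to see concretely. Below is the shape of the message an MCP client sends when the model invokes a tool; `tools/call` is the MCP method name, while the `read_file` tool and its path are illustrative examples, not taken from a specific server:

```python
import json

# JSON-RPC 2.0 request an MCP client sends to a server to invoke a tool.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "read_file",                                  # hypothetical tool
        "arguments": {"path": "/home/your-user/projects/notes.md"},
    },
}
print(json.dumps(request, indent=2))
```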
Three primitives: tools (actions the model can invoke), resources (data the model can read), and prompts (reusable templates the server exposes).
Choosing the right models for assistant, coding, and agentic workflows.
| Model | Architecture | Context | Strengths | Use For |
|---|---|---|---|---|
| GPT-OSS 20B (daily driver) | MoE 21B / 3.6B active | 128K | Native function calling, web browsing, structured outputs, configurable reasoning | Agent brain, routing, daily chat |
| Qwen3-Coder-30B-A3B | MoE 30.5B / 3.3B active | 256K | Exceptional repo-level code understanding, reliable JSON tool calls | Local coding specialist |
| GLM-4.7-Flash | MoE 30B / 3B active | 128K | UI generation, tool execution, SWE-bench performance | Alternative coding model |
| Qwen3-30B-A3B | MoE 30.5B / 3.3B active | 131K (YaRN) | Balanced assistant + planning + light coding + tool calling | General purpose (MacBook) |
| Nemotron-3-Nano | MoE 30B / 3.5B active | 1M | Extreme throughput, massive context window for logs/histories | Bulk data ingestion |
GPT-OSS 20B @ MXFP4 — ~13.7GB/16GB VRAM. 999 GPU layers (fits entirely). ~42 t/s at 32K context.
Qwen3-Coder-30B @ Q4_K_M — ~18GB, partial CPU offload. GPU offload ~80%. ~12-15 t/s.
Qwen3-30B-A3B @ Q4_K_M — ~18GB/32GB unified. MLX engine. ~15-20 t/s.
Your local AI runtime — always-on, zero-cost inference backbone.
http://localhost:1234/v1 — this is your OpenAI-compatible endpoint.

| Setting | Value | Why |
|---|---|---|
| Context Length | 32,768 (push to 60K) | Balances memory and capability |
| GPU Layers | 999 (all on GPU) | GPT-OSS fits in 16GB |
| VRAM Usage | ~13.7 GB | Leaves 2.3GB for KV cache |
| Reasoning Effort | Medium (default) | Low for routing, High for complex tasks |
| Temperature | 0.3-0.5 (agents) / 0.7 (code) | Lower = more deterministic |
```shell
# Download & load your daily driver
lms get openai/gpt-oss-20b
lms load openai/gpt-oss-20b

# Download the coding model
lms get lmstudio-community/Qwen3-Coder-30B-A3B-Instruct-GGUF

# Server management
lms server start
lms status
lms server stop
```
The nervous system that connects your AI to the real world.
{
"mcpServers": {
"filesystem": {
"command": "npx",
"args": ["-y", "@modelcontextprotocol/server-filesystem", "/home/your-user/projects"]
},
"brave-search": {
"command": "npx",
"args": ["-y", "@anthropic/mcp-server-brave-search"],
"env": {
"BRAVE_API_KEY": "YOUR_BRAVE_API_KEY"
}
},
"github": {
"command": "npx",
"args": ["-y", "@modelcontextprotocol/server-github"],
"env": {
"GITHUB_PERSONAL_ACCESS_TOKEN": "YOUR_GITHUB_TOKEN"
}
},
"sqlite": {
"command": "npx",
"args": ["-y", "@modelcontextprotocol/server-sqlite", "/path/to/your/database.db"]
},
"google-calendar": {
"command": "npx",
"args": ["-y", "@anthropic/mcp-server-google-calendar"]
},
"gmail": {
"command": "npx",
"args": ["-y", "@anthropic/mcp-server-gmail"]
},
"discord": {
"command": "npx",
"args": ["-y", "discord-mcp"],
"env": {
"DISCORD_WEBHOOK_URL": "YOUR_DISCORD_WEBHOOK_URL"
}
},
"imessage": {
"command": "npx",
"args": ["-y", "imessage-mcp"]
}
}
}
If connected to 20 MCP servers, tool definitions alone can consume 15,000+ tokens before a single query is processed. This drains context and kills local model performance.
Use code execution for dynamic tool discovery. Give the agent a single Python sandbox tool, let it query available MCP schemas at runtime. Reduces token overhead by ~98%.
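The pattern can be sketched in a few lines. The agent gets a single sandbox tool, and a helper (hypothetical name, not a real MCP API) returns schemas only for the server it is currently interested in:

```python
# Toy registry standing in for live MCP servers; in practice the sandboxed
# code would query each server's tool list at runtime instead.
TOOL_REGISTRY = {
    "filesystem": {"read_file": {"path": "string"}},
    "github": {"create_issue": {"repo": "string", "title": "string"}},
}

def list_mcp_schemas(server: str) -> dict:
    """Return tool schemas for one server, only when the agent asks."""
    return TOOL_REGISTRY.get(server, {})

# Instead of 20 servers' schemas sitting in the prompt, the agent's code
# pulls exactly the one it needs:
print(list_mcp_schemas("filesystem"))
```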
Specialized agents, each running on the best model for its task.
n8n for visual workflows, LangGraph for code-driven multi-agent systems. No OpenClaw needed.
Enterprise-grade workflow orchestrator, self-hosted via Docker. Visual node-based interface for building complex, asynchronous AI pipelines.
Python-based framework modeling agent behavior as a stateful graph. Nodes = actions, edges = logical flow. Supports cyclical reasoning and self-correction.
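The graph idea is easier to see without the library. A plain-Python sketch of the same pattern (function and node names are illustrative): each node mutates shared state and names the next node, and the check/revise edge forms the self-correction cycle:

```python
# Nodes are functions over shared state; each returns the next node's name.
def draft(state):
    state["text"] = "drft"           # deliberately flawed first draft
    return "check"

def check(state):
    return "done" if "draft" in state["text"] else "revise"

def revise(state):
    state["text"] = "draft"          # corrected on the second pass
    return "check"

def run(start="draft"):
    nodes = {"draft": draft, "check": check, "revise": revise}
    state, node = {}, start
    while node != "done":            # cycle until the check node says stop
        node = nodes[node](state)
    return state

print(run())  # loops draft -> check -> revise -> check -> done
```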
| Feature | n8n | LangGraph |
|---|---|---|
| Interface | Visual / Node-based | Code / Python |
| Best For | Linear/branching API workflows | Cyclical reasoning, self-correction |
| Scheduling | Built-in cron triggers | External (cron/systemd) |
| Scaling | Queue Mode (Redis) | Custom scaling |
| AI Integration | Native LangChain nodes | Native LLM integration |
| Deployment | Docker self-hosted | Python script/service |
| Learning Curve | Low (visual) | Medium (Python) |
A proactive assistant that delivers a personalized briefing to Discord every morning.
None of the AI services have built-in scheduling. Cron jobs live on your machine and call the AI when it's time. The cron job is just a scheduler. The Python script is the glue between your schedule, data sources, and the AI.
```shell
# Open crontab editor
crontab -e

# Add this line (runs daily at 7:00 AM):
0 7 * * * /usr/bin/python3 /home/manuel/scripts/morning_briefing.py
```
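A minimal sketch of what `morning_briefing.py` could look like, using only the standard library. The endpoint and model assume the LM Studio setup above; the webhook placeholder matches the Discord MCP config, and the prompt contents are invented:

```python
import json
import urllib.request

LMSTUDIO = "http://localhost:1234/v1/chat/completions"
WEBHOOK = "YOUR_DISCORD_WEBHOOK_URL"  # same placeholder as the MCP config

def build_prompt(events, headlines):
    return ("Write a short, friendly morning briefing.\n"
            f"Calendar: {', '.join(events)}\n"
            f"Headlines: {', '.join(headlines)}")

def ask_local(prompt):
    # LM Studio speaks the OpenAI chat-completions format.
    body = json.dumps({
        "model": "openai/gpt-oss-20b",
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    req = urllib.request.Request(
        LMSTUDIO, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]

def post_to_discord(text):
    body = json.dumps({"content": text}).encode()
    req = urllib.request.Request(
        WEBHOOK, data=body, headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req)

# Cron's entry point: post_to_discord(ask_local(build_prompt(events, news)))
print(build_prompt(["9:00 standup"], ["new local model released"]))
```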
Claude Code, Aider, and Cline with local+cloud hybrid architecture.
Command-line coding agent in your terminal. Attach MCPs (filesystem, GitHub, database) so it can browse your codebase, run tests, and push code. Use for complex refactoring and architecture decisions.
Async coding agent that works in the background. Queue up work items and let Codex build while you do other things. It creates PRs when done. Ideal for batch tasks like "add error handling to all endpoints."
Use the local LM Studio model for lightweight, token-heavy exploration tasks: finding files, checking git history, grepping for functions, aggregating context. Saves API credits by keeping the noise local.
Once the local agent has compiled the refined context, pass it to a frontier model (Claude Sonnet, GPT-4o) for final synthesis. Cloud tokens are spent exclusively on high-level reasoning.
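This split can be made explicit in a small router. Both sides speak the OpenAI chat format, so the same client code works against either endpoint; the cloud URL and task labels below are placeholders:

```python
LOCAL = "http://localhost:1234/v1"    # LM Studio, free
CLOUD = "https://api.example.com/v1"  # frontier model, paid (placeholder URL)

# Token-heavy exploration stays local; only final synthesis spends credits.
EXPLORATION = {"find-files", "grep", "git-history", "aggregate-context"}

def pick_endpoint(task_kind: str) -> str:
    return LOCAL if task_kind in EXPLORATION else CLOUD

print(pick_endpoint("grep"))        # routed to the local model
print(pick_endpoint("synthesize"))  # routed to the cloud model
```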
```shell
# Route Claude Code to your local LM Studio
export ANTHROPIC_BASE_URL=http://localhost:1234
export ANTHROPIC_AUTH_TOKEN=lmstudio

# Now Claude Code uses your local model for free
# Qwen3-Coder or GLM-4.7 can explore, read files,
# and execute terminal commands through the agent
```
How every piece connects into one unified ecosystem.
Permissions, secrets, audits — because always-on agents need safety by design.
Every MCP server gets the minimum permissions it needs. Filesystem access is scoped to allowlisted directories only. No blanket access.
High-risk MCPs (iMessage, financial data) run in isolated environments. Never give one agent access to everything.
High-impact actions (posting publicly, sending client messages, moving money) always require human confirmation first.
A hallucination could send inappropriate messages to professional contacts. Enforce Strict Draft-Only Mode: agent formulates responses, stages them in Discord/Notion for human approval before sending.
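One way to enforce the gate in code, sketched with illustrative names: the agent can only stage drafts, and `send()` hard-fails on anything a human has not explicitly approved:

```python
DRAFTS = {}

def stage(draft_id, recipient, text):
    """Agent-side: can only create drafts, never deliver them."""
    DRAFTS[draft_id] = {"to": recipient, "text": text, "approved": False}

def approve(draft_id):
    """Human-side: e.g. wired to a Discord reaction or Notion checkbox."""
    DRAFTS[draft_id]["approved"] = True

def send(draft_id):
    if not DRAFTS[draft_id]["approved"]:
        return False           # hard stop: no approval, no delivery
    # ...hand off to the messaging MCP here...
    return True

stage("d1", "client@example.com", "Following up on the invoice.")
assert send("d1") is False     # blocked until a human approves
approve("d1")
assert send("d1") is True
```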
Financial pipelines must keep the LLM out of arithmetic. Use local models for semantic extraction only. All math through deterministic Python/JavaScript. Full audit trail on every transaction.
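A minimal sketch of that separation; the extraction stub stands in for a local-model call, and the figures are made up:

```python
from decimal import Decimal

def llm_extract(text):
    # Stand-in for the local model's semantic extraction step: it returns
    # structured strings and never performs arithmetic itself.
    return {"net": "1250.00", "vat_rate": "0.19"}

def compute_invoice(fields):
    # Deterministic math only: exact decimal arithmetic, auditable inputs.
    net = Decimal(fields["net"])
    vat = (net * Decimal(fields["vat_rate"])).quantize(Decimal("0.01"))
    return {"net": net, "vat": vat, "gross": net + vat}

result = compute_invoice(llm_extract("Invoice: 1250.00 EUR net, 19% VAT"))
print(result["gross"])  # exact Decimal arithmetic, not model output
```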
Commands, settings, and API snippets you'll use every day.
```shell
lms get openai/gpt-oss-20b
lms load openai/gpt-oss-20b
lms unload
lms status
lms server start
lms server stop
```

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:1234/v1",
    api_key="not-needed",
)

result = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(result.choices[0].message.content)
```
| Setting | What It Does | Recommended |
|---|---|---|
| Context Length | Working memory size | 32,768 tokens (start here) |
| GPU Layers | How much on GPU | 999 for GPT-OSS, 60-70 for Qwen3-Coder |
| Temperature | Creativity level | 0.3-0.5 agents, 0.7 code, 0.8+ creative |
| Top P | Word choice diversity | 0.8 Qwen, 1.0 GPT-OSS |
| Top K | Candidate word limit | 20 Qwen, 0 (disabled) GPT-OSS |
| Repetition Penalty | Prevents loops | 1.05 Qwen, 1.0 GPT-OSS |
- http://localhost:1234/v1
- https://lmstudio.ai
- https://github.com/modelcontextprotocol/servers
- https://huggingface.co/openai/gpt-oss-20b
- https://docs.n8n.io/hosting/
- https://docs.langchain.com/oss/python/langgraph/overview