Straight-Clawed: Why We Run Cloud Models for Agents and Local Models for Everything Else


| Tier | Model | Provider | Params | MMLU-Pro | SWE-Bench | HLE* | Cost (in/out per 1M) |
|---|---|---|---|---|---|---|---|
| ☁️ Frontier | GPT-5.2 Pro | OpenAI / Codex | ~200B+ | 88.7% | 65.8% | | $10 / $30 |
| | Claude Opus 4.6 | Anthropic | ~200B+ | 88.2% | **72.5%** | **53.1%** | $15 / $75 |
| | Gemini 3.1 Pro | Google | ~200B+ | **89.8%** | 63.2% | 51.4% | $1.25 / $5 |
| | Grok 4 Heavy | xAI | ~200B+ | 86.4% | 61.0% | 50.0% | $3 / $15 |
| | Claude Sonnet 4 | Anthropic | ~70B | 84.0% | 55.0% | | $3 / $15 |
| 🇨🇳 Open-Source | DeepSeek V3.2 | DeepSeek | MoE ~200B | **85.9%** | **77.8%** | | $0.28 / $1.10 |
| | Qwen 3.5 | Alibaba | ~70B | 84.6% | 62.5% | | $0.50 / $2 |
| 🌐 Open-Weight | Llama 4 Maverick | Meta | ~70B | **83.2%** | **55.8%** | | Free |
| | Mistral 3 | Mistral AI | ~70B | 82.8% | 54.1% | | $1 / $3 |
| | Gemma 3 27B | Google | 27B | ~75% | ~30% | | Free |
| 🏠 Our Mini | Gemma 3 4B | Google | 4B | ~58% | | | Free |
| | nomic-embed-text | Nomic AI | 137M | | | | Free |
| | whisper-small-mlx | OpenAI/MLX | 244M | | | | Free |

Figure 1: The Model Landscape — February 2026. Frontier cloud models (top) dominate agent reasoning benchmarks. Chinese open-source models match or beat them on coding at a fraction of the cost. Open-weight models are viable for batch work. Our local models (bottom) handle transcription, embeddings, and classification — we run Opus for agent reasoning only. HLE = Humanity’s Last Exam (with tools). Bolded scores = category leader. Costs are API pricing; “Free” = open-weight, you provide compute.

If you’re trying to run an AI agent on local models alone, I have bad news: it’s not going to work. Not yet. Not unless you’re sitting on $30K of Apple Silicon or a rack of A100s.

I keep having this conversation — most recently with a friend who has serious GPU hardware from crypto mining days. He’s trying to load open-source models onto his rig and use them as the brain for an OpenClaw agent deployment. It keeps breaking. He’s frustrated. And the answer is simpler than he wants it to be.

Use the frontier cloud models for agent reasoning. Use local models for everything else.

Here’s why, and here’s how we actually run it at wade.digital.

The Agent Reasoning Problem

An AI agent isn’t a chatbot. It’s a system that reads files, makes decisions, runs commands, sends messages, and holds context across sessions. The reasoning required to do this reliably — tool selection, multi-step planning, error recovery, context management — is at the absolute frontier of what language models can do.

Look at the figure above. The gap between frontier models and everything else isn’t about raw knowledge — Gemma 3 27B scores respectably on MMLU-Pro. It’s about reliability under tool use. SWE-Bench Verified measures whether a model can actually do software engineering tasks end-to-end. Claude Opus leads the closed frontier models at 72.5%, and DeepSeek V3.2 tops the whole table at 77.8%. Gemma 3 27B drops to ~30%. By the time you get to 4B models, tool-use reliability falls off a cliff.

An agent needs to correctly format tool calls, handle errors gracefully, maintain context across dozens of turns, and make judgment calls about when to act vs. when to ask. Frontier models do this consistently. Local models do it sometimes — and “sometimes” means your agent breaks at 2 AM and you wake up to a mess.

We run Claude Opus as our primary agent model through Anthropic’s API. For client deployments, we use OpenAI’s Codex platform. Not because local models are bad — they’re genuinely impressive for their size. But agent reasoning is the one workload where “95% reliable” isn’t good enough. You need 99%+, and right now only frontier models deliver that.

A Note on the Chinese Models

DeepSeek V3.2 deserves special attention. It achieves the highest SWE-Bench score of any model at 77.8%, at roughly one-thirtieth the cost of GPT-5.2 Pro. Qwen 3.5 from Alibaba is similarly impressive. These models are forcing every Western lab to reconsider pricing.

For agent reasoning specifically, the jury is still out. Tool-use reliability and instruction following in agentic contexts haven’t been benchmarked as rigorously as coding tasks. But the trajectory is clear: the cost of frontier-grade intelligence is collapsing. What costs $75/M output tokens today will cost $5 within a year.

Where Local Models Shine

Here’s the thing the local-model evangelists get right: you don’t need Opus for everything. Most of the compute in a well-designed agent pipeline isn’t agent reasoning — it’s grunt work. And local models eat grunt work for breakfast.

Our actual pipeline:

Transcription — mlx-whisper (Local, GPU)

Every voice memo, podcast, and video gets transcribed on a Mac mini using mlx-whisper with Metal acceleration. A 35-minute recording transcribes in ~90 seconds. No audio leaves the machine. No API cost. No privacy concerns.

Audio in → mlx-whisper (local GPU) → timestamped transcript → file
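
A minimal sketch of that step, assuming the mlx-whisper Python package; the Hugging Face repo name below is our best guess and may differ on your machine:

```python
# Sketch only: transcribe one file with mlx-whisper on Apple Silicon.
# Assumes `pip install mlx-whisper`; the repo name is an assumption.
import mlx_whisper

result = mlx_whisper.transcribe(
    "voice-memo.m4a",
    path_or_hf_repo="mlx-community/whisper-small-mlx",
)

# The result carries full text plus per-segment timestamps.
for seg in result["segments"]:
    print(f"[{seg['start']:7.1f}s - {seg['end']:7.1f}s] {seg['text'].strip()}")
```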

Embeddings — nomic-embed-text via Ollama (Local)

Our semantic memory search uses nomic-embed-text (137M parameters, 274MB on disk) running through Ollama. Every memory file, daily log, and project note gets embedded locally. Search queries hit local vectors. Zero API cost, instant response, fully private.

Text → nomic-embed-text (Ollama) → vector → local search index
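
A sketch of the embedding call, assuming a local Ollama server on its default port with the model already pulled:

```python
# Sketch: embed a chunk of text against a local Ollama server.
# Assumes `ollama pull nomic-embed-text` and the default port 11434.
import requests

def embed(text: str) -> list[float]:
    resp = requests.post(
        "http://localhost:11434/api/embeddings",
        json={"model": "nomic-embed-text", "prompt": text},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["embedding"]

vector = embed("Notes from the planning call")
print(len(vector))  # nomic-embed-text returns 768-dimensional vectors
```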

Classification — Gemma 3 4B via Ollama (Local)

Our feed digest system classifies ~100 RSS items per run through a local Gemma 3 4B model. It sorts articles into “must-read,” “high-interest,” and “skip” tiers based on interest profiles. The model is small enough to run alongside everything else without GPU contention.

RSS items → Gemma 3 4B (Ollama) → tier classification → digest
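
Roughly what one of those classification calls looks like, as a sketch; the model tag and prompt wording are illustrative:

```python
# Sketch: classify one RSS item into a tier with a local Gemma 3 4B model.
# Assumes `ollama pull gemma3:4b`; tag name may differ on your install.
import requests

TIERS = {"must-read", "high-interest", "skip"}

def classify(title: str, summary: str) -> str:
    prompt = (
        "Classify this article for a reader interested in AI agents and "
        "local inference. Reply with exactly one word: must-read, "
        f"high-interest, or skip.\n\nTitle: {title}\nSummary: {summary}"
    )
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "gemma3:4b", "prompt": prompt, "stream": False},
        timeout=60,
    )
    answer = resp.json()["response"].strip().lower()
    return answer if answer in TIERS else "skip"  # fail closed on odd outputs
```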

Content Processing — Gemma 3 4B (Local)

Tagging vault clippings, summarizing articles for notes, generating metadata — all local. These are pattern-matching tasks where a 4B model performs adequately and the volume makes API calls expensive.
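
For the tagging and metadata work, a sketch using Ollama's JSON output mode; the field names and prompt are illustrative, not our exact schema:

```python
# Sketch: pull tags and a one-line summary out of a clipping as JSON.
# Uses Ollama's JSON output mode; model tag and keys are examples only.
import json
import requests

def tag_clipping(text: str) -> dict:
    prompt = (
        "Return JSON with keys 'tags' (3-5 lowercase strings) and 'summary' "
        f"(one sentence) for this clipping:\n\n{text[:4000]}"
    )
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "gemma3:4b", "prompt": prompt, "format": "json", "stream": False},
        timeout=60,
    )
    return json.loads(resp.json()["response"])
```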

Agent Reasoning — Claude Opus (Cloud)

The actual agent — reading context, selecting tools, making decisions, writing responses, managing state across sessions — runs on Opus via Anthropic’s API. This is the only workload that touches the cloud, and it’s the only one that needs to.
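
For the shape of a single cloud turn, here is a sketch against the Anthropic Python SDK. The model id and the tool are placeholders, and the real agent loop does far more than this (it feeds tool results back, recovers from errors, and persists state across sessions):

```python
# Sketch only: one agent turn with tool definitions via the Anthropic API.
# Model id and tool are placeholders, not our production configuration.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

tools = [{
    "name": "read_file",  # placeholder tool
    "description": "Read a UTF-8 text file from the agent workspace.",
    "input_schema": {
        "type": "object",
        "properties": {"path": {"type": "string"}},
        "required": ["path"],
    },
}]

response = client.messages.create(
    model="claude-opus-4-6",  # placeholder id; use whichever Opus version you run
    max_tokens=1024,
    tools=tools,
    messages=[{"role": "user", "content": "Review today's memory log and flag anything actionable."}],
)

for block in response.content:
    if block.type == "tool_use":
        print("agent wants:", block.name, block.input)
    elif block.type == "text":
        print(block.text)
```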

The Math

Here’s why this architecture matters economically:

A typical day in our system:

  • ~50 transcription jobs (voice memos, podcast clips): $0 (local)
  • ~1,000 embedding operations (memory search): $0 (local)
  • ~200 classification calls (feed digest, tagging): $0 (local)
  • ~100 agent reasoning turns (actual decision-making): variable (Anthropic API)

If we ran everything through Opus, we’d burn through API credits in days. By routing 90% of compute to local models, the cloud budget goes entirely toward the workload that actually needs it.

The Hardware Reality

“But I have GPUs!” Sure. Here’s the problem:

Running a 70B parameter model (the minimum for borderline agent reasoning) requires:

  • ~40GB of VRAM at 4-bit quantization
  • ~140GB of VRAM at 16-bit precision
  • Sustained throughput of 30+ tokens/second for acceptable response times
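
The arithmetic behind those memory figures is just parameter count times bytes per weight; a quick sketch (KV cache and activation overhead ignored, which is why the practical number is a bit higher):

```python
# Parameter memory = parameter count x bytes per weight. KV cache and
# activations add more on top, so ~35 GB becomes "~40GB" in practice.
params = 70e9
print(f"4-bit : {params * 0.5 / 1e9:.0f} GB")   # ~35 GB
print(f"16-bit: {params * 2.0 / 1e9:.0f} GB")   # ~140 GB
```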

A single RTX 4090 (24GB VRAM) only fits a 70B model if you quantize below 4-bit or spill layers to system RAM, and either way it’s slow — maybe 10-15 tokens/second. That’s painful for an interactive agent. You’d need 2-3 cards for tolerable speed, and even then you’re getting quantized 70B performance, not Opus performance.

An M4 Max Mac Studio (128GB unified memory) can run 70B models at decent speed because the entire model fits in unified memory without aggressive quantization. That’s the $4K+ option.

For comparison: Claude Opus via API costs roughly $0.02 per complex agent turn. You’d need to make ~200,000 agent turns to justify the hardware cost of a Mac Studio dedicated to local inference. At 100 turns/day, that’s over 5 years to break even — and by then the models will have changed three times.
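
The break-even arithmetic, using the figures above:

```python
# Break-even for dedicated local hardware, using the numbers in this section.
hardware_usd = 4000       # Mac Studio class machine
cost_per_turn = 0.02      # rough Opus API cost per complex agent turn
turns_per_day = 100

breakeven_turns = hardware_usd / cost_per_turn         # 200,000 turns
breakeven_years = breakeven_turns / turns_per_day / 365
print(f"{breakeven_turns:,.0f} turns, ~{breakeven_years:.1f} years")  # ~5.5 years
```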

The economics are clear: rent the frontier, own the grunt work.

What We Actually Run

Our full stack on a single M4 Mac mini (base model, 24GB RAM):

| Service | Model | Location | Purpose |
|---|---|---|---|
| Agent reasoning | Claude Opus 4 | Anthropic API | Decision-making, tool use, responses |
| Client agents | Codex (GPT-5.2) | OpenAI API | Client hatchling deployments |
| Transcription | whisper-small-mlx | Local GPU | Voice memos, podcasts, video |
| Embeddings | nomic-embed-text | Ollama (local) | Semantic memory search |
| Classification | Gemma 3 4B | Ollama (local) | Feed digest, content tagging |
| Local completions | Gemma 3 4B | Ollama (local) | Summaries, metadata, formatting |
| Image generation | DALL-E 3 | OpenAI API | Blog graphics, assets |

Total local model footprint: ~4GB. Leaves plenty of headroom for the agent process, Ollama, and everything else running on the machine.
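
If it helps to see that table as configuration, here is roughly the routing logic as a static map. Backend names and model ids are illustrative only, not a real OpenClaw config format:

```python
# Illustrative routing map for the stack above; names and ids are examples,
# not an actual config schema from any framework.
ROUTES: dict[str, dict[str, str]] = {
    "agent_reasoning":   {"backend": "anthropic", "model": "claude-opus-4"},
    "client_agents":     {"backend": "openai",    "model": "gpt-5.2-codex"},
    "transcription":     {"backend": "mlx",       "model": "whisper-small-mlx"},
    "embeddings":        {"backend": "ollama",    "model": "nomic-embed-text"},
    "classification":    {"backend": "ollama",    "model": "gemma3:4b"},
    "local_completions": {"backend": "ollama",    "model": "gemma3:4b"},
    "image_generation":  {"backend": "openai",    "model": "dall-e-3"},
}

def backend_for(task: str) -> dict[str, str]:
    """Look up which backend and model a workload should hit."""
    return ROUTES[task]
```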

Recommendations

If you’re deploying an OpenClaw agent:

  1. Use a frontier model for agent reasoning. We run Claude Opus (Anthropic) for our own agent and Codex (OpenAI) for client deployments. The reliability gap between frontier models and local alternatives is real, and it matters most for the workload where failures are most visible.

  2. Install Ollama for everything else. Transcription, embeddings, classification, summarization — local models handle these at zero marginal cost with full privacy.

  3. Don’t try to run sub-agents on local models. Sub-agents need the same reasoning reliability as the main agent. If your sub-agent hallucinates a tool call, you’ve created a mess that the main agent has to clean up.

  4. Match the model to the task. A 4B model is perfect for classification. It’s terrible for multi-step planning. Know the boundary.

  5. Watch the Chinese models. DeepSeek and Qwen are closing the gap fast at dramatically lower costs. When their tool-use reliability matches their benchmark scores, the economics of this whole stack shift. We’re watching closely.

  6. Measure before optimizing. Track your actual API spend (a minimal tracking sketch follows this list). If 90% of your tokens are going to summarization tasks, move those to local models. If 90% are agent reasoning, you’re already optimized — the cloud spend is justified.
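
For point 6, a minimal sketch of what tracking can look like with the Anthropic Python SDK; the prices are the Figure 1 list rates and the ledger is just a list of dicts:

```python
# Sketch: price each agent turn from the usage block on the API response.
# Prices are the Opus list rates from Figure 1; adjust for caching/batching.
PRICE_IN, PRICE_OUT = 15 / 1_000_000, 75 / 1_000_000  # USD per token

def record_turn(response, ledger: list) -> None:
    usage = response.usage  # Anthropic responses report input/output token counts
    ledger.append({
        "input_tokens": usage.input_tokens,
        "output_tokens": usage.output_tokens,
        "usd": usage.input_tokens * PRICE_IN + usage.output_tokens * PRICE_OUT,
    })
```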

The goal isn’t to eliminate cloud AI. It’s to be strategic about what goes where. Frontier reasoning in the cloud, grunt work on the edge. That’s how you build a system that’s both capable and sustainable.


We’re building AI agent infrastructure at wade.digital. If your deploy is fighting you, find us on Bluesky — we’ve probably broken it the same way and fixed it already.