How to Build Applications with Agentic Backends

The winning architecture is boring. Here's what actually ships.
14 February 2026 · Deep Thesis

The Question

You want to build an application — mobile, web, whatever — where the backend is an AI agent. Not a chatbot. An agent that reads files, runs commands, makes decisions, calls tools. Something like Donna (Eric’s personal CRM assistant running on Cursor/OpenClaw) but exposed as an API that multiple users can hit from their phones.

What’s the best architecture? Should you use Cursor’s headless CLI? Wrap Claude in a while-loop? Use a framework like LangGraph? Pay for Devin’s API? Roll your own?

We just proved cursor-agent -p -f works headless. But “works on my Mac” and “works as a product” are different questions. This thesis maps the landscape, names the real options, and picks one.

  • Players mapped: 50+, across 5 architecture layers
  • Production cases: 40+ (shipped, not demos)
  • Live debates: 7 (practitioner discourse)
  • Verdict: loop, not framework

Eric’s Stake

This isn’t academic. Eric is building multiple products that need agentic backends right now.

The contrarian angle: most discourse treats this as a framework selection problem. It’s not. It’s an architecture pattern question — and the answer is simpler than the industry wants you to believe.

I. The Five Architecture Layers

The first mistake everyone makes is treating “agentic backend” as one category. There are five distinct layers, and conflating them leads to bad decisions.

| Layer | What It Does | Leaders | Key Numbers |
|---|---|---|---|
| 1. Model API | Raw LLM with tool_use / function calling | Claude API, GPT API, Gemini | Anthropic $14B ARR[1]; OpenAI Agents SDK 18.9K stars[2] |
| 2. Agent Framework | Orchestration, memory, multi-step workflows | LangGraph, CrewAI, Mastra | LangChain 127K stars, $135M raised[3]; CrewAI 43.9K stars[4] |
| 3. IDE/CLI Agent | Code-native agent with file/terminal access | Cursor, Claude Code, Codex CLI | Cursor $1B ARR[5]; Claude Code ~$2.5B ARR contribution[6] |
| 4. Infrastructure | Sandboxing, browsers, tools, compute | E2B, Browserbase, Composio | E2B $21M raised[7]; Composio 26.5K stars, $29M raised[8] |
| 5. Runtime/Gateway | Persistent agent hosting, messaging channels | OpenClaw, AgentProtocol | OpenClaw 193K stars[9]; MCP 78.6K stars[10] |
The insight: You don’t pick one layer — you pick one from each layer that matters for your use case. The question is which layers you can skip.

II. The Scoreboard — Hard Numbers

Framework GitHub Stars Race

  • LangChain: 127K
  • AutoGen (MS): 54.5K
  • CrewAI: 43.9K
  • smolagents (HF): 25.4K
  • LangGraph: 24.7K
  • Vercel AI SDK: 21.7K
  • Mastra (YC): 20.9K
  • Pydantic AI: 14.9K

Revenue Leaders (Agentic Products)

  • Cursor (IDE): $1B ARR
  • Harvey (Legal): $195M
  • Manus (General): $100M
  • Sierra (CX): $100M
  • Devin (Code): $73M
Pattern: Every product above $50M ARR shares three traits: domain-specific, human-in-the-loop, and embedded in existing workflows. None are “general-purpose autonomous agents.”

III. The Four Architectures That Exist

Strip away marketing and there are exactly four ways people build agentic backends today. Here they are, from simplest to most complex:

Architecture A: The While-Loop (Recommended)

A single LLM in a tool-calling loop. No framework. No orchestration layer. Just: send message → model responds with tool calls → execute tools → send results back → repeat until done.
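The whole pattern fits on one page. Here is a minimal sketch in JavaScript; `callModel` and `executeTool` are injected assumptions standing in for the real model API call and your tool implementations, and the message shape is illustrative, not any vendor's exact schema:

```javascript
// Minimal agent while-loop: ask the model, run any tools it requests,
// feed results back, repeat until it answers without tool calls.
// `callModel(messages)` is assumed to return { text, toolCalls: [{ id, name, input }] }.
// `executeTool(name, input)` is assumed to return a result value.
async function runAgent(userMessage, { callModel, executeTool, maxTurns = 10 }) {
  const messages = [{ role: "user", content: userMessage }];
  for (let turn = 0; turn < maxTurns; turn++) {   // hard cap: no unlimited retries
    const reply = await callModel(messages);
    messages.push({ role: "assistant", content: reply.text, toolCalls: reply.toolCalls });
    if (!reply.toolCalls || reply.toolCalls.length === 0) {
      return reply.text;                          // no tool calls: final answer
    }
    for (const call of reply.toolCalls) {         // execute tools, append results
      const result = await executeTool(call.name, call.input);
      messages.push({ role: "tool", toolCallId: call.id, content: String(result) });
    }
  }
  throw new Error("Agent exceeded max turns");
}
```

With the real Claude API, `callModel` would wrap a `messages.create` call with a `tools` array and tool results would go back as `tool_result` content blocks, but the loop shape stays exactly this.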

Who uses this: Cursor ($1B ARR) is fundamentally this architecture[5]. Anthropic themselves recommend it: “Start by using LLM APIs directly.”[11] Braintrust formalized it as the “canonical agent architecture”[12]. Sketch.dev runs a full AI programming assistant on 9 lines of code[13]. Octomind dropped LangChain after 12 months in production for direct API calls — “Once we removed it, we could just code.”[14]

Cost: $0.05–0.30 per turn (Claude Sonnet). Latency: 2–15s per tool cycle. Complexity: ~100–300 lines of code.

Architecture B: Framework-Orchestrated

Use LangGraph, CrewAI, or Mastra to manage state, routing, multi-step workflows, and human-in-the-loop patterns.

The honest case: Harrison Chase (LangChain CEO) argues the real value is the runtime layer — durability, tracing, human-in-the-loop, resumability[15]. He’s not wrong for complex enterprise workflows. But a third of the Fortune 500 using LangChain[3] reflects enterprise inertia more than technical superiority.

When it makes sense: Multi-agent coordination, durable long-running tasks, compliance-heavy workflows with audit trails. When it doesn’t: Anything a single agent can handle in one conversation.

Architecture C: IDE/CLI-as-Backend

Use Cursor’s headless CLI or Claude Code’s SDK to run code-capable agents. This is what we just tested:

echo "" | cursor-agent -p -f --output-format json "your prompt"

The limitation: These are code-first tools. Cursor’s CLI is optimized for repo operations. Claude Code costs ~$6/dev/day[6]. Devin charges $2.25/ACU for production API access[16]. None are designed for high-concurrency multi-user app backends. They’re for dev automation and CI/CD, not customer-facing products.

Architecture D: Runtime/Gateway

Deploy a persistent agent on a runtime like OpenClaw that handles messaging channels (WhatsApp, Telegram, Discord), session management, and long-running state.

What OpenClaw gives you: WhatsApp/Telegram/Discord integration, persistent agent process, plugin system, Docker deployment. What it costs you: Known instability (message loss during gateway restarts), fast-moving project with breaking changes, CVE security concerns[9].

IV. The 7 Live Debates

Debate 1: Frameworks vs “Just Use the API”

Pro Framework

  • LangGraph adds durability, tracing, human-in-the-loop — things you’ll rebuild anyway[15]
  • A third of the Fortune 500 use LangChain[3] — enterprise needs these abstractions
  • Multi-agent coordination is genuinely hard to hand-roll

Anti Framework

  • Octomind ripped out LangChain after 12 months: “Once we removed it, we could just code”[14]
  • Anthropic’s own guide says start with direct API calls[11]
  • HN consensus is near-unanimous against LangChain specifically
  • Sketch.dev: a full agent in 9 lines[13]

Our position: The anti-framework camp is right for most applications. The pro-framework camp is right for complex enterprise workflows. The mistake is thinking you need to decide upfront. Start with a while-loop. Add framework when you hit a specific wall (durability, multi-agent, audit trails). Most people never hit that wall.

Debate 2: Cursor/Claude Code as App Backend

Cursor shipped a Background Agents API and headless CLI. Claude Code has a full programmatic SDK. This raises the question: can you use coding agents as general-purpose app backends?

What works: Dev automation, CI/CD integration, cron-triggered tasks, internal tools.

What doesn’t: High-concurrency customer-facing apps. These tools are per-developer priced, code-optimized, and single-tenant by design; Cursor CLI is neither designed nor priced for running 1,000 concurrent user sessions.

Our position: Use Cursor CLI for your own automation (Eric’s prmupdate, deploys, research runs). Don’t use it as the production backend for a multi-user app.

Debate 3: MCP — Standard or Hype?

Model Context Protocol has 78.6K stars and adoption from every major vendor[10]. But a 623-point HN post called out the spec as apparently LLM-generated[17], argued the transport layer should be WebSockets rather than SSE-on-SSE, said security is an afterthought, and pointed to only 16% task completion on benchmarks.

Our position: MCP is winning by default, not by merit. As one HN commenter said: “I take a bad standard that can evolve, over no standard at all.” It moved to Linux Foundation governance. Build MCP-compatible tools. Don’t bet your architecture on MCP internals being stable.

Debate 4: Cost Reality

The biggest surprise in our research. A single “agentic” user request triggers 8–15 internal LLM calls. One team budgeted $4K/month and hit $11.2K in 3 weeks because of recursive loops[18].

| Task Type | Cost/Request | Calls/Request |
|---|---|---|
| Simple chat + 1 tool | $0.02–0.06 | 2–3 |
| Research task | $0.15–0.50 | 5–10 |
| Code audit | $1.00–5.85 | 10–25 |
| Autonomous multi-step | $2.00–15.00+ | 15–50+ |

Rule: Always hard-cap decision loops. No agent gets unlimited retries. Budget per-task, not per-month.
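The per-task budgeting rule is a few lines of code. A sketch of a budget guard that every model call books its cost against; the default per-million-token prices here are illustrative placeholders, not real rates:

```javascript
// Per-task budget guard: tracks spend across all model calls in one task
// and throws instead of letting a recursive loop run past the cap.
// Default prices are illustrative placeholders, not real vendor rates.
class TaskBudget {
  constructor(maxUsd) {
    this.maxUsd = maxUsd;
    this.spentUsd = 0;
  }
  // Record one model call's token usage; throws once the task would exceed its cap.
  charge({ inputTokens, outputTokens }, pricePerMTok = { input: 3, output: 15 }) {
    const cost =
      (inputTokens / 1e6) * pricePerMTok.input +
      (outputTokens / 1e6) * pricePerMTok.output;
    if (this.spentUsd + cost > this.maxUsd) {
      throw new Error(
        `Task budget exceeded: $${(this.spentUsd + cost).toFixed(4)} > $${this.maxUsd}`
      );
    }
    this.spentUsd += cost;
    return this.spentUsd;
  }
}
```

A turn cap bounds the number of calls; the budget bounds the dollars. You want both.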

Debate 5: Stateful vs Stateless

“Just vector DB it” fails at scale (mutation, deletion, selective recall). The production pattern emerging: structured JSON state + summarize-and-compress for context window. For durable long-running tasks: Temporal or Inngest. Write paths matter more than read paths.
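The summarize-and-compress pattern can be sketched as a pure function over structured state. `summarize` is injected — in production it would be one cheap LLM call — so the compression policy itself stays testable; the state shape and thresholds are assumptions for illustration:

```javascript
// Structured-state compression: keep a rolling summary plus the last N raw turns.
// `summarize(existingSummary, oldTurns)` is injected; in production it would be
// a cheap LLM call that folds the old turns into the summary text.
function compressHistory(state, { keepLast = 6, maxTurns = 20, summarize }) {
  if (state.turns.length <= maxTurns) return state;   // under budget: no-op
  const toFold = state.turns.slice(0, state.turns.length - keepLast);
  return {
    summary: summarize(state.summary, toFold),        // fold old turns into summary
    turns: state.turns.slice(-keepLast),              // keep only recent raw turns
  };
}
```

Because the state is plain JSON, mutation and deletion are ordinary writes — the failure modes of “just vector DB it” don’t apply.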

Debate 6: The “Agentic” Label

HFS Research coined “agentic-washing.”[19] AI21 called most agents “glorified if-else statements.” Klarna’s AI customer service reversal is the canonical proof point — the CEO admitted quality tanked and re-hired humans[20]. 78% of enterprise agent pilots didn’t reach production[21].

Debate 7: Build vs Buy

The market is bifurcating: simple agent tasks (chat + tools) are commoditizing toward direct API calls. Complex orchestration (multi-agent, durable workflows) is consolidating around LangGraph and Temporal. The middle ground — “framework that wraps a simple agent” — is dying.

V. The Graveyard — What Failed

| Architecture | What Happened | Lesson |
|---|---|---|
| Fully autonomous agents (AutoGPT, BabyAGI) | Loops burned money, produced garbage | Humans in the loop aren’t optional |
| Multi-agent communication | “Politeness loops” — agents thanking each other, $4K wasted[18] | Agent-to-agent is a research problem, not a product pattern |
| AI hardware (Humane Pin, Rabbit R1) | Both dying — no distribution, no ecosystem | Hardware needs a software moat first |
| Framework-heavy builds | Octomind: 12 months on LangChain, then ripped it out[14] | Frameworks add friction before they add value |
| Foundation model plus agent | Adept acqui-hired by Amazon | Don’t train your own model and build the agent |
| Standalone agent products | OpenAI Operator folded into ChatGPT after 7 months | Agent products get absorbed into platforms |

VI. Red Team — Stress-Testing the “Just a While-Loop” Thesis

For: The Loop Wins

  • Cursor ($1B ARR) is this pattern at scale
  • Anthropic, model creator, recommends it
  • Production teardowns all converge here
  • 100–300 lines of code. Full control. No dependency risk
  • Continuous Claude (a Bash while loop) outperforms most frameworks

Against: The Loop Isn’t Enough

  • No built-in durability — if your server crashes mid-task, state is lost
  • No audit trail, tracing, or replay without building it yourself
  • Human-in-the-loop patterns need coordination your loop doesn’t have
  • Multi-user concurrency requires session management you’ll hand-roll
  • Enterprise compliance (SOC2, HIPAA) needs the kind of logging LangSmith gives you

Grade: 70% insight, 30% oversimplification. The while-loop thesis is correct for most builders today. But it understates the operational overhead. You won’t use a framework for the LLM orchestration — but you’ll want infrastructure for observability, session management, and crash recovery. Those are infrastructure problems, not framework problems.

VII. The Decision Framework

Stop asking “which framework.” Ask these three questions:

Q1: How many users?

| Users | Architecture | Deploy |
|---|---|---|
| Just me | Cursor CLI headless (cursor-agent -p -f) | Your Mac + cron/webhook |
| 1–10 (pilot) | While-loop + Claude API + your tools | Railway / Fly.io / VPS |
| 10–1000 | While-loop + Claude API + session store (Redis/Supabase) | Railway / Fly.io |
| 1000+ | While-loop + Claude API + Temporal/Inngest for durability | AWS/GCP with auto-scaling |

Q2: What messaging channel?

| Channel | Best Option | Notes |
|---|---|---|
| WhatsApp | OpenClaw (free) or Twilio + custom (paid) | OpenClaw has gateway instability; Twilio is reliable but $$$ |
| Web / mobile app | Custom API (Express/Fastify) | Full control, cleanest architecture |
| Telegram | Bot API — free, reliable, well-documented | Easiest channel to start with |
| Discord | Discord.js + your agent loop | Good for communities |

Q3: What tools does the agent need?

| Need | Best Option | Cost |
|---|---|---|
| Read/write files | Direct filesystem or S3 | Free / pennies |
| Run code | E2B sandboxes | $21/mo base[7] |
| Browse web | Browserbase / Stagehand | $99/mo[22] |
| Many external APIs | Composio (800+ toolkits) | Free tier available[8] |
| Custom tools | MCP servers (build your own) | Free (open protocol) |

VIII. What This Means for Eric

The One Thing That Changes Everything

Claude’s extended thinking + tool_use becoming real-time. Today there’s a ~2–5s latency per tool cycle. When that drops to sub-second (inference optimization + edge deployment), the “agent as backend” becomes indistinguishable from a traditional API. That’s when agents stop being a UX compromise and start being the default architecture. Timeline: 12–18 months.

The Recommended Stack for Eric’s Projects

Donna / Beans Family PA / Sourcy (multi-user, customer-facing):
  • Model: Claude Sonnet 4 via API (tool_use)
  • Pattern: While-loop (200 lines of JS)
  • State: Supabase (you already have it) for conversations + structured JSON
  • Channel: WhatsApp via Twilio (reliable) or OpenClaw (free, riskier)
  • Identity: SOUL.md + AGENTS.md loaded as system prompt per user
  • Deploy: Railway or Fly.io ($5–20/mo)
  • Cost: ~$0.05–0.15 per interaction. At 100 daily users × 5 interactions = $25–75/day
Eric-only automation (prmupdate, research, deploys):
  • Tool: cursor-agent -p -f (already working)
  • Trigger: cron, webhook, or manual from phone via simple HTTP wrapper
  • Deploy: Your Mac (it’s already on)

Verdict

The winning architecture for agentic backends is a while-loop calling a frontier model with tool_use. Not a framework. Not an IDE-as-backend. Not a multi-agent swarm. A loop.

Every product above $50M ARR uses this pattern. Cursor is this pattern. Anthropic recommends this pattern. Teams that started with frameworks are ripping them out.

The framework layer has value for enterprise orchestration (durability, audit trails, multi-agent). But 90% of builders — including Eric — don’t need it. Build the loop, add infrastructure (observability, session management) when you need it, and invest your real engineering time in the thing no framework gives you: domain-specific conversation design, tools, and data.

The moat is not the agent architecture. The moat is the SOUL.md.

One-sentence version: “The best agentic backend is 200 lines of code calling Claude with your tools — everything else is a premature abstraction.”

Sources

[1] Anthropic revenue figures — $14B ARR, $380B valuation (Feb 2026)
[2] OpenAI Agents SDK — 18.9K GitHub stars, 9.3M monthly PyPI downloads
[3] LangChain / LangSmith — 127K stars, $135M raised at $1.25B valuation, 1/3 Fortune 500
[4] CrewAI — 43.9K stars, $18M raised
[5] Cursor / Anysphere — $1B+ ARR, $29.3B valuation
[6] Claude Code — ~$6/dev/day, full programmatic SDK, contributing to $2.5B ARR
[7] E2B — 10.9K stars, $21M Series A, used by Perplexity + HuggingFace
[8] Composio — 26.5K stars, $29M raised, 800+ toolkits
[9] OpenClaw (open-source agent runtime) — 193K stars, CVE security concerns, gateway instability
[10] MCP Servers repo — 78.6K stars, Linux Foundation governance, de facto tool interop standard
[11] Anthropic: Building Agents — “Start by using LLM APIs directly”
[12] Braintrust: Canonical Agent Architecture — while-loop formalized as the standard pattern
[13] Sketch.dev — full AI programming assistant in 9 lines of code
[14] Octomind: Why We No Longer Use LangChain — 12-month production teardown
[15] Harrison Chase on LangGraph runtime value — durability, tracing, human-in-the-loop as the real product
[16] Devin pricing — $20/mo entry (was $500), $2.25/ACU production, $10.2B valuation
[17] HN: MCP spec criticism — 623 points, spec called LLM-generated, transport layer concerns
[18] Reddit r/LocalLLaMA cost reports — $4K budget hit $11.2K in 3 weeks from recursive agent loops
[19] HFS Research: Agentic Washing — coined the term for fake agent claims
[20] Klarna AI reversal — CEO admitted quality tanked, re-hired human agents
[21] Enterprise agent pilot failure rate — 78% didn’t reach production
[22] Browserbase pricing — $99/mo base, Stagehand 21K stars, $40M raised