How to Build Applications with Agentic Backends

The winning architecture is boring. Here's what actually ships.
14 February 2026 · Deep Thesis

The Question

You want to build an application — mobile, web, whatever — where the backend is an AI agent. Not a chatbot. An agent that reads files, runs commands, makes decisions, calls tools. Something like Donna (Eric’s personal CRM assistant running on Cursor/OpenClaw) but exposed as an API that multiple users can hit from their phones.

What’s the best architecture? Should you use Cursor’s headless CLI? Wrap Claude in a while-loop? Use a framework like LangGraph? Pay for Devin’s API? Roll your own?

We just proved cursor-agent -p -f works headless. But “works on my Mac” and “works as a product” are different questions. This thesis maps the landscape, names the real options, and picks one.

  • Players mapped: 50+, across 5 architecture layers
  • Production cases: 40+ (shipped, not demos)
  • Live debates: 7 (practitioner discourse)
  • Verdict: loop, not framework

Eric’s Stake

This isn’t academic. Eric is building multiple products that need agentic backends right now.

The contrarian angle: most discourse treats this as a framework selection problem. It’s not. It’s an architecture pattern question — and the answer is simpler than the industry wants you to believe.

I. The Five Architecture Layers

The first mistake everyone makes is treating “agentic backend” as one category. There are five distinct layers, and conflating them leads to bad decisions.

| Layer | What It Does | Leaders | Key Numbers |
|---|---|---|---|
| 1. Model API | Raw LLM with tool_use / function calling | Claude API, GPT API, Gemini | Anthropic $14B ARR[1]; OpenAI Agents SDK 18.9K stars[2] |
| 2. Agent Framework | Orchestration, memory, multi-step workflows | LangGraph, CrewAI, Mastra | LangChain 127K stars, $135M raised[3]; CrewAI 43.9K stars[4] |
| 3. IDE/CLI Agent | Code-native agent with file/terminal access | Cursor, Claude Code, Codex CLI | Cursor $1B ARR[5]; Claude Code ~$2.5B ARR contribution[6] |
| 4. Infrastructure | Sandboxing, browsers, tools, compute | E2B, Browserbase, Composio | E2B $21M raised[7]; Composio 26.5K stars, $29M raised[8] |
| 5. Runtime/Gateway | Persistent agent hosting, messaging channels | OpenClaw, AgentProtocol | OpenClaw 193K stars[9]; MCP 78.6K stars[10] |
The insight: You don’t pick one layer — you pick one from each layer that matters for your use case. The question is which layers you can skip.

II. The Scoreboard — Hard Numbers

Framework GitHub Stars Race

  • LangChain: 127K
  • AutoGen (MS): 54.5K
  • CrewAI: 43.9K
  • smolagents (HF): 25.4K
  • LangGraph: 24.7K
  • Vercel AI SDK: 21.7K
  • Mastra (YC): 20.9K
  • Pydantic AI: 14.9K

Revenue Leaders (Agentic Products)

  • Cursor (IDE): $1B ARR
  • Harvey (Legal): $195M
  • Manus (General): $100M
  • Sierra (CX): $100M
  • Devin (Code): $73M
Pattern: Every product above $50M ARR shares three traits: domain-specific, human-in-the-loop, and embedded in existing workflows. None are “general-purpose autonomous agents.”

III. The Four Architectures That Exist

Strip away marketing and there are exactly four ways people build agentic backends today. Here they are, from simplest to most complex:

Architecture A: The While-Loop (Recommended)

A single LLM in a tool-calling loop. No framework. No orchestration layer. Just: send message → model responds with tool calls → execute tools → send results back → repeat until done.
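The whole pattern fits on one page. Here is a minimal sketch in JavaScript; `callModel` and `executeTool` are injected assumptions standing in for the real model API call and your tool implementations, and the message shape is illustrative, not any vendor's exact schema:

```javascript
// Minimal agent while-loop: ask the model, run any tools it requests,
// feed results back, repeat until it answers without tool calls.
// `callModel(messages)` is assumed to return { text, toolCalls: [{ id, name, input }] }.
// `executeTool(name, input)` is assumed to return a result value.
async function runAgent(userMessage, { callModel, executeTool, maxTurns = 10 }) {
  const messages = [{ role: "user", content: userMessage }];
  for (let turn = 0; turn < maxTurns; turn++) {   // hard cap: no unlimited retries
    const reply = await callModel(messages);
    messages.push({ role: "assistant", content: reply.text, toolCalls: reply.toolCalls });
    if (!reply.toolCalls || reply.toolCalls.length === 0) {
      return reply.text;                          // no tool calls: final answer
    }
    for (const call of reply.toolCalls) {         // execute tools, append results
      const result = await executeTool(call.name, call.input);
      messages.push({ role: "tool", toolCallId: call.id, content: String(result) });
    }
  }
  throw new Error("Agent exceeded max turns");
}
```

With the real Claude API, `callModel` would wrap a `messages.create` call with a `tools` array and tool results would go back as `tool_result` content blocks, but the loop shape stays exactly this.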

Who uses this: Cursor ($1B ARR) is fundamentally this architecture[5]. Anthropic themselves recommend it: “Start by using LLM APIs directly.”[11] Braintrust formalized it as the “canonical agent architecture”[12]. Sketch.dev runs a full AI programming assistant on 9 lines of code[13]. Octomind dropped LangChain after 12 months in production for direct API calls — “Once we removed it, we could just code.”[14]

Cost: $0.05–0.30 per turn (Claude Sonnet). Latency: 2–15s per tool cycle. Complexity: ~100–300 lines of code.

Architecture B: Framework-Orchestrated

Use LangGraph, CrewAI, or Mastra to manage state, routing, multi-step workflows, and human-in-the-loop patterns.

The honest case: Harrison Chase (LangChain CEO) argues the real value is the runtime layer — durability, tracing, human-in-the-loop, resumability[15]. He’s not wrong for complex enterprise workflows. But a third of the Fortune 500 using LangChain[3] reflects enterprise inertia more than technical superiority.

When it makes sense: Multi-agent coordination, durable long-running tasks, compliance-heavy workflows with audit trails. When it doesn’t: Anything a single agent can handle in one conversation.

Architecture C: IDE/CLI-as-Backend

Use Cursor’s headless CLI or Claude Code’s SDK to run code-capable agents. This is what we just tested:

echo "" | cursor-agent -p -f --output-format json "your prompt"

The limitation: These are code-first tools. Cursor’s CLI is optimized for repo operations. Claude Code costs ~$6/dev/day[6]. Devin charges $2.25/ACU for production API access[16]. None are designed for high-concurrency multi-user app backends. They’re for dev automation and CI/CD, not customer-facing products.

Architecture D: Runtime/Gateway

Deploy a persistent agent on a runtime like OpenClaw that handles messaging channels (WhatsApp, Telegram, Discord), session management, and long-running state.

What OpenClaw gives you: WhatsApp/Telegram/Discord integration, persistent agent process, plugin system, Docker deployment. What it costs you: Known instability (message loss during gateway restarts), fast-moving project with breaking changes, CVE security concerns[9].

IV. The 7 Live Debates

Debate 1: Frameworks vs “Just Use the API”

Pro Framework

  • LangGraph adds durability, tracing, human-in-the-loop — things you’ll rebuild anyway[15]
  • A third of the Fortune 500 use LangChain[3] — enterprise needs these abstractions
  • Multi-agent coordination is genuinely hard to hand-roll

Anti Framework

  • Octomind ripped out LangChain after 12 months: “Once we removed it, we could just code”[14]
  • Anthropic’s own guide says start with direct API calls[11]
  • HN consensus is near-unanimous against LangChain specifically
  • Sketch.dev: a full agent in 9 lines[13]

Our position: The anti-framework camp is right for most applications. The pro-framework camp is right for complex enterprise workflows. The mistake is thinking you need to decide upfront. Start with a while-loop. Add framework when you hit a specific wall (durability, multi-agent, audit trails). Most people never hit that wall.

Debate 2: Cursor/Claude Code as App Backend

Cursor shipped a Background Agents API and headless CLI. Claude Code has a full programmatic SDK. This raises the question: can you use coding agents as general-purpose app backends?

What works: Dev automation, CI/CD integration, cron-triggered tasks, internal tools.

What doesn’t: High-concurrency customer-facing apps. These tools are per-developer priced, code-optimized, and single-tenant by design; Cursor CLI is neither designed nor priced for running 1,000 concurrent user sessions.

Our position: Use Cursor CLI for your own automation (Eric’s prmupdate, deploys, research runs). Don’t use it as the production backend for a multi-user app.

Debate 3: MCP — Standard or Hype?

Model Context Protocol has 78.6K stars and adoption from every major vendor[10]. But a 623-point HN post called out the spec as apparently LLM-generated[17], argued the transport layer should be WebSockets rather than SSE-on-SSE, said security is an afterthought, and pointed to only 16% task completion on benchmarks.

Our position: MCP is winning by default, not by merit. As one HN commenter said: “I take a bad standard that can evolve, over no standard at all.” It moved to Linux Foundation governance. Build MCP-compatible tools. Don’t bet your architecture on MCP internals being stable.

Debate 4: Cost Reality

The biggest surprise in our research. A single “agentic” user request triggers 8–15 internal LLM calls. One team budgeted $4K/month and hit $11.2K in 3 weeks because of recursive loops[18].

| Task Type | Cost/Request | Calls/Request |
|---|---|---|
| Simple chat + 1 tool | $0.02–0.06 | 2–3 |
| Research task | $0.15–0.50 | 5–10 |
| Code audit | $1.00–5.85 | 10–25 |
| Autonomous multi-step | $2.00–15.00+ | 15–50+ |

Rule: Always hard-cap decision loops. No agent gets unlimited retries. Budget per-task, not per-month.
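The per-task budgeting rule is a few lines of code. A sketch of a budget guard that every model call books its cost against; the default per-million-token prices here are illustrative placeholders, not real rates:

```javascript
// Per-task budget guard: tracks spend across all model calls in one task
// and throws instead of letting a recursive loop run past the cap.
// Default prices are illustrative placeholders, not real vendor rates.
class TaskBudget {
  constructor(maxUsd) {
    this.maxUsd = maxUsd;
    this.spentUsd = 0;
  }
  // Record one model call's token usage; throws once the task would exceed its cap.
  charge({ inputTokens, outputTokens }, pricePerMTok = { input: 3, output: 15 }) {
    const cost =
      (inputTokens / 1e6) * pricePerMTok.input +
      (outputTokens / 1e6) * pricePerMTok.output;
    if (this.spentUsd + cost > this.maxUsd) {
      throw new Error(
        `Task budget exceeded: $${(this.spentUsd + cost).toFixed(4)} > $${this.maxUsd}`
      );
    }
    this.spentUsd += cost;
    return this.spentUsd;
  }
}
```

A turn cap bounds the number of calls; the budget bounds the dollars. You want both.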

Debate 5: Stateful vs Stateless

“Just vector DB it” fails at scale (mutation, deletion, selective recall). The production pattern emerging: structured JSON state + summarize-and-compress for context window. For durable long-running tasks: Temporal or Inngest. Write paths matter more than read paths.
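The summarize-and-compress pattern can be sketched as a pure function over structured state. `summarize` is injected — in production it would be one cheap LLM call — so the compression policy itself stays testable; the state shape and thresholds are assumptions for illustration:

```javascript
// Structured-state compression: keep a rolling summary plus the last N raw turns.
// `summarize(existingSummary, oldTurns)` is injected; in production it would be
// a cheap LLM call that folds the old turns into the summary text.
function compressHistory(state, { keepLast = 6, maxTurns = 20, summarize }) {
  if (state.turns.length <= maxTurns) return state;   // under budget: no-op
  const toFold = state.turns.slice(0, state.turns.length - keepLast);
  return {
    summary: summarize(state.summary, toFold),        // fold old turns into summary
    turns: state.turns.slice(-keepLast),              // keep only recent raw turns
  };
}
```

Because the state is plain JSON, mutation and deletion are ordinary writes — the failure modes of “just vector DB it” don’t apply.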

Debate 6: The “Agentic” Label

HFS Research coined “agentic-washing.”[19] AI21 called most agents “glorified if-else statements.” Klarna’s AI customer service reversal is the canonical proof point — the CEO admitted quality tanked and re-hired humans[20]. 78% of enterprise agent pilots didn’t reach production[21].

Debate 7: Build vs Buy

The market is bifurcating: simple agent tasks (chat + tools) are commoditizing toward direct API calls. Complex orchestration (multi-agent, durable workflows) is consolidating around LangGraph and Temporal. The middle ground — “framework that wraps a simple agent” — is dying.

V. The Graveyard — What Failed

| Architecture | What Happened | Lesson |
|---|---|---|
| Fully autonomous agents (AutoGPT, BabyAGI) | Loops burned money, produced garbage | Humans in the loop aren’t optional |
| Multi-agent communication | “Politeness loops” — agents thanking each other, $4K wasted[18] | Agent-to-agent is a research problem, not a product pattern |
| AI hardware (Humane Pin, Rabbit R1) | Both dying — no distribution, no ecosystem | Hardware needs a software moat first |
| Framework-heavy builds | Octomind: 12 months on LangChain, then ripped it out[14] | Frameworks add friction before they add value |
| Foundation model plus agent | Adept acqui-hired by Amazon | Don’t train your own model and build the agent |
| Standalone agent products | OpenAI Operator folded into ChatGPT after 7 months | Agent products get absorbed into platforms |

VI. Red Team — Stress-Testing the “Just a While-Loop” Thesis

For: The Loop Wins

  • Cursor ($1B ARR) is this pattern at scale
  • Anthropic, model creator, recommends it
  • Production teardowns all converge here
  • 100–300 lines of code. Full control. No dependency risk
  • Continuous Claude (a Bash while loop) outperforms most frameworks

Against: The Loop Isn’t Enough

  • No built-in durability — if your server crashes mid-task, state is lost
  • No audit trail, tracing, or replay without building it yourself
  • Human-in-the-loop patterns need coordination your loop doesn’t have
  • Multi-user concurrency requires session management you’ll hand-roll
  • Enterprise compliance (SOC2, HIPAA) needs the kind of logging LangSmith gives you

Grade: 70% insight, 30% oversimplification. The while-loop thesis is correct for most builders today. But it understates the operational overhead. You won’t use a framework for the LLM orchestration — but you’ll want infrastructure for observability, session management, and crash recovery. Those are infrastructure problems, not framework problems.

VII. The Decision Framework

Stop asking “which framework.” Ask these three questions:

Q1: How many users?

| Users | Architecture | Deploy |
|---|---|---|
| Just me | Cursor CLI headless (cursor-agent -p -f) | Your Mac + cron/webhook |
| 1–10 (pilot) | While-loop + Claude API + your tools | Railway / Fly.io / VPS |
| 10–1000 | While-loop + Claude API + session store (Redis/Supabase) | Railway / Fly.io |
| 1000+ | While-loop + Claude API + Temporal/Inngest for durability | AWS/GCP with auto-scaling |

Q2: What messaging channel?

| Channel | Best Option | Notes |
|---|---|---|
| WhatsApp | OpenClaw (free) or Twilio + custom (paid) | OpenClaw has gateway instability; Twilio is reliable but $$$ |
| Web / mobile app | Custom API (Express/Fastify) | Full control, cleanest architecture |
| Telegram | Bot API — free, reliable, well-documented | Easiest channel to start with |
| Discord | Discord.js + your agent loop | Good for communities |

Q3: What tools does the agent need?

| Need | Best Option | Cost |
|---|---|---|
| Read/write files | Direct filesystem or S3 | Free / pennies |
| Run code | E2B sandboxes | $21/mo base[7] |
| Browse web | Browserbase / Stagehand | $99/mo[22] |
| Many external APIs | Composio (800+ toolkits) | Free tier available[8] |
| Custom tools | MCP servers (build your own) | Free (open protocol) |

VIII. What This Means for Eric

The One Thing That Changes Everything

Claude’s extended thinking + tool_use becoming real-time. Today there’s a ~2–5s latency per tool cycle. When that drops to sub-second (inference optimization + edge deployment), the “agent as backend” becomes indistinguishable from a traditional API. That’s when agents stop being a UX compromise and start being the default architecture. Timeline: 12–18 months.

The Recommended Stack for Eric’s Projects

Donna / Beans Family PA / Sourcy (multi-user, customer-facing):
  • Model: Claude Sonnet 4 via API (tool_use)
  • Pattern: While-loop (200 lines of JS)
  • State: Supabase (you already have it) for conversations + structured JSON
  • Channel: WhatsApp via Twilio (reliable) or OpenClaw (free, riskier)
  • Identity: SOUL.md + AGENTS.md loaded as system prompt per user
  • Deploy: Railway or Fly.io ($5–20/mo)
  • Cost: ~$0.05–0.15 per interaction. At 100 daily users × 5 interactions = $25–75/day
Eric-only automation (prmupdate, research, deploys):
  • Tool: cursor-agent -p -f (already working)
  • Trigger: cron, webhook, or manual from phone via simple HTTP wrapper
  • Deploy: Your Mac (it’s already on)

Verdict

The winning architecture for agentic backends is a while-loop calling a frontier model with tool_use. Not a framework. Not an IDE-as-backend. Not a multi-agent swarm. A loop.

Every product above $50M ARR uses this pattern. Cursor is this pattern. Anthropic recommends this pattern. Teams that started with frameworks are ripping them out.

The framework layer has value for enterprise orchestration (durability, audit trails, multi-agent). But 90% of builders — including Eric — don’t need it. Build the loop, add infrastructure (observability, session management) when you need it, and invest your real engineering time in the thing no framework gives you: domain-specific conversation design, tools, and data.

The moat is not the agent architecture. The moat is the SOUL.md.

One-sentence version: “The best agentic backend is 200 lines of code calling Claude with your tools — everything else is a premature abstraction.”

Sources

[1] Anthropic revenue figures — $14B ARR, $380B valuation (Feb 2026)
[2] OpenAI Agents SDK — 18.9K GitHub stars, 9.3M monthly PyPI downloads
[3] LangChain / LangSmith — 127K stars, $135M raised at $1.25B valuation, 1/3 Fortune 500
[4] CrewAI — 43.9K stars, $18M raised
[5] Cursor / Anysphere — $1B+ ARR, $29.3B valuation
[6] Claude Code — ~$6/dev/day, full programmatic SDK, contributing to $2.5B ARR
[7] E2B — 10.9K stars, $21M Series A, used by Perplexity + HuggingFace
[8] Composio — 26.5K stars, $29M raised, 800+ toolkits
[9] OpenClaw (open-source agent runtime) — 193K stars, CVE security concerns, gateway instability
[10] MCP Servers repo — 78.6K stars, Linux Foundation governance, de facto tool interop standard
[11] Anthropic: Building Agents — “Start by using LLM APIs directly”
[12] Braintrust: Canonical Agent Architecture — while-loop formalized as the standard pattern
[13] Sketch.dev — full AI programming assistant in 9 lines of code
[14] Octomind: Why We No Longer Use LangChain — 12-month production teardown
[15] Harrison Chase on LangGraph runtime value — durability, tracing, human-in-the-loop as the real product
[16] Devin pricing — $20/mo entry (was $500), $2.25/ACU production, $10.2B valuation
[17] HN: MCP spec criticism — 623 points, spec called LLM-generated, transport layer concerns
[18] Reddit r/LocalLLaMA cost reports — $4K budget hit $11.2K in 3 weeks from recursive agent loops
[19] HFS Research: Agentic Washing — coined the term for fake agent claims
[20] Klarna AI reversal — CEO admitted quality tanked, re-hired human agents
[21] Enterprise agent pilot failure rate — 78% didn’t reach production
[22] Browserbase pricing — $99/mo base, Stagehand 21K stars, $40M raised