I. The Opportunity
The thesis: short-form video (TikTok, IG Reels, YouTube Shorts) is the dominant content format, but AI agents and workflows can’t natively process video. They need text. Converting video→text requires extracting frames, running vision models, transcribing audio — and at scale, this is brutally expensive.[1]
The proposed business: batch-process millions of videos using economies of scale (bulk API pricing, GPU batching, cached embeddings), then expose the pre-processed structured text as an API or MCP tool that other AI agents and workflows consume. They pay a fraction of what it would cost to process videos themselves.
Dog-food signal: genuine.
- PCRM research skills cover Twitter, Reddit, Threads, LinkedIn, Product Hunt, GitHub, App Store, Taobao, Vietnamese forums — but NOT TikTok or IG Reels
- Adding video platform research requires frame-by-frame vision processing
- At ~43 seconds average per TikTok video[2] and 258 tokens per frame[3], a single video costs ~11,000 tokens on Gemini Flash
- Processing 1,000 videos for one research query = ~11M tokens = US$1.67 input alone (Gemini 2.5 Flash batch) — this adds up fast across multiple daily research runs
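The bullet math above can be reproduced in a few lines. All figures come from the text (43 s average length, 1 frame/s sampling, 258 tokens/frame, $0.15/M batch input tokens); the vision-only result matches the text's US$1.67/1K to rounding:

```python
# Vision-token cost of one average TikTok on Gemini 2.5 Flash (batch).
FRAMES = 43                 # 43 s video sampled at 1 frame/s
TOKENS_PER_FRAME = 258      # Gemini low-res frame tokenization
BATCH_PER_M = 0.15          # $/M input tokens (50% off the $0.30/M list price)

video_tokens = FRAMES * TOKENS_PER_FRAME                # 11,094 tokens
cost_per_1k = video_tokens / 1e6 * BATCH_PER_M * 1000   # cost per 1,000 videos
print(f"{video_tokens} tokens/video -> ${cost_per_1k:.2f} per 1,000 videos")
```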
II. Market Sizing
| Segment | Size | Notes |
| --- | --- | --- |
| AI Video Analytics | US$6.2B | |
| Social Listening | US$10.5B | |
| Video Understanding API | US$50–200M | Pre-category, fragmented[6] |
| Your Addressable Slice | US$2–8M | AI agent workflows, dev tools |
The global AI video analytics market is real (US$6.2B in 2026, growing to US$17B by 2031[4]). But this includes surveillance, CCTV, manufacturing QA — most of which is NOT your market.
The actual addressable category is “video understanding APIs for developers and AI agents” — a pre-category market dominated by Twelve Labs (US$107M raised[6]) and a constellation of smaller players. Social media listening (US$10.5B[5]) is the downstream buyer, but they build internally or buy from Brandwatch/Sprinklr, not from an indie API.
Honest sizing: your realistic addressable market is US$2–8M.
That’s AI agent builders + indie devs + small social listening tools who need pre-processed video content and don’t want to manage their own vision pipeline. It’s a real niche. It’s not a venture-scale TAM.
III. The Math That Kills It (Unit Economics)
This is the section that matters most. The entire thesis rests on economies of scale — can you process video cheaply enough in bulk to resell at a margin? Let’s find out.
Cost Per Video: Raw Input
Average TikTok video: 43 seconds.[2] Gemini samples frames at 1 frame/second: 43 frames × 258 tokens/frame = 11,094 tokens of video input, plus ~500 tokens of audio = ~11,600 tokens total input.[3]
| Provider | Model | Cost / Video (43s) | Cost / 1K Videos | Notes |
| --- | --- | --- | --- | --- |
| Google | Gemini 2.5 Flash (batch, 50% off)[7] | US$0.0017 | US$1.67 | Current workhorse. Best price/quality. |
| Google | Gemini 2.5 Flash (real-time) | US$0.0033 | US$3.33 | Standard pricing, no batch discount. |
| Google | Gemini 3 Flash (batch) | US$0.0028 | US$2.78 | Newest model. $0.50/M input, 50% batch. |
| Google | Gemini 2.5 Pro (batch)[25] | US$0.0069 | US$6.94 | Highest quality. $1.25/M, 50% batch. |
| OpenAI | GPT-4.1 (frame extraction)[26] | US$0.017 | US$17.15 | ~772 tokens/frame. $2.00/M input. |
| OpenAI | GPT-4.1 mini | US$0.007 | US$7.20 | ~1,024 tokens/frame. $0.40/M input. |
| Anthropic | Claude Sonnet 4[27] | US$0.033 | US$33.33 | No native video — frame extraction only. $3.00/M. |
| Twelve Labs | Marengo (video understanding)[9] | US$0.030 | US$30.10 | $0.042/min indexing. Proprietary models. |
| Google Cloud | Video Intelligence API[10] | US$0.072 | US$71.67 | $0.10/min label + text + speech. |
| Google | Gemini 2.0 Flash (batch) — legacy | US$0.00029 | US$0.29 | Old model, floor price. Not representative. |
The death number: US$1.67 per 1,000 videos on Gemini 2.5 Flash batch.
- Anyone with a Gemini API key can process 1,000 TikTok videos for $1.67 on the current best model (2.5 Flash batch). Today.
- If you want GPT-4.1 quality: $17/1K. Claude Sonnet 4: $33/1K. These are meaningfully expensive — but the floor (Gemini batch) is still accessible to anyone.
- Gemini batch gives everyone 50% off — you don’t have a volume advantage they can’t get themselves
- Context caching adds another 90% discount for repeated videos[7]
- Vision costs are falling ~10x/year — today’s $1.67 will be $0.17 in 12 months
The Full COGS Stack
Processing video isn’t just vision tokens. Here’s the full cost per video:
| Cost Component | Per Video | Per 1K Videos | Assumption |
| --- | --- | --- | --- |
| Vision processing (Gemini 2.5 Flash batch) | $0.0017 | $1.67 | 11.6K input tokens @ $0.15/M (batch) |
| Output generation (structured text) | $0.00038 | $0.38 | ~300 output tokens @ $1.25/M (batch) |
| Video download + storage (temp) | $0.00010 | $0.10 | S3 egress + 5MB avg video, ephemeral |
| TikTok/IG scraping (Apify or custom) | $0.00100 | $1.00 | SocialKit $13/2K requests[11] |
| Infrastructure (queue, API, DB) | $0.00020 | $0.20 | Railway/Fly, amortized |
| Total COGS | $0.00338 | $3.35 | |
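The column sums can be checked mechanically. The per-1K figures are taken straight from the table; the derived per-video total differs from the table's $0.00338 only by rounding:

```python
# Sum of the per-1K-video COGS components from the table above.
cogs_per_1k = {
    "vision (Gemini 2.5 Flash batch)": 1.67,
    "output generation": 0.38,
    "download + temp storage": 0.10,
    "TikTok/IG scraping": 1.00,
    "infrastructure (queue, API, DB)": 0.20,
}
total_per_1k = sum(cogs_per_1k.values())
per_video = total_per_1k / 1000
print(f"Total COGS: ${total_per_1k:.2f}/1K videos (${per_video:.5f}/video)")
```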
The Margin Problem
To sell this at a margin, you need to charge more than $0.0034/video. But your competition is the customer doing it themselves:
| Scenario | Your Price | DIY Cost (Gemini 2.5 Flash batch) | Your Margin | Customer Saves? |
| --- | --- | --- | --- | --- |
| Premium (convenience tax) | $0.02/video | $0.0021 | 83% | No — 9.5x more expensive |
| Competitive | $0.01/video | $0.0021 | 66% | No — 4.8x more expensive |
| Aggressive | $0.005/video | $0.0021 | 32% | No — 2.4x more expensive |
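The margin and multiple columns follow directly from the COGS stack ($0.00338/video) and a DIY cost of ~$0.0021/video (the customer paying Gemini batch for vision plus output tokens only); a minimal sketch:

```python
# Margin vs. DIY-multiple math behind the pricing table above.
COGS_PER_VIDEO = 0.00338   # full COGS stack, per video
DIY_PER_VIDEO = 0.0021     # customer calling Gemini batch directly

results = {}
for label, price in [("premium", 0.02), ("competitive", 0.01), ("aggressive", 0.005)]:
    margin = (price - COGS_PER_VIDEO) / price   # gross margin at that price
    multiple = price / DIY_PER_VIDEO            # how much more the customer pays vs. DIY
    results[label] = (margin, multiple)
    print(f"{label}: margin {margin:.0%}, customer pays {multiple:.1f}x DIY")
```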
The fundamental problem: Gemini batch already IS the economies of scale.
Google processes video at $0.15/M tokens (Gemini 2.5 Flash batch) because they run it on their own TPUs at near-zero marginal cost. You can’t out-scale Google. You can’t get cheaper than their bulk pricing. The “economies of scale” thesis assumes you can get a better price than your customers — but Google gives everyone the same 50% batch discount. And today’s $0.15/M will be $0.015/M in 18 months.
IV. Competitive Landscape
A. The Funded Players
| Company | Raised | Model | What They Do | Why It Matters |
| --- | --- | --- | --- | --- |
| Twelve Labs | US$107M[6] | API (video understanding) | Proprietary models (Marengo, Pegasus). Search, summarize, embed video. MCP server for agents.[12] | Direct competitor with 100x your capital. Already launched MCP server for agent integration. |
| Plot | US$4.1M[13] | SaaS (video social listening) | AI video social listening for TikTok, Reels, Shorts. Backed by Alexis Ohanian. Warby Parker, Visa, Mastercard clients. | Exactly your downstream market — but they sell insights, not raw API. |
| VideoDB | Unfunded[14] | API + MCP | Video infrastructure: indexing, search, embeddings. Open-source agent toolkit. MCP server. | Unfunded but first-mover on MCP agent integration. Acquired Devzery (testing). |
| Videolyze | Unknown | SaaS ($79–499/mo)[15] | AI video analysis for TikTok/YouTube. Sentiment, mentions, audience insights. | Low-end competitor. “Analysis-day” pricing model. |
B. The Scrape-and-Transcribe Layer
A cheaper layer already exists that does 80% of what the thesis proposes:
| Service | Price | Platforms | What It Does |
| --- | --- | --- | --- |
| SocialKit[11] | $13/mo (2K requests) | TikTok, YouTube | Transcript extraction with timestamps. IG coming soon. |
| Apify (NextAPI)[16] | Pay-per-event | 1,000+ platforms | Video-to-text with translation. MCP server available. |
| Apify (InVideoIQ)[17] | $35/1K results | IG, FB, X, TikTok | Speech-to-text (not just captions). 99+ languages. |
| SociaVault[18] | Pay-per-use | Instagram | Reels transcript + language detection + confidence scores. |
| Supadata[19] | $9/1K credits | YouTube, TikTok | Web scraping + transcript extraction. |
Critical distinction: transcription vs. understanding.
- Transcription (what SocialKit/Apify do): extract spoken words from audio. Cheap ($0.006/min via Whisper[20]). Doesn’t see what’s ON SCREEN.
- Understanding (what the thesis proposes): see visual content, read text overlays, detect products, describe scenes. Requires vision models. 10–100x more expensive.
- The gap is real — but the price premium to close it is collapsing as Gemini gets cheaper every quarter
C. The Big Tech Ceiling
Google, OpenAI, and Anthropic all offer multimodal APIs that process video or image frames. They have three structural advantages you cannot match:
- Zero marginal cost on inference — they own the GPUs and the models
- Native batch pricing — Gemini gives everyone 50% off batch, you can’t get a better deal
- Context caching — 90% discount on repeated content[7], destroying the cache-and-resell thesis
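To see how hard that ceiling binds, here is the discount stack in numbers. One assumption to flag: this sketch treats the 90% cache discount as stacking on top of the batch rate and applying to video tokens, which is illustrative arithmetic, not quoted pricing:

```python
# Discount stack on Gemini 2.5 Flash input pricing (figures from ref [7]).
# ASSUMPTION: cache discount stacks on the batch rate and covers video tokens.
LIST_PER_M = 0.30                  # $/M input tokens, list price
batch_per_m = LIST_PER_M * 0.5     # 50% batch discount -> $0.15/M
cached_per_m = batch_per_m * 0.1   # 90% cache discount -> $0.015/M

TOKENS_PER_VIDEO = 11_094          # 43 frames x 258 tokens
fresh_per_1k = TOKENS_PER_VIDEO / 1e6 * batch_per_m * 1000
cached_per_1k = TOKENS_PER_VIDEO / 1e6 * cached_per_m * 1000
print(f"fresh: ${fresh_per_1k:.2f}/1K videos, cached: ${cached_per_1k:.3f}/1K")
```

Every customer gets both discounts directly from Google, so a reseller's "bulk rate" is never below the customer's own rate.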
V. Stress-Testing the “Economies of Scale” Thesis
The core argument: “If lots of people want to process videos, we bulk-process for them with economies of scale, then sell the pre-processed text.” Let’s test every assumption.
The Bull Case
- Real pain — video is the last unstructured data silo for AI agents
- Convenience — scraping + downloading + processing + structuring is genuinely annoying to build
- Pre-indexed library — if you’ve already processed 10M TikToks, a customer can query instantly vs. waiting 24hrs for batch
- MCP/agent native — agents need a tool they can call, not a pipeline they need to build
- Scraping moat — TikTok/IG scraping is fragile; maintaining scrapers has operational cost
The Bear Case
- Zero cost advantage — Gemini batch gives everyone 50% off; you can’t undercut Google
- Twelve Labs has $107M and already launched an MCP server for agent video understanding
- Apify already has Video-to-Text MCP[16] on their marketplace
- Cache = stale — social media content is ephemeral; a 24hr-old TikTok analysis is already outdated for trend research
- Platform risk — TikTok scraping is legally grey (ToS violation)[21]; one API shutdown kills the product
- Vision costs are falling 10x/year — your margin gets compressed every quarter
- No network effects — a video processing API is a commodity; customer switches when cheaper option appears
The “Pre-Indexed Library” Counter-Argument
The strongest version of the bull case is: “We process ALL popular TikToks/Reels proactively, build a searchable library, and customers query it instantly instead of waiting for batch processing.”
This is actually a different business.
A pre-indexed library of video content = a social media intelligence platform, not a processing API. That’s what Plot ($4.1M raised[13]), Brandwatch (US$450M+ revenue), and Sprinklr (NASDAQ: CXM, US$734M revenue) already do. You’d be competing with enterprise SaaS companies, not selling a developer API.
The Real Question: Why Not Just Use Gemini Directly?
This is the question every potential customer will ask. Your answer needs to be one of:
- “We’re cheaper” — You’re not. Gemini 2.5 Flash batch at $0.15/M tokens is accessible to anyone. FAILS
- “We’re faster” — Only if you’ve pre-processed the exact video they need. Otherwise you’re also calling Gemini. FRAGILE
- “We handle the scraping” — Real value. Scraping TikTok/IG is fragile and annoying. But Apify already does this as MCP. COMMODITIZED
- “We structure the output for agents” — This is the only defensible angle — but it’s a thin wrapper, not a moat. THIN
VI. Pattern Matching: Who Succeeded, Who Failed
Succeeded (But Different Business)
| Company | Model | Revenue/Scale | Why It Worked | Why It Doesn’t Apply |
| --- | --- | --- | --- | --- |
| Twelve Labs | Video understanding API | $107M raised, 30K users[6] | Built proprietary models before Gemini existed. First-mover. | Had $107M to build custom models. You’d be reselling Gemini. |
| Roboflow | Computer vision platform | $25M raised[22] | Batch video inference 100x cheaper than real-time. Developer tools. | They own the inference infrastructure. You’d be a reseller. |
| Plot | Video social listening SaaS | $4.1M seed, Warby Parker/Visa clients[13] | Sells insights to brand teams, not raw API to developers. | They sell a SaaS dashboard at $X00/mo, not per-video API. |
| Brandwatch | Social listening enterprise | US$450M+ revenue | 20+ year head start. Enterprise sales. Image/video monitoring is a feature, not the product. | You can’t sell enterprise SaaS solo. |
The Playbook That Works: Vertical Intelligence, Not Horizontal API
Every success in this space sells insights to a specific buyer, not raw processing to developers:
- Plot sells to brand marketing teams (Warby Parker, Visa)
- TikAlyzer sells to individual creators (hook analysis, viral prediction)[23]
- HookScan sells to content creators (first 3–5 second optimization)[24]
- Videolyze sells to agencies (sentiment reports, client-ready analysis)[15]
Nobody succeeded selling raw video-to-text as a horizontal API. The ones that survived either built proprietary models (Twelve Labs) or sold vertical intelligence (Plot, Brandwatch).
VII. Founder-Contextualized GTM
Eric’s Assets
- Dog-food — PCRM genuinely needs TikTok/IG video research
- Existing research skills infrastructure — 22 skills already built (Twitter, Reddit, Threads, App Store, etc.)
- MCP/agent expertise — builds agent tools daily (Donna, Sourcy, claw.degree)
- Agentic backend thesis — knows the while-loop+tool_use pattern
Eric’s Constraints
- 3 active projects — Donna (#1), Wenhao (#2), Dunbar (#3). Sourcy at 10-20% cap. Adding another project contradicts the project discipline commitment.
- $1K/week token burn — unsustainable. Adding a business whose margin compresses with every Gemini price drop makes this worse.
- No go-to-market for developer API — Eric’s edge is network + relationships (Dunbar), not dev tool marketing.
What Eric should actually do: build the TikTok/IG skill for PCRM.
- The real need is a PCRM research skill, not an API business
- Build tiktok-research/SKILL.md and ig-reels-research/SKILL.md
- Use Gemini 2.5 Flash batch directly ($1.67/1K videos — good enough quality, sustainable cost)
- Combine audio transcription (Whisper, $0.006/min) with visual scene description (Gemini)
- Total cost per research query (100 videos): ~$0.20. Sustainable at $100 HKD/run.
- This is a tool for Donna, not a product to sell
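As a sanity check on the ~$0.20 figure, a sketch using the per-video costs from the COGS section. It counts vision and output tokens only, assuming scraping and infrastructure are covered by existing PCRM plumbing; the 7.8 USD→HKD rate is an assumption for the comparison against the $100 HKD/run price:

```python
# Cost of one 100-video research run on Gemini 2.5 Flash batch.
VISION_PER_VIDEO = 0.00167   # ~11.1K input tokens @ $0.15/M (batch)
OUTPUT_PER_VIDEO = 0.00038   # ~300 output tokens @ $1.25/M (batch)
VIDEOS = 100

usd = VIDEOS * (VISION_PER_VIDEO + OUTPUT_PER_VIDEO)
hkd = usd * 7.8              # assumed USD -> HKD conversion rate
print(f"~${usd:.3f} USD (~HK${hkd:.1f}) per 100-video run")
```

At roughly HK$1.60 of COGS against a HK$100 run price, the skill is comfortably sustainable.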
VIII. Red Team: Steel-Manning the Business Case
The strongest version of this business is NOT a generic video-to-text API. It’s one of these:
Shape A: “TikTok Research MCP” (Agent Tool)
An MCP server that agents can call to search TikTok/IG by topic and get structured insights. Not raw text — analyzed, trend-identified, sentiment-tagged content. Apify’s Video-to-Text MCP[16] does the raw version; this does the intelligence version. Price: $0.05–0.10/query (bundle of 50–100 videos analyzed). Margin: 60–70% if you use Gemini batch + Whisper.
Problem: It’s an MCP tool, not a business. MCP tools are meant to be free/cheap. The Twelve Labs MCP server[12] is already available. You’d be competing with free.
Shape B: “Pre-Indexed Social Video Library” (SaaS)
Continuously process the top 100K TikToks/Reels per day. Build a searchable database. Sell to brand teams at $199–499/mo (like Videolyze[15]). This is the Plot playbook at 1/100th the scale.
Problem: You’d need to process 100K videos/day = US$167/day COGS (Gemini 2.5 Flash batch) + scraping costs. ~US$5,200/mo infrastructure. Need 11–26 paying customers at $199–499/mo to break even. Harder than it looks — and it’s a SaaS sales grind, not a build-and-sell-to-agents play.
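The break-even arithmetic above can be checked in a few lines, using only the figures from that paragraph:

```python
# Break-even check for Shape B (pre-indexed social video library).
videos_per_day = 100_000
vision_cogs = 0.00167                           # $/video, Gemini 2.5 Flash batch
daily_gemini = videos_per_day * vision_cogs     # Gemini spend per day, ~US$167
monthly_all_in = 5_200                          # text's all-in monthly estimate

customers_needed = {price: monthly_all_in / price for price in (499, 199)}
for price, n in customers_needed.items():
    print(f"at ${price}/mo: ~{n:.1f} paying customers to cover costs")
```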
Shape C: “Donna’s Eyes” (Internal Feature)
Build TikTok/IG video understanding as a Donna feature. Use it for Eric’s own research runs, Vincent’s opportunity screens, client reports. The “product” is the research output (charged at $100 HKD/run), not the video processing layer.
This is the only version that makes sense for Eric today. It directly improves the #1 priority (Donna), doesn’t require marketing a developer API, and the “customer” already exists (Eric himself + Vincent + future research clients).
Verdict
Don’t build the API business. Build the skill.
The “economies of scale” thesis is broken by a structural fact: Google gives everyone the same 50% batch discount. Gemini 2.5 Flash batch processes 1,000 videos for US$1.67. GPT-4.1 costs $17/1K, Claude Sonnet 4 costs $33/1K — the premium models are meaningfully expensive, but the floor is accessible to anyone with an API key. And that floor drops ~10x per year.
Every successful company in video intelligence either (a) built proprietary models before Gemini existed (Twelve Labs, $107M), or (b) sells vertical insights to specific buyers (Plot, Brandwatch), not horizontal processing APIs to developers.
The real opportunity is obvious and already in front of you: build TikTok and IG Reels as PCRM research skills. Use Gemini 2.5 Flash batch for vision + Whisper for transcription. Cost: ~$0.20 per 100-video research query. Easily sustainable at $100 HKD/run. This directly upgrades Donna (#1 priority), makes every future client research report richer, and doesn’t require marketing a new product.
The value isn’t in the processing. It’s in the judgment — knowing which 100 TikToks to analyze, what questions to ask of them, and how to synthesize the signal into a decision. That’s what clients pay $100 HKD/run for. That’s Donna’s moat. Not a commodity API.
References
[1] OpenAI Developer Forum — Discussion on vision API costs for video processing, 2025.
[2] TTS Vibes — Average TikTok video length: 42.7 seconds in 2024, up 9.5% from 39s in 2023.
[3] S. Anand — Gemini processes video at 1 FPS, 258 tokens per frame at low resolution.
[4] Research and Markets — AI Video Analytics Market: US$6.19B (2026), US$17.24B (2031), 22.74% CAGR.
[5] Industry Today — Social Media Listening Market: US$10.46B (2025), US$25.69B (2032), 13.7% CAGR.
[6] Parsers.vc — Twelve Labs: $107M total raised, including $30M Dec 2024 from Databricks, Snowflake, In-Q-Tel.
[7] Google AI — Gemini API pricing: batch 50% off, context caching 90% off. Gemini 2.5 Flash: $0.30/M input, $2.50/M output. Gemini 3 Flash: $0.50/M input, $3.00/M output.
[8] StrathWeb — GPT-4o-mini vision: 2,833 tokens/frame (fixed), making it 2x more expensive than GPT-4o for bulk image processing.
[9] Twelve Labs Pricing — Marengo video indexing: $0.042/min. Analyze/Summarize: $0.021/min.
[10] Google Cloud Video Intelligence — Label detection $0.10/min, text detection $0.15/min, speech $0.048/min. First 1K min/mo free.
[11] SocialKit — TikTok Transcript API: $13/mo for 2,000 requests. Multi-platform, timestamped segments.
[12] Twelve Labs Blog — Twelve Labs MCP Server: enables LLM agents to index, search, summarize video via a standardized interface.
[13] Adweek — Plot raises $4.1M seed from Alexis Ohanian. AI video social listening for TikTok/Reels/Shorts.
[14] Tracxn — VideoDB: unfunded, founded 2023, SF. Acquired Devzery. Video infrastructure + MCP agent toolkit.
[15] Videolyze — AI video analysis for TikTok/YouTube. Free–$499/mo. Analysis-day pricing model.
[16] Apify (NextAPI) — Video-to-Text MCP server: converts videos from 1,000+ platforms to text. 12+ languages. Already agent-native.
[17] Apify (InVideoIQ) — Video Transcriber: $35/1K results. Speech-to-text across IG, FB, X, TikTok. 99+ languages.
[18] SociaVault — Instagram Transcript API: speech-to-text for Reels, segmented timestamps, language detection.
[19] Supadata — Video Transcript API: $9/1K credits. YouTube + TikTok. Free 100 credits/mo.
[20] Carl Pearson — OpenAI Whisper: $0.006/min (~$0.36/hr) for speech transcription.
[21] SociaVault — TikTok scraping: legally permissible (hiQ v. LinkedIn, 2022) but ToS prohibit automated collection.
[22] Roboflow — Batch video inference up to 100x cheaper than real-time via GPU batching optimizations.
[23] TikAlyzer — AI TikTok analysis: frame-by-frame hook analysis, retention prediction, viral element detection.
[24] HookScan — AI-powered video hook analysis: first 3–5 seconds optimization for TikTok/Shorts/Reels.
[25] PricePerToken — Gemini 2.5 Pro: $1.25/M input, $10.00/M output. 50% batch discount. 1M context window.
[26] PricePerToken — GPT-4.1: $2.00/M input, $8.00/M output. ~772 tokens per image frame. 1M context window.
[27] PricePerToken — Claude Sonnet 4: $3.00/M input, $15.00/M output. No native video — frame extraction only. 50% batch discount.