Video-to-Text as a Service

Can bulk video processing become an API business? Or is Gemini already cheaper than you?
17 February 2026

I. The Opportunity

The thesis: short-form video (TikTok, IG Reels, YouTube Shorts) is the dominant content format, but AI agents and workflows can’t natively process video. They need text. Converting video→text requires extracting frames, running vision models, transcribing audio — and at scale, this is brutally expensive.[1]

The proposed business: batch-process millions of videos using economies of scale (bulk API pricing, GPU batching, cached embeddings), then expose the pre-processed structured text as an API or MCP tool that other AI agents and workflows consume. They pay a fraction of what it would cost to process videos themselves.

Dog-food signal: genuine.
  • PCRM research skills cover Twitter, Reddit, Threads, LinkedIn, Product Hunt, GitHub, App Store, Taobao, Vietnamese forums — but NOT TikTok or IG Reels
  • Adding video platform research requires frame-by-frame vision processing
  • At ~43 seconds average per TikTok video[2] and 258 tokens per frame[3], a single video consumes ~11,000 input tokens on Gemini Flash
  • Processing 1,000 videos for one research query = ~11M tokens = US$1.67 of input alone (Gemini 2.5 Flash batch); this adds up fast across multiple daily research runs

II. Market Sizing

| Market | Size | Notes |
|---|---|---|
| AI Video Analytics | US$6.2B | 2026, 22.7% CAGR[4] |
| Social Listening | US$10.5B | 2025, 13.7% CAGR[5] |
| Video Understanding API | US$50–200M | pre-category, fragmented[6] |
| Your Addressable Slice | US$2–8M | AI agent workflows, dev tools |

The global AI video analytics market is real (US$6.2B in 2026, growing to US$17B by 2031[4]). But this includes surveillance, CCTV, manufacturing QA — most of which is NOT your market.

The actual addressable category is “video understanding APIs for developers and AI agents” — a pre-category market dominated by Twelve Labs (US$107M raised[6]) and a constellation of smaller players. Social media listening (US$10.5B[5]) is the downstream buyer, but they build internally or buy from Brandwatch/Sprinklr, not from an indie API.

Honest sizing: your realistic addressable market is US$2–8M. That’s AI agent builders + indie devs + small social listening tools who need pre-processed video content and don’t want to manage their own vision pipeline. It’s a real niche. It’s not a venture-scale TAM.

III. The Math That Kills It (Unit Economics)

This is the section that matters most. The entire thesis rests on economies of scale — can you process video cheaply enough in bulk to resell at a margin? Let’s find out.

Cost Per Video: Raw Input

Average TikTok video: 43 seconds.[2] Gemini samples at 1 frame/second = 43 frames × 258 tokens/frame = 11,094 tokens of video input, plus ~500 tokens of audio = ~11,600 tokens total input.[3]
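
The arithmetic above can be sketched in a few lines. This is a back-of-envelope check using the article’s cited figures, not live API prices; the ~$1.67/1K number counts video tokens only:

```python
# Back-of-envelope: per-video token count and cost for a 43 s TikTok on
# Gemini 2.5 Flash batch, using the article's figures (not live pricing).
SECONDS = 43
TOKENS_PER_FRAME = 258        # Gemini samples 1 frame/s at low resolution [3]
AUDIO_TOKENS = 500            # rough audio-token estimate from the text

video_tokens = SECONDS * TOKENS_PER_FRAME      # 11,094
total_tokens = video_tokens + AUDIO_TOKENS     # ~11,600

BATCH_PRICE_PER_M = 0.15                       # USD per 1M input tokens (batch)
cost_per_video = video_tokens / 1e6 * BATCH_PRICE_PER_M   # ~$0.0017
cost_per_1k = 1000 * cost_per_video                       # ~US$1.67
```

The same three constants (seconds, tokens per frame, price per million) drive every cost table in this section, so swapping in a different model’s price is a one-line change.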

| Provider | Model | Cost / Video (43s) | Cost / 1K Videos | Notes |
|---|---|---|---|---|
| Google | Gemini 2.5 Flash (batch, 50% off)[7] | US$0.0017 | US$1.67 | Current workhorse. Best price/quality. |
| Google | Gemini 2.5 Flash (real-time) | US$0.0033 | US$3.33 | Standard pricing, no batch discount. |
| Google | Gemini 3 Flash (batch) | US$0.0028 | US$2.78 | Newest model. $0.50/M input, 50% batch. |
| Google | Gemini 2.5 Pro (batch)[25] | US$0.0069 | US$6.94 | Highest quality. $1.25/M, 50% batch. |
| OpenAI | GPT-4.1 (frame extraction)[26] | US$0.017 | US$17.15 | ~772 tokens/frame. $2.00/M input. |
| OpenAI | GPT-4.1 mini | US$0.007 | US$7.20 | ~1,024 tokens/frame. $0.40/M input. |
| Anthropic | Claude Sonnet 4[27] | US$0.033 | US$33.33 | No native video; frame extraction only. $3.00/M. |
| Twelve Labs | Marengo (video understanding)[9] | US$0.030 | US$30.10 | $0.042/min indexing. Proprietary models. |
| Google Cloud | Video Intelligence API[10] | US$0.072 | US$71.67 | Label detection alone at $0.10/min; text and speech cost extra. |
| Google | Gemini 2.0 Flash (batch, legacy) | US$0.00029 | US$0.29 | Old model, floor price. Not representative. |

The death number: US$1.67 per 1,000 videos on Gemini 2.5 Flash batch.
  • Anyone with a Gemini API key can process 1,000 TikTok videos for $1.67 on the current best model (2.5 Flash batch). Today.
  • If you want GPT-4.1 quality: $17/1K. Claude Sonnet 4: $33/1K. These are meaningfully expensive — but the floor (Gemini batch) is still accessible to anyone.
  • Gemini batch gives everyone 50% off — you don’t have a volume advantage they can’t get themselves
  • Context caching adds another 90% discount for repeated videos[7]
  • Vision costs are falling ~10x/year — today’s $1.67 will be $0.17 in 12 months

The Full COGS Stack

Processing video isn’t just vision tokens. Here’s the full cost per video:

| Cost Component | Per Video | Per 1K Videos | Assumption |
|---|---|---|---|
| Vision processing (Gemini 2.5 Flash batch) | $0.0017 | $1.67 | ~11.1K video input tokens @ $0.15/M (batch) |
| Output generation (structured text) | $0.00038 | $0.38 | ~300 output tokens @ $1.25/M (batch) |
| Video download + storage (temp) | $0.00010 | $0.10 | S3 egress + 5MB avg video, ephemeral |
| TikTok/IG scraping (Apify or custom) | $0.00100 | $1.00 | SocialKit $13/2K requests[11] |
| Infrastructure (queue, API, DB) | $0.00020 | $0.20 | Railway/Fly, amortized |
| Total COGS | $0.00338 | $3.35 | |
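
The stack sums as follows. A sketch using the table’s own line items; cent-level drift between columns is rounding:

```python
# Full per-video COGS from the table above (USD). These are the article's
# assumed line items, not measured costs.
cogs_per_video = {
    "vision (Gemini 2.5 Flash batch)": 0.00167,
    "output (~300 tokens @ $1.25/M)":  0.00038,
    "download + temp storage":         0.00010,
    "scraping (SocialKit-style)":      0.00100,
    "infra (queue, API, DB)":          0.00020,
}
total_per_video = sum(cogs_per_video.values())   # ~$0.0034
total_per_1k = 1000 * total_per_video            # ~$3.35
```

Note that scraping, not vision, is the second-largest line item — the model call is already near the floor.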

The Margin Problem

To sell this at a margin, you need to charge more than $0.0034/video. But your competition is the customer doing it themselves:

| Scenario | Your Price | DIY Cost (Gemini 2.5 Flash batch) | Your Margin | Customer Saves? |
|---|---|---|---|---|
| Premium (convenience tax) | $0.02/video | $0.0021 | 83% | No (9.5x more expensive) |
| Competitive | $0.01/video | $0.0021 | 66% | No (4.8x more expensive) |
| Aggressive | $0.005/video | $0.0021 | 32% | No (2.4x more expensive) |

The fundamental problem: Gemini batch already IS the economies of scale. Google processes video at $0.15/M tokens (Gemini 2.5 Flash batch) because they run it on their own TPUs at near-zero marginal cost. You can’t out-scale Google. You can’t get cheaper than their bulk pricing. The “economies of scale” thesis assumes you can get a better price than your customers — but Google gives everyone the same 50% batch discount. And today’s $0.15/M will be $0.015/M in 18 months.
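
The margin table reduces to two numbers per scenario: your gross margin against full COGS, and the multiple the customer pays versus going direct. A sketch with the table’s own figures:

```python
# Reproduce the margin table: gross margin vs full COGS, and how many times
# the DIY (direct-to-Gemini) cost each price point represents.
FULL_COGS = 0.00338   # per video, from the COGS table
DIY_COST = 0.0021     # customer's own Gemini batch cost (vision + output)

def scenario(price: float) -> tuple[float, float]:
    margin = (price - FULL_COGS) / price   # your gross margin
    multiple = price / DIY_COST            # customer's cost vs DIY
    return margin, multiple

for price in (0.02, 0.01, 0.005):
    m, x = scenario(price)
    print(f"${price}/video: {m:.0%} margin, customer pays {x:.1f}x DIY")
```

The tension is visible immediately: every price that gives you a healthy margin makes the customer strictly worse off than an API key of their own.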

IV. Competitive Landscape

A. The Funded Players

| Company | Raised | Model | What They Do | Why It Matters |
|---|---|---|---|---|
| Twelve Labs | US$107M[6] | API (video understanding) | Proprietary models (Marengo, Pegasus). Search, summarize, embed video. MCP server for agents.[12] | Direct competitor with 100x your capital. Already launched MCP server for agent integration. |
| Plot | US$4.1M[13] | SaaS (video social listening) | AI video social listening for TikTok, Reels, Shorts. Backed by Alexis Ohanian. Warby Parker, Visa, Mastercard clients. | Exactly your downstream market, but they sell insights, not raw API. |
| VideoDB | Unfunded[14] | API + MCP | Video infrastructure: indexing, search, embeddings. Open-source agent toolkit. MCP server. | Unfunded but first-mover on MCP agent integration. Acquired Devzery (testing). |
| Videolyze | Unknown | SaaS ($79–499/mo)[15] | AI video analysis for TikTok/YouTube. Sentiment, mentions, audience insights. | Low-end competitor. “Analysis-day” pricing model. |

B. The Scrape-and-Transcribe Layer

A cheaper layer already exists that does 80% of what the thesis proposes:

| Service | Price | Platforms | What It Does |
|---|---|---|---|
| SocialKit[11] | $13/mo (2K requests) | TikTok, YouTube | Transcript extraction with timestamps. IG coming soon. |
| Apify (NextAPI)[16] | Pay-per-event | 1,000+ platforms | Video-to-text with translation. MCP server available. |
| Apify (InVideoIQ)[17] | $35/1K results | IG, FB, X, TikTok | Speech-to-text (not just captions). 99+ languages. |
| SociaVault[18] | Pay-per-use | Instagram | Reels transcript + language detection + confidence scores. |
| Supadata[19] | $9/1K credits | YouTube, TikTok | Web scraping + transcript extraction. |

Critical distinction: transcription vs. understanding.
  • Transcription (what SocialKit/Apify do): extract spoken words from audio. Cheap ($0.006/min via Whisper[20]). Doesn’t see what’s ON SCREEN.
  • Understanding (what the thesis proposes): see visual content, read text overlays, detect products, describe scenes. Requires vision models, historically 10–100x more expensive.
  • The gap is real, but the price premium to close it is collapsing as Gemini gets cheaper every quarter
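
That collapsing premium can be checked against the article’s own prices. A sketch of per-video cost on the 43-second average, using Whisper’s per-minute rate and the token rates from the tables above:

```python
# Transcription vs understanding, per 43 s video, at the article's prices.
SECONDS = 43
whisper_cost = SECONDS / 60 * 0.006        # audio-only transcript: ~$0.0043
gemini_cost = SECONDS * 258 / 1e6 * 0.15   # full visual understanding
                                           # on Gemini 2.5 Flash batch: ~$0.0017
gpt41_cost = 0.017                         # premium vision (GPT-4.1), per table

# On batch Gemini, "understanding" now costs LESS per video than a Whisper
# transcript; only the premium vision models still carry a several-fold markup.
```

This is the quiet killer for the thesis: the expensive half of the pipeline is no longer expensive at the floor.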

C. The Big Tech Ceiling

Google, OpenAI, and Anthropic all offer multimodal APIs that process video or image frames. They have three structural advantages you cannot match:
  • They own the inference hardware and run it at near-zero marginal cost
  • They set the price floor, and keep cutting it (~10x/year)
  • They sell to your customers directly, at the same batch discounts you get

V. Stress-Testing the “Economies of Scale” Thesis

The core argument: “If lots of people want to process videos, we bulk-process for them with economies of scale, then sell the pre-processed text.” Let’s test every assumption.

The Bull Case

  • Real pain — video is the last unstructured data silo for AI agents
  • Convenience — scraping + downloading + processing + structuring is genuinely annoying to build
  • Pre-indexed library — if you’ve already processed 10M TikToks, a customer can query instantly vs. waiting 24hrs for batch
  • MCP/agent native — agents need a tool they can call, not a pipeline they need to build
  • Scraping moat — TikTok/IG scraping is fragile; maintaining scrapers has operational cost

The Bear Case

  • Zero cost advantage — Gemini batch gives everyone 50% off; you can’t undercut Google
  • Twelve Labs has $107M and already launched an MCP server for agent video understanding
  • Apify already has Video-to-Text MCP[16] on their marketplace
  • Cache = stale — social media content is ephemeral; a 24hr-old TikTok analysis is already outdated for trend research
  • Platform risk — TikTok scraping is legally grey (ToS violation)[21]; one API shutdown kills the product
  • Vision costs are falling 10x/year — your margin gets compressed every quarter
  • No network effects — a video processing API is a commodity; customer switches when cheaper option appears

The “Pre-Indexed Library” Counter-Argument

The strongest version of the bull case is: “We process ALL popular TikToks/Reels proactively, build a searchable library, and customers query it instantly instead of waiting for batch processing.”

This is actually a different business. A pre-indexed library of video content = a social media intelligence platform, not a processing API. That’s what Plot ($4.1M raised[13]), Brandwatch (US$450M+ revenue), and Sprinklr (NASDAQ: CXM, US$734M revenue) already do. You’d be competing with enterprise SaaS companies, not selling a developer API.

The Real Question: Why Not Just Use Gemini Directly?

This is the question every potential customer will ask. Your answer needs to be one of:
  • Convenience: you already run the scraping, download, and processing pipeline they’d otherwise have to build and maintain
  • Speed: a pre-indexed library answers instantly instead of a 24hr batch wait
  • Resilience: you absorb the fragile TikTok/IG scraping layer and its operational cost
  • Intelligence: you sell analyzed, trend-tagged insight, not raw text

VI. Pattern Matching: Who Succeeded, Who Failed

Succeeded (But Different Business)

| Company | Model | Revenue/Scale | Why It Worked | Why It Doesn’t Apply |
|---|---|---|---|---|
| Twelve Labs | Video understanding API | $107M raised, 30K users[6] | Built proprietary models before Gemini existed. First-mover. | Had $107M to build custom models. You’d be reselling Gemini. |
| Roboflow | Computer vision platform | $25M raised[22] | Batch video inference 100x cheaper than real-time. Developer tools. | They own the inference infrastructure. You’d be a reseller. |
| Plot | Video social listening SaaS | $4.1M seed, Warby Parker/Visa clients[13] | Sells insights to brand teams, not raw API to developers. | They sell a SaaS dashboard at $X00/mo, not per-video API. |
| Brandwatch | Social listening enterprise | US$450M+ revenue | 20+ year head start. Enterprise sales. Image/video monitoring is a feature, not the product. | You can’t sell enterprise SaaS solo. |

The Playbook That Works: Vertical Intelligence, Not Horizontal API

Every success in this space sells insights to a specific buyer, not raw processing to developers:
  • Plot: trend insights for brand marketing teams[13]
  • Brandwatch and Sprinklr: social monitoring for enterprise comms teams
  • TikAlyzer[23] and HookScan[24]: hook and retention analysis for individual creators

Nobody succeeded selling raw video-to-text as a horizontal API. The ones that survived either built proprietary models (Twelve Labs) or sold vertical intelligence (Plot, Brandwatch).


VII. Founder-Contextualized GTM

Eric’s Assets

Eric’s Constraints

What Eric should actually do: build the TikTok/IG skill for PCRM.
  • The real need is a PCRM research skill, not an API business
  • Build tiktok-research/SKILL.md and ig-reels-research/SKILL.md
  • Use Gemini 2.5 Flash batch directly ($1.67/1K videos — good enough quality, sustainable cost)
  • Gemini ingests the audio track alongside the frames (~500 tokens/video), so one call covers transcription and scene description; add Whisper ($0.006/min) only where a verbatim transcript matters
  • Total cost per research query (100 videos): ~$0.20 with Gemini alone, ~$0.60 if a Whisper pass is added. Sustainable at $100 HKD/run.
  • This is a tool for Donna, not a product to sell
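
Under those assumptions, the per-query cost works out roughly like this — a sketch of 100 videos at the 43-second average, with Gemini batch covering both frames and audio and Whisper as an optional add-on:

```python
# Per-research-query cost for the proposed PCRM skill (USD, article's prices).
VIDEOS = 100
TOKENS_PER_VIDEO = 11_600           # 43 frames x 258 tokens + ~500 audio tokens

gemini_in = VIDEOS * TOKENS_PER_VIDEO / 1e6 * 0.15   # batch input: ~$0.17
gemini_out = VIDEOS * 300 / 1e6 * 1.25               # ~300 output tokens: ~$0.04
whisper = VIDEOS * 43 / 60 * 0.006                   # optional verbatim pass: ~$0.43

base_query = gemini_in + gemini_out                  # ~$0.21, Gemini alone
full_query = base_query + whisper                    # ~$0.64 with Whisper added
```

Either figure is negligible against a $100 HKD/run price, which is the point: the skill is sustainable without any economies-of-scale machinery.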

VIII. Red Team: Steel-Manning the Business Case

The strongest version of this business is NOT a generic video-to-text API. It’s one of these:

Shape A: “TikTok Research MCP” (Agent Tool)

An MCP server that agents can call to search TikTok/IG by topic and get structured insights. Not raw text — analyzed, trend-identified, sentiment-tagged content. Apify’s Video-to-Text MCP[16] does the raw version; this does the intelligence version. Price: $0.05–0.10/query (bundle of 50–100 videos analyzed). Margin: 60–70% if you use Gemini batch + Whisper.

Problem: It’s an MCP tool, not a business. MCP tools are meant to be free/cheap. The Twelve Labs MCP server[12] is already available. You’d be competing with free.

Shape B: “Pre-Indexed Social Video Library” (SaaS)

Continuously process the top 100K TikToks/Reels per day. Build a searchable database. Sell to brand teams at $199–499/mo (like Videolyze[15]). This is the Plot playbook at 1/100th the scale.

Problem: Processing 100K videos/day costs US$167/day in vision COGS alone (Gemini 2.5 Flash batch), roughly US$5,200/mo before scraping and infrastructure. You’d need 11–26 paying customers at $199–499/mo just to break even. Harder than it looks, and it’s a SaaS sales grind, not a build-and-sell-to-agents play.
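
Shape B’s break-even under the article’s assumptions (vision COGS only; scraping and infra would push these numbers higher):

```python
import math

# Shape B break-even: 100K videos/day on Gemini 2.5 Flash batch (USD).
VIDEOS_PER_DAY = 100_000
VISION_PER_VIDEO = 0.00167          # per-video vision cost, from the cost table

daily_cogs = VIDEOS_PER_DAY * VISION_PER_VIDEO   # ~$167/day
monthly_cogs = daily_cogs * 30                   # ~$5,010/mo

# Customers needed at the Videolyze-style price band to cover COGS:
needed_at_499 = math.ceil(monthly_cogs / 499)    # ~11
needed_at_199 = math.ceil(monthly_cogs / 199)    # ~26
```

Eleven enterprise-ish logos before you cover the Gemini bill alone, with margin, sales, and scraping costs still on top.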

Shape C: “Donna’s Eyes” (Internal Feature)

Build TikTok/IG video understanding as a Donna feature. Use it for Eric’s own research runs, Vincent’s opportunity screens, client reports. The “product” is the research output (charged at $100 HKD/run), not the video processing layer.

This is the only version that makes sense for Eric today. It directly improves the #1 priority (Donna), doesn’t require marketing a developer API, and the “customer” already exists (Eric himself + Vincent + future research clients).


Verdict

Don’t build the API business. Build the skill.

The “economies of scale” thesis is broken by a structural fact: Google gives everyone the same 50% batch discount. Gemini 2.5 Flash batch processes 1,000 videos for US$1.67. GPT-4.1 costs $17/1K, Claude Sonnet 4 costs $33/1K — the premium models are meaningfully expensive, but the floor is accessible to anyone with an API key. And that floor drops ~10x per year.

Every successful company in video intelligence either (a) built proprietary models before Gemini existed (Twelve Labs, $107M), or (b) sells vertical insights to specific buyers (Plot, Brandwatch), not horizontal processing APIs to developers.

The real opportunity is obvious and already in front of you: build TikTok and IG Reels as PCRM research skills. Use Gemini 2.5 Flash batch for vision and audio, with Whisper as an optional verbatim-transcript pass. Cost: roughly $0.20–0.60 per 100-video research query. Easily sustainable at $100 HKD/run. This directly upgrades Donna (#1 priority), makes every future client research report richer, and doesn’t require marketing a new product.

The value isn’t in the processing. It’s in the judgment — knowing which 100 TikToks to analyze, what questions to ask of them, and how to synthesize the signal into a decision. That’s what clients pay $100 HKD/run for. That’s Donna’s moat. Not a commodity API.


References

[1] OpenAI Developer Forum: Discussion on vision API costs for video processing, 2025
[2] TTS Vibes: Average TikTok video length 42.7 seconds in 2024, up 9.5% from 39s in 2023
[3] S. Anand: Gemini processes video at 1 FPS, 258 tokens per frame at low resolution
[4] Research and Markets: AI Video Analytics Market, US$6.19B (2026) to US$17.24B (2031), 22.74% CAGR
[5] Industry Today: Social Media Listening Market, US$10.46B (2025) to US$25.69B (2032), 13.7% CAGR
[6] Parsers.vc: Twelve Labs, $107M total raised including $30M Dec 2024 from Databricks, Snowflake, In-Q-Tel
[7] Google AI: Gemini API pricing. Batch 50% off, context caching 90% off. Gemini 2.5 Flash: $0.30/M input, $2.50/M output. Gemini 3 Flash: $0.50/M input, $3.00/M output.
[8] StrathWeb: GPT-4o-mini vision, 2,833 tokens/frame (fixed), making it 2x more expensive than GPT-4o for bulk image processing
[9] Twelve Labs Pricing: Marengo video indexing $0.042/min. Analyze/Summarize $0.021/min.
[10] Google Cloud Video Intelligence: Label detection $0.10/min, text detection $0.15/min, speech $0.048/min. First 1K min/mo free.
[11] SocialKit: TikTok Transcript API, $13/mo for 2,000 requests. Multi-platform, timestamped segments.
[12] Twelve Labs Blog: Twelve Labs MCP Server enables LLM agents to index, search, summarize video via a standardized interface
[13] Adweek: Plot raises $4.1M seed from Alexis Ohanian. AI video social listening for TikTok/Reels/Shorts.
[14] Tracxn: VideoDB, unfunded, founded 2023 SF. Acquired Devzery. Video infrastructure + MCP agent toolkit.
[15] Videolyze: AI video analysis for TikTok/YouTube. Free–$499/mo. Analysis-day pricing model.
[16] Apify (NextAPI): Video-to-Text MCP server converts videos from 1,000+ platforms to text. 12+ languages. Already agent-native.
[17] Apify (InVideoIQ): Video Transcriber, $35/1K results. Speech-to-text across IG, FB, X, TikTok. 99+ languages.
[18] SociaVault: Instagram Transcript API, speech-to-text for Reels, segmented timestamps, language detection.
[19] Supadata: Video Transcript API, $9/1K credits. YouTube + TikTok. Free 100 credits/mo.
[20] Carl Pearson: OpenAI Whisper, $0.006/min (~$0.36/hr) for speech transcription.
[21] SociaVault: TikTok scraping legally permissible (hiQ v. LinkedIn, 2022) but ToS directly prohibit automated collection.
[22] Roboflow: Batch video inference up to 100x cheaper than real-time via GPU batching optimizations.
[23] TikAlyzer: AI TikTok analysis, frame-by-frame hook analysis, retention prediction, viral element detection.
[24] HookScan: AI-powered video hook analysis, first 3–5 seconds optimization for TikTok/Shorts/Reels.
[25] PricePerToken: Gemini 2.5 Pro, $1.25/M input, $10.00/M output. 50% batch discount. 1M context window.
[26] PricePerToken: GPT-4.1, $2.00/M input, $8.00/M output. ~772 tokens per image frame. 1M context window.
[27] PricePerToken: Claude Sonnet 4, $3.00/M input, $15.00/M output. No native video; frame extraction only. 50% batch discount.