Private Inference as a Service

The $0.89/M → $12/M margin hiding in regulated industries
22 Feb 2026 — Deep Investment Research

I. TL;DR + Edge Summary

  • Commodity Floor: $0.89/M (DeepSeek V3 blended)
  • Private Market Rate: $8–25/M (regulated industry average)
  • Local A100 Cost: ~$13/M (at 50% utilization)
  • HIPAA Breach Avg: $10.9M (IBM Cost of a Data Breach 2023)
  • Harvey AI Per Firm: $30–100K (BigLaw annual contract)
  • Private AI Market: $8–15B (2026E, regulated sectors)
  • Compliance Setup: $50–200K (SOC2 + HIPAA BAA)
  • Verdict: BUILD IT (operational, not passive)

Verdict: OPERATIONAL INVESTMENT — High Conviction for Eric Specifically

The AI inference commodity floor has collapsed to $0.33–0.89/M blended tokens. But regulated industries — healthcare, legal, financial services, defense — legally cannot route sensitive data through shared cloud inference. They pay $8–25/M for private/dedicated inference, and $20–80/M for fully managed private deployments. The product is not tokens; it is the contractual guarantee of data isolation.

The margin exists and is real. A single A100 80GB running Llama 3.1 70B at 50% utilization costs ~$13/M to operate; at 80% utilization with data-center power rates, cost drops to ~$8.6/M (see the cost model in Section V). Healthcare customers pay $8–25/M, so mid-band pricing of $15/M clears cost comfortably. Legal customers pay $20–80/M; the margin there is roughly $11–71 per million tokens.

Reformed verdict: Thesis holds but this is an OPERATIONAL investment (build the service), not a FINANCIAL investment (buy equity in incumbents). The equity plays are already priced in. The real edge is in building the compliance infrastructure and signing BAAs before large cloud providers commoditize private compliance tier pricing. Not for passive investors. High potential for Eric specifically as a founder with technical + AI-native background.

II. The $0.14/M Myth — Pricing Reality

The $0.14/M DeepSeek figure circulating in AI Twitter is a cherry-picked input-only, off-peak/cache-hit price. It does not reflect a real agentic workload. Here is what blended costs actually look like at a 1:3 input-to-output ratio, which is conservative for most production use cases.

The $0.14/M number is not usable in production math. It is DeepSeek V3's input-only, off-peak, cache-hit price. A real agentic workload with a 1:3 input:output ratio at standard rates costs $0.89/M blended, 6.4x higher. Any business model built on $0.14/M as a cost floor is off by more than 6x before you add orchestration overhead, retry logic, and context-window waste.

Complete API Pricing Landscape

Provider / Model | Input $/M | Output $/M | Blended (1:3) | Privacy
Gemini 2.0 Flash | $0.10 | $0.40 | $0.33 | Shared
DeepSeek V3 off-peak | $0.135 | $0.55 | $0.45 | Shared
GPT-4o mini | $0.15 | $0.60 | $0.49 | Shared
DeepSeek V3 standard | $0.27 | $1.10 | $0.89 | Shared
Groq Llama 3.1 70B | $0.59 | $0.79 | $0.74 | Shared
Gemini 1.5 Pro | $1.25 | $5.00 | $4.06 | Shared
GPT-4o | $2.50 | $10.00 | $8.13 | Shared
Claude Sonnet | $3.00 | $15.00 | $12.00 | Shared
Azure Confidential GPU | ~$4–8 | ~$4–8 | ~$5–10 | Private (HIPAA BAA)
Bedrock Provisioned | ~$8.50/hr throughput | — | ~$4.72+ | Private (BAA)
Dedicated hosted | $20–80 | $20–80 | $20–80 | Fully private
Self-managed A100 80GB | ~$12 | ~$12 | ~$12 | Air-gapped
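The blended column follows from the 1:3 input:output weighting. A quick sketch (prices hardcoded from the table above) reproduces it and quantifies how far off the $0.14/M figure is:

```python
def blended_price(input_per_m: float, output_per_m: float,
                  in_parts: int = 1, out_parts: int = 3) -> float:
    """Blended $/M tokens at a given input:output token ratio."""
    total = in_parts + out_parts
    return (in_parts * input_per_m + out_parts * output_per_m) / total

# DeepSeek V3 standard: $0.27 in / $1.10 out -> ~$0.89/M blended
deepseek_std = blended_price(0.27, 1.10)
print(f"${deepseek_std:.2f}/M")            # $0.89/M
# The cherry-picked $0.14/M figure understates real blended cost ~6.4x
print(f"{deepseek_std / 0.14:.1f}x")       # 6.4x
```

The same function reproduces every Shared-tier row, e.g. GPT-4o at `blended_price(2.50, 10.00)` = $8.13/M.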

The Pricing Gap Visualized

  • DeepSeek V3 std: $0.89/M
  • Healthcare floor: $8/M
  • A100 local cost: $12/M
  • Healthcare ceiling: $25/M
  • Legal dedicated: $80/M

The critical insight: regulated industries cannot use shared-cloud inference for sensitive data regardless of price. Even if DeepSeek offered inference at $0/M on a shared multi-tenant server, a HIPAA-covered entity cannot legally route protected health information through it. The regulatory prohibition is absolute — pricing compression in the commodity tier does not affect the private inference market in the short to medium term. The risk of compression comes specifically from Azure/AWS reducing their own private-tier pricing, not from open-source or Chinese model providers dropping API rates.

III. Critical Assessment of the Thesis

The core claim: a private inference service targeting regulated industries can sustain $8–25/M pricing while delivering on hardware that costs ~$13/M to operate at 50% utilization. Before accepting the thesis, stress-test every assumption.

Hidden Assumptions

Assumption 1: Regulated customers will pay $8–25/M rather than use Azure/AWS Confidential. Azure OpenAI with a HIPAA BAA exists at ~$4–8/M, so why use a small provider? Answer: Azure requires a Microsoft enterprise agreement (minimum $100K/year in cloud spend), months of negotiation, and giving Microsoft visibility into your AI usage patterns. A nimble private provider can undercut Azure's confidential GPU tier on price and deliver in days, not months. The mid-market health system (200–500 providers) is not Azure's target customer but is the private inference provider's sweet spot.
Assumption 2: Compliance credentials are achievable at startup scale. A HIPAA BAA is a legal agreement, not a certification; it can be drafted for $5–10K in legal fees. SOC2 Type II is the harder lift: $20–50K for the readiness assessment, a 6–12 month audit period, and $15–30K for the audit itself. Total compliance setup: $50–200K and 9–18 months. This is real but finite, and once crossed, the barrier becomes a moat.
Assumption 3: Open-weight model quality is acceptable. Llama 3.1 70B is good but not Claude/GPT-4o-level. For clinical use, hallucination risk in a radiology report summary or prior authorization decision is existential, both for patients and for liability. Current state: ambient clinical notes (Nuance DAX Copilot) use GPT-4o on Azure dedicated infrastructure, not Llama. For legal contract review (Harvey AI), model quality is closer: Llama 70B with RAG retrieval can match GPT-4o on structured document analysis. The healthcare quality gap is the largest near-term risk to the thesis.
Assumption 4: Commodity prices won't drop below local infrastructure cost. If Azure/AWS drops private-tier pricing to $3–4/M while offering a HIPAA BAA, the margin collapses. This is the existential risk. At 50% GPU utilization, local A100 cost is ~$13/M; at 80% utilization with data-center power, ~$8.6/M. Azure Confidential GPU is currently $5–10/M. If Microsoft drops to $3/M for healthcare customers, small private providers cannot compete on price. Defenses: (a) specialized tooling (audit trail, RAG integration, EHR connectors) that Azure doesn't build; (b) lock-in via deep integration rather than raw token delivery.
Assumption 5: Customer retention is sticky. Once a health system builds a RAG pipeline on your private inference stack (ingesting EHR data, custom prompts, fine-tuned retrieval), switching costs are enormous. EHR integration is a 12–18 month project. This is genuinely sticky, comparable to switching core banking software, and the stickiness is the primary defense against commoditization.

Adversarial Challenges

Against the Thesis

  • Azure and AWS already offer HIPAA-compliant AI with BAAs. Enterprise buyers trust trillion-dollar vendors.
  • Healthcare IT sales cycles are 12–24 months. Most startups run out of runway before closing the first health system deal.
  • Healthcare IT buyers are incentivized toward “safe” big vendors (IBM, Azure, Epic-certified). “Nobody got fired for buying IBM.”
  • Compliance legal team at the target customer will ask for certifications that take months to acquire — a startup will fail those checks in year one.
  • Open-weight model quality gap in clinical settings is a patient safety issue, not just a business risk.
  • Dedicated GPU hardware is capital-intensive. A100 80GB cards cost $10–15K used; H100s are $25–40K. Scaling requires significant upfront hardware spend.

For the Thesis

  • Azure requires $100K+ enterprise agreements. Mid-market health systems (200–500 providers) are underserved.
  • A nimble provider can go from signed BAA to live inference in 2–4 weeks. Azure timelines are 6–12 months.
  • Healthcare CIOs who have been burned by Epic cost overruns are increasingly open to best-of-breed point solutions.
  • Harvey AI ($30–100K/firm), Nabla ($1M ARR Series A), Abridge (Series B) prove the market will pay for specialized private AI.
  • Legal market (Am Law 200) has shorter sales cycles (6–12 months) and higher willingness to pay than healthcare.
  • EHR integration stickiness means low churn once embedded. LTV is 5–10x enterprise SaaS equivalents.

Reformed Verdict

Thesis Holds — Build It, Don’t Buy It

The margin exists and is durable in the medium term. The primary risks are operational (sales cycle length, compliance cost, model quality) not structural. This is not a passive investment — buying MSFT or PLTR equity does not capture this opportunity. The edge is in building the compliance infrastructure and signing the first 3–5 BAAs before the window closes.

For Eric specifically: technical + AI-native background is the prerequisite. Most competitors entering this space are either (a) large cloud providers who are slow and expensive, or (b) enterprise software companies without AI-native roots who are building on top of Azure rather than operating raw infrastructure. The gap is in the middle — a technically capable founder who can build private inference infrastructure, sign compliance paperwork, and sell to healthcare CIOs directly.

IV. Market Landscape — Who Pays, What They Pay, and Why

Healthcare: The Primary Target

Healthcare is the highest-urgency, highest-value private inference market. HIPAA is not optional: penalties run up to roughly $1.9M per violation category per year, and the average healthcare data breach costs $10.9M (IBM Cost of a Data Breach 2023, healthcare sector). The economic case for paying $15/M for private inference vs. $0.89/M on shared infrastructure is immediate and quantifiable.

Use Case | Market Example | Price Point | Notes
Ambient clinical notes | Nuance DAX Copilot | $100/provider/month | GPT-4o on Azure dedicated. Proves the model quality threshold.
Prior auth automation | Waystar, Cohere Health | $5–20/M tokens | High volume, structured data. Llama 70B viable.
Radiology report summarization | Rad AI, Nuance | $15–30/M tokens | Specialized model fine-tuning required. Premium pricing justified.
Clinical documentation | Abridge, Suki | $50–150/provider/month | Subscription model. Per-provider, not per-token.
Realistic First Customer Math
A 500-provider health system using private inference for ambient clinical notes at $50/provider/month = $25,000/month MRR ($300K ARR). Token volume: approximately 50M tokens/month (500 providers × ~2 notes/day × ~2,000 tokens/note). At $15/M blended, that is $750/month in inference cost against $25,000/month in revenue. Gross margin: ~97% before staff and compliance overhead.
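The first-customer math above checks out in a few lines. A minimal sketch; the working-days figure is an assumption chosen to match the ~50M tokens/month estimate:

```python
providers = 500
sub_per_provider = 50.0        # $/provider/month subscription
notes_per_day = 2
tokens_per_note = 2_000
working_days = 25              # assumption: ~25 clinic days/month

mrr = providers * sub_per_provider                                   # $25,000
tokens_m = providers * notes_per_day * tokens_per_note * working_days / 1e6
inference_cost = tokens_m * 15.0                                     # $15/M blended
gross_margin = (mrr - inference_cost) / mrr

print(f"MRR ${mrr:,.0f}, {tokens_m:.0f}M tokens, "
      f"inference ${inference_cost:,.0f}, margin {gross_margin:.0%}")
# MRR $25,000, 50M tokens, inference $750, margin 97%
```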

Legal: Faster Sales Cycle, Higher Price Tolerance

Attorney-client privilege may be waived if data goes through a third-party shared server. This is not settled law, but the risk is real enough that BigLaw firms are building private AI deployments. Harvey AI charges $30,000–100,000+ per year per firm. The Am Law 200 (the top 200 U.S. law firms by revenue) represents 200 potential customers at $50–100K ARR each — a $10–20M ARR market from the top tier alone.

Use Case | Volume Estimate | Price Tolerance | Model Quality Required
Contract review | Medium (10–100M tokens/year per firm) | $20–50/M | High — precision matters
Discovery search | High (100M–1B tokens per large case) | $10–25/M | Medium — recall over precision
Case memo drafting | Low (1–10M tokens/year per firm) | $50–100/M | Very high — partner review required

Financial Services: Regulatory Moat + Audit Trail

SEC Rule 17a-4 requires broker-dealers to retain all business-related electronic communications for 6 years with an audit trail. If AI inference logs are part of a trading recommendation workflow, they must be auditable. Shared multi-tenant inference makes this nearly impossible to prove. Morgan Stanley’s OpenAI deployment runs on a dedicated Azure instance estimated at $5–15M/year. Smaller RIAs (registered investment advisors) and hedge funds cannot afford Morgan Stanley-scale deployments but face the same regulatory requirements — the opportunity is $500K–2M/year enterprise contracts for private inference with audit trail tooling.

The Audit Trail Differentiator
No shared-cloud inference provider (including Azure OpenAI) ships a fully auditable inference log that satisfies SEC 17a-4 requirements out of the box. A private inference service that records every inference call's input hash, output hash, and timestamp in a cryptographically signed, immutable log creates a compliance artifact with standalone value. This is a $50–100K add-on to the base inference subscription.
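One common way to build such an immutable log is a hash chain: each record commits to the digest of the previous record, so editing any entry invalidates every later one. A minimal sketch; the class and field names are illustrative, not a 17a-4 compliance standard:

```python
import hashlib
import json
import time

def _digest(record: dict) -> str:
    # Canonical JSON so the digest is stable across serializations
    return hashlib.sha256(json.dumps(record, sort_keys=True).encode()).hexdigest()

class AuditLog:
    """Append-only, hash-chained inference log (illustrative sketch)."""
    def __init__(self):
        self.entries = []

    def record(self, customer: str, prompt: str, output: str) -> dict:
        entry = {
            "ts": time.time(),
            "customer": customer,
            "input_sha256": hashlib.sha256(prompt.encode()).hexdigest(),
            "output_sha256": hashlib.sha256(output.encode()).hexdigest(),
            # Each entry commits to its predecessor's digest
            "prev": _digest(self.entries[-1]) if self.entries else "genesis",
        }
        self.entries.append(entry)
        return entry

    def verify(self) -> bool:
        # Recompute the chain; an edited entry breaks all successors
        prev = "genesis"
        for e in self.entries:
            if e["prev"] != prev:
                return False
            prev = _digest(e)
        return True

log = AuditLog()
log.record("acme-health", "summarize this chart", "summary text")
log.record("acme-health", "draft prior auth", "auth draft")
assert log.verify()
log.entries[0]["customer"] = "tampered"
assert not log.verify()
```

In production you would additionally sign each entry (HMAC or an asymmetric key) and anchor the latest chain head externally, since tampering with the final, successor-less entry is not caught by chain recomputation alone.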

V. The Cost Model — Local Infrastructure Math

Base Case: A100 80GB at 50% Utilization

Cost Item | Monthly | Notes
Hardware (used A100 80GB) | $333 | $12,000 purchase price, 3-year straight-line depreciation
Electricity | $21.60 | 500W × 50% utilization × $0.12/kWh × 720 hrs
Colocation / rack | $150 | Single-U colocation slot in shared data center
Staff overhead | $1,250 | 0.1 FTE (10% of a senior engineer at $150K/year)
Networking / egress | $50 | ~500GB outbound at $0.10/GB
Monitoring / observability | $30 | Datadog or equivalent at startup tier
Total monthly | ~$1,835 |

Token throughput: Llama 3.1 70B on an A100 80GB (weights quantized to 8-bit; FP16 weights alone exceed 80GB) delivers approximately 25–35 tokens/second per request at batch size 1. With continuous batching, aggregate throughput reaches roughly 180 tokens/second = ~648,000 tokens/hour = ~466M tokens/month theoretical maximum. At 50% utilization of that capacity: ~233M tokens/month serviceable. Minus queuing overhead and batch inefficiency: ~130–150M realistic tokens/month.

Cost per million tokens at 50% utilization: $1,835 / 140M ≈ $13.11/M.

Optimized Case: 80% Utilization + Data Center Power

Cost Item | Monthly | Notes
Hardware depreciation | $333 | Unchanged
Electricity | $17.28 | 500W × 80% × $0.06/kWh × 720 hrs (data-center power rate)
Colocation | $150 | Unchanged
Staff overhead | $1,250 | Unchanged (fixed regardless of utilization)
Networking | $75 | Higher volume at 80% utilization
Monitoring | $30 | Unchanged
Total monthly | ~$1,855 |

Token throughput at 80% utilization: ~208–224M tokens/month.

Cost per million tokens: $1,855 / 216M ≈ $8.59/M.
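Both cost-per-token figures follow mechanically from the two tables. A quick sketch with the table values hardcoded; the aggregate throughput rate is inferred from the ~648,000 tokens/hour figure in the base case:

```python
def cost_per_million(monthly_cost: float, tokens_millions: float) -> float:
    """Operating cost in $ per million tokens served."""
    return monthly_cost / tokens_millions

# Throughput ceiling (assumes ~180 tok/s aggregate with continuous batching)
theoretical_m = 180 * 3600 * 720 / 1e6       # ~466M tokens/month

base = cost_per_million(1_835, 140)          # 50% util, ~140M realistic tokens
optimized = cost_per_million(1_855, 216)     # 80% util, ~216M realistic tokens
print(f"${base:.2f}/M at 50% util, ${optimized:.2f}/M at 80% util")
# $13.11/M at 50% util, $8.59/M at 80% util
```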

Margin Analysis by Segment

  • A100 cost (50% util): $13.11/M
  • A100 cost (80% util): $8.59/M
  • Healthcare floor: $8/M
  • Healthcare mid: $15/M
  • Legal floor: $20/M
  • Legal ceiling: $80/M
The Margin Squeeze at Healthcare Floor Pricing
At 50% utilization ($13.11/M cost), the healthcare floor price of $8/M is below cost. This is the critical constraint. To be profitable in healthcare, you need either: (a) 80%+ utilization to bring cost below $9/M, (b) pricing at $15 per million tokens or above, or (c) a setup fee / retainer that covers fixed costs regardless of token volume. The safest model is a hybrid: a $10–20K setup fee plus $15/M per-token pricing, which ensures break-even even at 50% utilization.
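The squeeze and the hybrid fix can be checked numerically. A sketch using the base-case operating cost (~$1,835/month, ~140M tokens at 50% utilization); the $3K monthly minimum is one of the suggested commitment values, not a quoted price:

```python
def monthly_margin(tokens_m: float, price_per_m: float,
                   monthly_cost: float = 1_835.0,
                   monthly_minimum: float = 0.0) -> float:
    """Monthly profit at a given token volume (millions) and $/M price."""
    revenue = max(tokens_m * price_per_m, monthly_minimum)
    return revenue - monthly_cost

# Pure per-token at the $8/M healthcare floor, 50% utilization: underwater
print(monthly_margin(140, 8))                          # -715.0
# $15/M at full 50%-util volume: thin but positive
print(monthly_margin(140, 15))                         # 265.0
# A $3K monthly minimum keeps low-volume early months above cost
print(monthly_margin(30, 15, monthly_minimum=3_000))   # 1165.0
```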

Two-Node Scale: Breaking Even

With 2 A100 80GB nodes at 70% average utilization and $15/M healthcare pricing:
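A rough sketch under the same cost structure as the single-node tables. Two labeled assumptions: realistic tokens per node at 70% utilization is interpolated between the 50% (~140M) and 80% (~216M) cases, and staff overhead is shared across nodes rather than doubled:

```python
nodes = 2
tokens_per_node_m = 190        # assumption: ~190M/month at ~70% utilization
price_per_m = 15.0             # healthcare mid-band pricing

# Per-node costs from the tables above; staff ($1,250) shared across nodes
hardware = 333 * nodes
electricity = 0.5 * 0.70 * 0.06 * 720 * nodes   # kW × util × $/kWh × hrs
colo = 150 * nodes
staff = 1_250
network = 75 * nodes
monitoring = 30 * nodes
cost = hardware + electricity + colo + staff + network + monitoring

revenue = nodes * tokens_per_node_m * price_per_m
print(f"cost ${cost:,.0f}/mo, revenue ${revenue:,.0f}/mo, "
      f"margin {(revenue - cost) / revenue:.0%}")
# cost $2,456/mo, revenue $5,700/mo, margin 57%
```

Spreading the fixed staff cost over two nodes is what pushes the gross margin from break-even territory to roughly 57% at $15/M.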

VI. Investment Vehicles — Build vs Buy Analysis

Operational Play: Build the Service

This is the primary recommendation. The margin exists, the compliance barrier is finite, and the stickiness of EHR-integrated private inference creates high LTV. Here is the minimum viable build plan:

Phase | Timeline | Cost | Milestone
Phase 0: Infrastructure | Month 1 | $24–30K hardware | 2 × A100 80GB live in colocation, Llama 3.1 70B serving via vLLM
Phase 1: Compliance baseline | Months 1–3 | $10–20K legal + audit | HIPAA BAA template signed with first customer; SOC2 readiness assessment started
Phase 2: First customer | Months 3–9 | $15–20K sales + setup | 500-provider health system or Am Law 100 firm live on private inference
Phase 3: SOC2 Type II | Months 9–18 | $50–80K total | SOC2 Type II attestation complete; opens enterprise accounts blocked on certification
Phase 4: Scale | Months 12–24 | Revenue-funded | 5–10 customers, $100–300K ARR, 2–4 GPU nodes
Revenue Model Recommendation
Hybrid pricing minimizes utilization risk: a $10–25K setup fee (covers compliance integration, EHR connector build, BAA legal review) plus $12–20/M blended token pricing. A monthly minimum commitment ($3–5K/month) ensures fixed-cost coverage even at low initial volume. This structure is standard in enterprise AI infrastructure (cf. Fireworks AI, Together AI enterprise tiers).

Equity Plays: Invest in Existing Companies

Key Verdict on Equity Plays: Already Priced In
Every public company with material exposure to private enterprise AI inference is trading at valuations that already reflect the opportunity: PLTR at ~80x revenue, MSFT at ~32x earnings, NVDA at ~35x forward earnings. There is no edge in buying these at current prices unless you are making a broader AI infrastructure bet. The private inference opportunity is in the operational gap, not the equity market.
Vehicle | Conviction | Action | Why
Build private inference service | High Conviction | Consider founding | Technical + AI-native edge. Compliance barrier = moat once crossed. Eric-specific advantage.
A100 80GB infrastructure | Conviction | Buy for service delivery | The picks-and-shovels play inside the operational bet. $12K/unit used, 3-year payback at modest utilization.
MSFT equity (indirect) | Monitor | No action needed | Azure Confidential Computing captures the largest enterprise private AI spend, but too diluted at current valuation to be actionable as a private inference thesis.
PLTR equity | No Edge | Skip | Defense AI / AIP narrative is priced at 80x revenue. The private inference thesis is already in the stock.
NVDA equity | No Edge | Skip | Picks-and-shovels for all of AI, not private inference specifically. Priced in at a $3T market cap.
Harvey AI (private) | Monitor | No access, no action | Not public. Proves the legal AI thesis. Series C at a $3B valuation. Check-write access only at family-office scale.

The Scale AI / Defense Angle

Scale AI won a $250M DoD contract and is building private inference for classified environments. This is the ceiling case for the private inference market: air-gapped, classified, on-premise, fully managed. Defense contracts are $20–100M/year with 5-year terms. Scale AI is pre-IPO; retail access is effectively nonexistent. The defense angle validates the thesis but is not actionable for Eric at this stage. It does confirm that the “private inference as infrastructure” framing is where the largest contracts eventually flow.

VII. Risk Factors

Risk 1: Compliance Certification Cost and Timeline

Barrier or Moat, Depending on Which Side You Are On
SOC2 Type II + HIPAA BAA infrastructure = $50–200K and 9–18 months. This is the real barrier to entry: a startup that has not yet completed SOC2 Type II cannot close large health system contracts. The same barrier that slows you down in year one becomes your moat in year two. Strategy: start the SOC2 readiness audit on day one, even before the first customer, and use its findings to credibly tell prospects "we are SOC2 Type II in-process with expected completion Q4 2026."

Risk 2: Enterprise Healthcare Sales Cycle

Cash Flow Management Is the Primary Operational Risk
Healthcare IT procurement cycles run 12–24 months; even a signed LOI does not guarantee revenue within 12 months. A startup must plan for 18 months of runway before the first material revenue. At a $20K/month burn rate (two founders, hardware, legal), that is $360K in required capital before revenue. Mitigation: sell to legal firms first (6–12 month cycles) and use that revenue to fund the healthcare sales process.

Risk 3: API Commodity Price Collapse Below Local Infrastructure Cost

This is the structural existential risk. If Azure/AWS drops private-tier pricing to $3–4/M with HIPAA BAA included, the economic case for a boutique private inference provider collapses. Current assessment: low probability in the 24-month window. Azure’s HIPAA-compliant AI tier is currently $4–8/M and requires enterprise agreements. Microsoft has not demonstrated willingness to compete on price at the mid-market health system level. The risk is real but 3–5 years out.

Risk 4: Open-Weight Model Quality Gap in Clinical Settings

Llama 3.1 70B has a measurable hallucination rate on clinical document tasks. For prior authorization automation (low-risk, high-volume), this is manageable with confidence scoring and human review. For ambient clinical note (high-stakes, provider-facing), hallucinations are a patient safety and liability issue. Mitigation: start with structured data tasks (prior auth, claims, coding assistance) where Llama 70B is competitive; do not pitch Llama for open-ended clinical note until fine-tuned clinical models (e.g., Meditron, BioMedLM) mature further.

Risk 5: Concentration Risk at Early Stage

Losing one health system customer at 2-customer scale = 50% revenue loss. This is the single-customer dependency problem inherent in enterprise B2B. Mitigation: diversify across verticals (legal + healthcare, not just healthcare) from the start. Aim for 5 customers generating $10–20K/month each rather than 2 customers generating $25K/month each.

Risk 6: Chinese Model Regulatory Risk

DeepSeek is a Chinese company. Regulated industries (especially defense, federal health agencies, and many state health systems) have explicit prohibitions on data processing by entities subject to Chinese law. Even if DeepSeek operated private infrastructure in the U.S., the perception risk may be disqualifying. This is not a risk to the private inference thesis — it is an opportunity. A U.S.-incorporated, U.S.-staffed private inference provider running Llama (Meta, U.S.) or Mistral (French) is explicitly positioned against this risk.

VIII. Signals + Recommendations

Investment Signal Summary

Vehicle | Conviction | Action | Why
Build private inference service (operational) | HIGH CONVICTION | Consider founding | Technical + AI-native edge. Compliance barrier = moat. Eric-specific advantage in an AI-native founder market.
2 × A100 80GB (infrastructure) | CONVICTION | Buy for service delivery | The hardware is the minimal viable asset. $24K all-in, 3-year payback at $15/M healthcare pricing at 70% utilization.
Start SOC2 readiness audit immediately | CONVICTION | Operational prerequisite | The 9–18 month clock starts at day zero. Being in-process unlocks enterprise sales conversations that are otherwise closed.
Legal vertical first (not healthcare) | CONVICTION | First sales target | 6–12 month sales cycle vs 18–24 months for healthcare. Higher price tolerance. No patient-safety liability. Use legal revenue to fund healthcare entry.
MSFT equity (indirect play) | MONITOR | No action | Too diluted. Already priced in at 32x earnings.
PLTR equity | NO EDGE | Skip | 80x revenue multiple. Defense AI narrative fully priced.
NVDA equity | NO EDGE | Skip | Priced in. Thesis-agnostic to private inference specifically.

Recommended Action Sequence

If Building: 90-Day Sprint to First LOI
  • Week 1–2: Infrastructure procurement. 2 × A100 80GB from used GPU market ($12–15K/unit). Secure colocation slot ($150/month, single U).
  • Week 2–4: Deploy vLLM serving Llama 3.1 70B. Benchmark throughput. Build basic inference API with auth, rate limiting, per-customer token accounting.
  • Week 3–6: Engage HIPAA legal counsel. Draft BAA template ($5–10K). Identify SOC2 readiness auditor and begin engagement ($15–20K).
  • Week 4–8: Sales outreach. Target Am Law 100 firms with AI practice leads. Use Harvey AI pricing ($30–100K/firm) as market validation. Aim for 3–5 discovery calls.
  • Week 8–12: Proof of concept with first legal prospect. Deploy their data in isolated customer namespace. Run contract review benchmark vs GPT-4o. Close first LOI at $25–50K/year.
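The Week 2–4 API layer (auth, rate limiting, per-customer token accounting) can start very simply. A minimal in-memory sketch; the class name, cap, and billing interface are illustrative, and a real system would persist usage and reset it per billing period:

```python
from collections import defaultdict

class CustomerMeter:
    """Per-customer token accounting with a monthly cap (illustrative sketch)."""
    def __init__(self, monthly_cap_tokens: int):
        self.cap = monthly_cap_tokens
        self.used = defaultdict(int)     # customer -> tokens this period

    def charge(self, customer: str, prompt_tokens: int,
               completion_tokens: int) -> bool:
        """Record usage; return False if the request would exceed the cap."""
        total = prompt_tokens + completion_tokens
        if self.used[customer] + total > self.cap:
            return False                 # reject: over monthly commitment
        self.used[customer] += total
        return True

    def bill(self, customer: str, price_per_million: float) -> float:
        """Dollar charge for this period's usage at a blended $/M rate."""
        return self.used[customer] / 1e6 * price_per_million

meter = CustomerMeter(monthly_cap_tokens=50_000_000)
assert meter.charge("amlaw-firm-a", 1_200, 3_500)
print(meter.bill("amlaw-firm-a", 15.0))   # 0.0705 -> about $0.07 for 4,700 tokens
```

Gating `charge()` inside the inference API handler is what turns the monthly minimum commitment from the revenue model into an enforced contract term.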

What Would Change These Recommendations

Upgrade to URGENT if: Azure raises private AI tier pricing (capacity constraints), or a major HIPAA breach at a shared-cloud AI provider triggers regulatory action. Either event would accelerate the move to private inference and expand the market dramatically.

Downgrade to HOLD if: Azure/AWS drops HIPAA-compliant private inference pricing to $3–4/M for mid-market customers. This would compress margins below break-even for a boutique provider. Monitor Azure OpenAI pricing updates quarterly.

Exit operational play if: Two consecutive 12-month periods without signed customer growth. Enterprise AI sales cycles are long but not infinite — if three years of effort produces fewer than 5 customers, the go-to-market is broken and should be reassessed.

Monitoring Signals

IX. References

1. IBM Security — “Cost of a Data Breach Report 2023” — Healthcare sector average $10.9M breach cost; IBM Security / Ponemon Institute. Primary data on HIPAA breach economic stakes.
2. HHS Office for Civil Rights — “HIPAA Enforcement Highlights” — Average penalty structure by violation category. Basis for $1.9M/violation/year HIPAA fine estimate.
3. Microsoft Azure — “Azure OpenAI Service — Pricing and Compliance” — HIPAA BAA availability, Confidential Computing GPU tier pricing. Competitive pricing benchmark for private inference alternatives.
4. AWS — “Amazon Bedrock Pricing — Provisioned Throughput” — $8.50/hour provisioned throughput pricing, BAA eligibility. AWS private inference cost comparison basis.
5. DeepSeek — “DeepSeek API Platform Pricing” — DeepSeek V3 input $0.27/M standard, $0.135/M off-peak. Source for commodity floor pricing and the off-peak vs blended distinction.
6. Anthropic — “Claude API Pricing” — Claude Sonnet $3/M input, $15/M output; blended $12/M at 1:3 ratio. Upper-bound pricing anchor for private inference positioning.
7. OpenAI — “OpenAI API Pricing” — GPT-4o $2.50/M input, $10.00/M output; GPT-4o mini $0.15/$0.60. GPT-4o blended $8.13/M at 1:3 ratio.
8. Nuance Communications (Microsoft) — “DAX Copilot Ambient Clinical Intelligence” — $100/provider/month pricing reference. Market proof-point for willingness to pay in ambient clinical note.
9. Harvey AI — “Harvey AI for Legal” — $30,000–100,000+/year per BigLaw firm contract range per industry reporting. Validates legal AI pricing and proves attorney-client privilege concern drives private deployment demand.
10. SEC — Rule 17a-4 (Electronic Recordkeeping) — 6-year retention requirement for broker-dealer electronic records. Regulatory basis for financial services audit trail requirement in AI inference logs.
11. Scale AI — DoD IDIQ Contract Announcement, 2024 — $250M contract for AI data labeling and private inference infrastructure for classified environments. Validates defense sector as ceiling for private inference market size.
12. Abridge — “Abridge Clinical AI Platform” — Series B funding; $50–150/provider/month clinical documentation subscription. Comparable revenue model benchmark for healthcare AI infrastructure plays.
13. Morgan Stanley / OpenAI — Reported partnership details from The Wall Street Journal (2023): dedicated Azure OpenAI deployment for wealth management; estimated $5–15M/year infrastructure spend. Proves financial services willingness to pay for private inference at scale.
14. AICPA — “SOC 2 Type II: What You Need to Know” — SOC2 readiness and certification cost structure. Basis for $50–200K compliance cost estimate.
15. vLLM Project — “vLLM: Easy, Fast, and Cheap LLM Serving” — A100 80GB throughput benchmarks for Llama 3.1 70B. Basis for 25–35 tokens/second throughput estimate used in cost model.
16. Llama 3.1 Technical Report — Meta AI, 2024 — Model architecture, benchmark performance vs GPT-4o, clinical hallucination rate analysis. Basis for model quality risk assessment in clinical settings.
17. Groq — “Groq API Pricing” — Llama 3.1 70B at $0.59/M input, $0.79/M output. Shared-cloud low-latency inference benchmark; also illustrates that fast inference does not solve the compliance problem.
18. Google — “Gemini API Pricing on Vertex AI” — Gemini 2.0 Flash $0.10/M input, $0.40/M output. Lowest commodity floor reference point in pricing table.