From Wow to ‘Meh’: How Near-Perfect AI Still Tanks Billion-Dollar Bets
The 7% gap that should inform your decision (and spare you wasted time, burned money, and broken careers)
🔎 SIGNAL
I keep seeing the same chart in board decks: proof-of-concepts launched vs. projects in production. It looks like the Matterhorn: steep ascent, sheer drop. Forbes reports that 90% of generative-AI pilots never make it past beta.
“Many companies pilot gen AI for use cases in which it is the wrong technology to address the business need. In some cases, the companies found better technologies, already developed or even already in place, that met their needs better than gen AI. In other cases, they found that gen AI is not yet mature enough to execute the tasks robustly in the production environment.” (Forbes)
🎥 STORY
Let me share a story. It's fictional, but I've lived through enough versions of it to know it's painfully honest.
January rolls around. A global insurer gets excited about building a "RAG-powered automatic underwriting concierge." The Underwriting team sees the demo and falls in love. ChatGPT answering every underwriting question in plain English? Sign us up. The engineers? They're seeing dollar signs at $0.002 per request.
March arrives with a reality check. The concierge starts hallucinating cash-flow models. Inventing policy clauses out of thin air. The final nail? It cites a bankrupt company as “A-rated.” All of the cited documents are either outdated or wrong. (If you’ve tried doing research with Perplexity, you’ve seen that too.)
Internal auditors kill it. Dead on arrival.
S&P Global's latest numbers tell the broader story: the share of companies abandoning most of their AI initiatives jumped from 17% at the end of 2023 to 42% one year later. Nearly half of all proof-of-concepts (46%, to be exact) never see production.
Management’s instinct was to double down: “Let’s bolt on autonomous agents so the concierge fixes its own errors.”
(Trust me when I say, I've heard this exact suggestion more times than I can count.)
Meanwhile, the folks actually talking to customers remain unconvinced that this solves any real problem. By June, our fictional insurer has spent more on prompt-engineering workshops than on actuarial training.
McKinsey reports that 65% of firms now "regularly" use gen-AI, double last year's share. Yet value capture clusters in marketing copy and code autocompletion. Not exactly the transformative use cases we were promised.
The story ends with a whimper, not a bang. The project gets quietly re-scoped. No agents. No autonomous filing. Instead, they build a stripped-down tool that suggests precedent clauses and highlights information gaps. A human underwriter must click "accept" for anything to happen.
It's boring. And it's in production.
🧭 THE HUMAN OVERRIDE
I've been calling this the 92% problem.
Models hit 92% accuracy. Executives budget for 99%. That 7% gap blows up your KPIs faster than you can say “digital transformation.”
Let’s fix that:
Diagnose task duration. METR's evaluations of long-horizon tasks show that even today's best models complete long, multi-step tasks with only about 50% reliability. Think about that. If your workflow spans regulatory quarters, you need manual checkpoints. Period.
Test agents as a new operating model, not as employees. Every company building agents (like this one) trumpets an AI-agent market rocketing from $5 billion to $47 billion by 2030, driven by OpenAI toolkits promising a 61% speed lift on repetitive chores. You’ve probably (or hopefully) already learned how to take these numbers. They’re in the business of selling dreams. But do experiment with agents as capacity, not autonomy. Agents draft. Humans dispatch.
Constrain the blast radius. Multi-agent orchestration sounds cinematic, right up until one rogue process loops through 10,000 API calls. Industry trackers warn that truly reliable agents may not arrive before 2026. So sandbox every chain. Log every decision.
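What does "sandbox every chain, log every decision" look like in practice? Here's a minimal sketch in Python. Everything in it (the `SandboxedAgent` wrapper, the `max_calls` budget) is a hypothetical illustration, not any vendor's API: the point is simply that a hard call ceiling and a decision log turn a runaway loop into a logged exception.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("agent-sandbox")


class CallBudgetExceeded(RuntimeError):
    """Raised when an agent chain hits its hard API-call ceiling."""


class SandboxedAgent:
    """Wraps an agent step function with a call budget and a decision log.

    `step_fn` is any callable mapping state -> (new_state, done).
    Names here are illustrative, not a real framework's API.
    """

    def __init__(self, step_fn, max_calls=50):
        self.step_fn = step_fn
        self.max_calls = max_calls  # the blast radius: a rogue loop stops here
        self.calls = 0

    def run(self, state):
        done = False
        while not done:
            if self.calls >= self.max_calls:
                log.error("budget exhausted after %d calls", self.calls)
                raise CallBudgetExceeded(f"hit hard limit of {self.max_calls} calls")
            self.calls += 1
            state, done = self.step_fn(state)
            # log every decision: which call, what state, whether it halted
            log.info("call %d: state=%r done=%s", self.calls, state, done)
        return state
```

A well-behaved chain runs to completion; a chain that never terminates raises `CallBudgetExceeded` after `max_calls` steps instead of burning 10,000 requests.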
Map legal exposure early. The EU AI Act made this non-negotiable. Deploy an “almost-there” model in credit scoring? Congratulations, you've inherited the Act's high-risk compliance burden. Doesn't matter how many disclaimers you slap on it.
Build the override lever first. Before shipping a single line of code, design your human-in-the-loop. Today’s AI excels at use cases where it’s easy to design a human fallback quickly and reliably. Could be a policy flagging regulated data. Or a literal “pause” button in the UI. That override is your safety net against the 8% that matters.
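To make the override lever concrete, here's a sketch of the two mechanisms mentioned above: a policy that flags regulated data, and a literal pause switch. All names (`OverrideGate`, `REGULATED_TERMS`, the status strings) are hypothetical, and a real policy check would be far richer than substring matching; the shape is what matters, not the details.

```python
from dataclasses import dataclass, field

# Illustrative policy list; a real deployment would use a proper classifier.
REGULATED_TERMS = {"ssn", "diagnosis", "credit score"}


@dataclass
class OverrideGate:
    """Holds model suggestions until a human explicitly accepts them."""

    paused: bool = False                       # the literal "pause" button
    pending: list = field(default_factory=list)

    def submit(self, suggestion: str):
        if self.paused:
            return ("blocked", suggestion)      # pause beats everything
        if any(term in suggestion.lower() for term in REGULATED_TERMS):
            self.pending.append(suggestion)     # regulated data: queue for review
            return ("needs_review", suggestion)
        return ("auto_ok", suggestion)          # low-stakes: let it through

    def accept(self, index: int):
        # Only a human clicking "accept" releases a queued suggestion.
        return self.pending.pop(index)
```

Note the design choice: the gate ships before any model logic does. If the model is swapped out, upgraded, or degraded, the override behaves identically.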
Measure boredom, not brilliance. Remember our insurance concierge? It only worked when we made it boring. Document retrieval. Clause suggestions. Nothing sexy. Boring tasks scale because they tolerate 92%. "End-to-end underwriting!" crashes on that missing 8%.
Framework in practice
The 3-Tier Reality Check
After debugging enough AI models to fill a server farm, here's my framework:
Tier 1: Drafts, titles, hooks, brainstorming, search, code hints
Works now. Hallucinations are cheap to catch. Saves time already.

Tier 2: Drafting contracts, customer chat
Viable with guardrails and audit logs.

Tier 3: Autonomous decisions affecting money, health, or liberty
Still research territory. Keep humans in control.
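The three tiers above reduce to one routing rule. Here's a minimal sketch; the task names, the `TIER_OF` table, and the action labels are all hypothetical stand-ins for whatever taxonomy your organization actually uses. The one deliberate choice worth copying: unknown tasks default to the strictest tier.

```python
# Hypothetical task categories mapped to the three tiers above.
TIER_OF = {
    "brainstorming": 1, "search": 1, "code_hint": 1,       # Tier 1
    "contract_draft": 2, "customer_chat": 2,               # Tier 2
    "credit_decision": 3, "medical_advice": 3,             # Tier 3
}


def route(task: str) -> str:
    """Decide how much autonomy a task gets under the 3-Tier Reality Check."""
    tier = TIER_OF.get(task, 3)  # unknown tasks default to the strictest tier
    if tier == 1:
        return "automate"                  # errors are cheap to catch
    if tier == 2:
        return "automate_with_audit_log"   # guardrails + logs
    return "human_decides"                 # keep humans in control
```

A boring lookup table, on purpose: boring is what survives the missing 8%.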
🔥 SPARK
Seventy startups in Y Combinator’s Spring 2025 cohort pitch “agentic AI” as the new iPhone moment. But if these agents inherit today's 8% error rate, what exactly are they funding?
The collateral damage is already visible. 13,000 hallucinated articles were removed in Q1 2025. OpenAI's o3 model hallucinates 33% of the time on questions about public figures. Bad content is cheap to delete; bad medical advice is not.
So here’s the provocation I’m sending to policy-makers, CEOs, and every product manager building “the next big thing”:
What if the killer app isn't about perfecting AI at all? What if it's about radical acceptance of human fallibility?
We could channel these models into amplifying small-stakes creativity (email drafts, storyboard sketches) while keeping high-stakes judgment firmly human.
Where do you draw that line? Send me your sharpest take, and I'll feature the best ones next week.
Thank you for reading; I hope it helped!
-a
P.S. If you’re struggling to make AI work for your specific context, I can help.
I run limited consulting slots for executives and professionals who need to move from AI confusion to concrete results. No generic frameworks. Just personalized roadmaps that match your industry, skills, and timeline.
Only taking 7 Quick-Wins and 4 Case Studies this month. They fill fast.
Check this out and apply to the program that best suits your context; it takes 3 minutes.