The Recursion Institute · INDEPENDENT RESEARCH IN AI SAFETY

UNDERSTAND AI · IN THE NEWS

How to read the news about AI without getting spun

AI coverage runs hot. One week a headline says a model “passed the bar exam” or “can now do a doctor’s job”; the next, a story says the whole thing is a bubble about to pop. Both can leave you more confused, not less. The good news: you don’t need to be a technologist to read these stories well. You need a short, reusable set of questions — the same ones whether the news is breathless or doom-laden. This page hands you that checklist. We’re not going to tell you what to think about any company or where this all goes; that’s exactly the kind of bet we’re teaching you to be skeptical of. We’re going to show you how to look.

A demo is not the product

The single most common way to get spun is to mistake a polished demonstration for the everyday tool. A demo is a curated best case: the impressive example, run under good conditions, often picked from many tries and edited for the reel. The shipped product is what you actually get on a random Tuesday with your own messy question — which is usually less reliable, sometimes much less.

This isn’t a trick unique to AI; it’s how launches have always worked. But it matters more here because these systems are uneven: genuinely impressive on the cases someone chose to show you, and shaky on neighboring cases they didn’t. When you see a stunning clip, the honest question isn’t “can it do that?” It’s “how often, on inputs nobody hand-picked?” That gap — between can and reliably — is where most of the spin lives.
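
As a rough illustration of why one stunning clip says little about everyday reliability, here is a minimal Python sketch. The 20% success rate and the ten attempts are invented numbers, not measurements of any real system, and the function name is hypothetical. It also assumes each try is independent of the others, which a real demo session may not satisfy.

```python
def chance_of_showable_clip(per_try_success: float, tries: int) -> float:
    """Probability that at least one of `tries` independent attempts succeeds.

    Illustrative only: assumes every attempt has the same success rate
    and that attempts do not influence one another.
    """
    if not 0.0 <= per_try_success <= 1.0:
        raise ValueError("per_try_success must be between 0 and 1")
    if tries < 0:
        raise ValueError("tries must be non-negative")
    # "At least one success" is the complement of "every try fails".
    return 1.0 - (1.0 - per_try_success) ** tries


# Hypothetical inputs: the tool works 1 time in 5, and the team records 10 takes.
demo_odds = chance_of_showable_clip(per_try_success=0.2, tries=10)
print(f"Chance of at least one good take: {demo_odds:.0%}")
```

With these made-up inputs, a tool that works one time in five still produces at least one good take in about 89% of ten-try sessions. The clip can be entirely real and the tool can still fail most of the time on your own question.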

Who’s talking, and what are they selling?

Every voice in an AI story has an incentive. That doesn’t make any of them liars; it makes them worth placing.

You don’t need to be cynical about all of them. You just follow the incentive: ask what this person gains if you believe them. The most trustworthy claim is one that cuts against the speaker’s interest — a company admitting a limit, a critic conceding something works.

What “passes the exam” actually means

A huge share of AI headlines rest on a score: “passes the bar,” “scores in the top 10%,” “beats doctors on a medical test.” These come from benchmarks — standardized tests used to measure systems. They’re useful, and they are not what the headline implies.

A test score measures performance on that test, under test conditions. It does not measure general competence in the real job. For a person, the bar exam covers a narrow slice of being a lawyer; the judgment the work demands takes years to build, and no test captures it. The same caution applies to a model. There’s also a quieter problem: because these systems learned from enormous amounts of text, some test material may resemble what they already absorbed, a bit like having seen the answer key. A high score is a real signal worth noting. It is not a promise that the thing is reliable where it counts.

Go deeper: how to read a benchmark claim

When a story leans on a score, a few questions separate signal from theater.

Which test, and who ran it? A maker reporting its own result on a test it chose is weaker evidence than an independent group running many systems the same way.

What does the test actually contain? Multiple-choice questions reward pattern-matching that may not transfer to open-ended real work.

Could the answers have leaked into training? If a model effectively studied the answer key, the score is inflated.

Is the comparison fair? “Beats humans” often means a specific group working under specific limits, not experts working normally.

Does the score predict the thing you care about? A high mark on a curated test frequently doesn’t carry over to messy, real-world use. The move from answering clean exam questions to handling your actual situation is exactly where reliability tends to drop.

None of this means benchmarks are worthless. It means a score is one data point, not a verdict.

“An AI can now do X” — read the small print

When a headline says a system “can now” do something, it almost always means sometimes, under some conditions, for some inputs. It does not mean always, everywhere, or necessarily for you. Capability and reliable capability are different things, and the difference is the whole story.

A tool that writes a working bit of code, or summarizes a document, or answers a hard question correctly — some of the time — is genuinely useful, and it’s fair to be impressed. But “some of the time” is precisely the part the headline drops. For tasks where being wrong is cheap and easy to spot, “often right” is plenty. For tasks where a confident mistake is costly, the failure rate is the only number that matters, and it’s the one rarely printed. When you read “can now do X,” silently add: how often, and what happens when it’s wrong? And remember that a fluent, assured-sounding answer isn’t the same as a correct one — these systems can state false things in the same confident tone as true ones, which is what makes an unstated failure rate worth asking about.
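
To see why the failure rate and the cost of a miss have to be read together, a back-of-the-envelope sketch in Python can help. Every figure here is hypothetical: the 5% failure rate, the costs per miss (in whatever unit you care about, such as minutes or dollars), and the 100 uses are placeholders, and the function name is invented for this example. It also assumes every miss costs the same, which is a simplification.

```python
def expected_cost_of_errors(failure_rate: float, cost_per_miss: float, uses: int) -> float:
    """Average total cost of mistakes across `uses` attempts.

    Illustrative only: treats every miss as equally costly and the
    failure rate as constant from one use to the next.
    """
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be between 0 and 1")
    if cost_per_miss < 0 or uses < 0:
        raise ValueError("cost_per_miss and uses must be non-negative")
    return failure_rate * cost_per_miss * uses


# Same hypothetical 5% failure rate, very different stakes.
cheap_task = expected_cost_of_errors(0.05, cost_per_miss=1.0, uses=100)
costly_task = expected_cost_of_errors(0.05, cost_per_miss=500.0, uses=100)
print(f"Cheap-to-miss task:  about {cheap_task:.0f} units lost per 100 uses")
print(f"Costly-to-miss task: about {costly_task:.0f} units lost per 100 uses")
```

Under these assumptions the same "often right" tool loses roughly 5 units on the low-stakes task and roughly 2,500 on the high-stakes one. A headline that reports neither number gives you no way to tell which situation you are in.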

One screenshot is not evidence

A viral screenshot — an AI saying something brilliant, or something unhinged — is an anecdote, not a finding. It might be cherry-picked from a hundred boring tries, edited, set up with a leading prompt, or simply a one-off. A single striking example can be real and still tell you almost nothing about how the system behaves in general.

A real study is different: many cases, a method you could repeat, conditions written down, results that hold up when someone else runs them. When a story rests on “look what it did,” ask whether that’s one moment or a measured pattern. The honest version of the question runs both ways — a single impressive answer doesn’t prove a system is brilliant, and a single alarming answer doesn’t prove it’s dangerous. Either way, you want the pattern, not the snapshot.

A checklist to copy and keep

Next time an AI headline lands, run it through these. Most spin doesn’t survive even a few of them.

  1. Demo or daily use? Is this a curated best case, or how the tool actually behaves on ordinary inputs?
  2. Who benefits if I believe this? Follow the incentive — company, investor, critic, all of them have one.
  3. Can, or reliably? Does “can do X” quietly mean “sometimes, in some conditions”?
  4. What does the score actually measure? A test result is performance on that test, not general competence.
  5. One example or a real study? Anecdote and screenshot, or many cases with a repeatable method?
  6. What’s the failure rate — and the cost of a miss? Both are usually left out, and both decide whether it’s safe to rely on.
  7. Is anyone predicting the future? Forecasts about jobs, “the race,” or bubbles are bets, not facts — weigh them as opinion.

The one-line version: AI news gets spun by mistaking demos for products, scores for competence, and “can” for “reliably” — so for any headline, ask who benefits, whether it’s a cherry-picked example or a real study, and what the unstated failure rate is. Describe what’s shown; don’t buy the forecast.

Where to go next

The AI landscape

A neutral map of who makes which assistant and what actually differs — useful for placing any product story.

What an AI answer costs

The real resources behind “just ask the AI” — helpful for reading the business and energy coverage clearly.

AI claims, decoded

A plain-language translator for the loaded words — “reasons,” “understands,” “knows” — that fill the headlines.

Why AI makes things up

Why a confident, fluent answer can still be wrong — the reason an unstated failure rate matters so much.

Reading the news well is its own skill, separate from using the tools well. If you want to pressure-test your own sense of what these systems do, try the self-check — or browse the rest of Understand AI for the plain-language map underneath the headlines.