Claude Science – Who checks the Chemistry?

Anthropic shipped Claude Science this week — an AI workbench that folds PubMed, Jupyter, R, a cluster terminal and sixty-odd scientific databases into one conversation. The launch opens with a sentence you already know by heart: “AI has the potential to dramatically accelerate the pace of scientific discovery.”

That sentence has been the first line of every AI-for-science pitch since AlphaFold folded its first protein. So let’s not spend the day being impressed that it’s fast. It’s fast. The demos are real, the speedups are on real workloads, and I’ll be using the thing. Two questions are more interesting, and more uncomfortable: what happens to scientific scrutiny when the acceleration is genuine — and which “science” are we actually talking about?

Two numbers, and they’re not the same kind of number

The launch carries two headline results, and the difference between them tells you almost everything.

At the UCSF Brain Tumor Center, an epidemiologist ran germline workups for glioma studies in roughly one-tenth the time they used to take — and, crucially, his lab independently validated the output. That’s the number with teeth. It’s measurable, it’s on a real problem, and a human confirmed it afterwards. Take it seriously.

The other number is louder and softer at the same time. A team at the Allen Institute went from as long as two years per literature review to something like ten reviews, many over a hundred pages, built by a pipeline of about twenty custom agents. Impressive. But more reviews is not better reviews, and the phrase doing the heavy lifting in that story is “citations checked over by reviewer agents.” Hold onto that phrase. We’ll come back to it, because it’s the whole ballgame.

It’s called Science. It ships as biology.

Here’s the thing that should bother anyone who doesn’t spend their days looking at cells. It’s called Claude Science. What actually ships is a drug-discovery pipeline.

Look under the hood. Every pre-configured domain is life sciences — genomics, single-cell, proteomics, structural biology, cheminformatics. All sixty-plus databases are biological: UniProt, PDB, Ensembl, ChEMBL, GEO. The model integrations are NVIDIA’s BioNeMo — Evo 2, Boltz-2, OpenFold3, all life-science models. All three flagship case studies are biomedical. And it was unveiled at an event for pharmaceutical executives and biotech founders. This isn’t “science.” It’s Claude Code for drug discovery — valuable, but not the thing the name promises.

Where’s physics? Materials? Astronomy? The one place they appear is the eligibility fine print for the discounted academic plan, which graciously extends to “chemistry, math, computer science, and physics.” No solvers, no simulation connectors, no domain models. In principle you can point the environment at anything; in practice, outside biology you’re building every connector yourself, and at that point you have Claude Code with extra steps. “AI for Science” has a habit of quietly meaning “AI for biology,” and this is the most polished example yet. When a frontier lab annexes the whole word science for one of its branches, the rest of us should at least notice the land grab.

And yet, credit where it’s earned. The moat is real, and it’s precise: integration, not intelligence. The value isn’t a cleverer model; it’s forty tabs collapsed into one conversation. The best decision in the launch is also the least quotable one: every figure ships with the exact code, the environment, and the full message history that made it. Reproducibility baked in, not bolted on. In a field still nursing a replication crisis, that’s worth more than any benchmark, and it’s the one pattern every other tool should copy, biology or not. If you’re buying this for a lab, those are your criteria: integration and auditability, not “how smart is the model.” The model is close to incidental.
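To make "auditability" concrete, here is a minimal sketch of what checking one figure's bundle might look like. Everything in it is an assumption for illustration: the FigureRecord fields, the audit function and the idea of a stored SHA-256 hash are hypothetical, not Claude Science's actual format or API.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class FigureRecord:
    """Hypothetical audit bundle for one figure (not a real Claude Science format)."""
    figure_bytes: bytes        # the rendered image as shipped
    code: str                  # the exact script said to have produced it
    environment: dict          # e.g. package name -> pinned version
    messages: list = field(default_factory=list)  # conversation history
    recorded_sha256: str = ""  # hash stored when the figure was made


def audit(record: FigureRecord) -> list:
    """Return a list of problems found; an empty list means nothing was flagged."""
    problems = []
    if not record.code.strip():
        problems.append("missing code")
    if not record.environment:
        problems.append("missing environment")
    if not record.messages:
        problems.append("missing message history")
    # Recompute the hash of the shipped image and compare with the stored one.
    actual = hashlib.sha256(record.figure_bytes).hexdigest()
    if actual != record.recorded_sha256:
        problems.append("figure does not match recorded hash")
    return problems
```

A clean result here only says the bundle is complete and the image has not changed since it was recorded. It does not show that the code reproduces the figure; for that, someone still has to re-run the code in the recorded environment and look at what comes out.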

The bit I don’t trust yet

Now the reviewer agent: the actor-critic setup where one agent generates and a separate agent checks the work for accuracy and citation fidelity, “self-correcting as it goes.”

Some of that is fine. An agent flagging an untraceable number, or a figure that doesn’t match the code that supposedly made it, is running a deterministic check against ground truth. Trustworthy. But an LLM verifying another LLM’s citations is not a solved feature. It’s the hardest unsolved problem in applied AI wearing a lab coat: persistence across a long loop without hallucinating your own intermediate state and compounding the error downstream.
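For contrast, the trustworthy kind of check is simple enough to sketch in a few lines. This is an illustration under stated assumptions, not how Claude Science's reviewer agents work: untraceable_numbers is a hypothetical helper, and its regex is deliberately naive (no thousands separators, no scientific notation).

```python
import re

# Naive numeric token: optional minus sign, digits, optional decimal part.
NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def untraceable_numbers(draft: str, source_values, tol: float = 1e-9) -> list:
    """Return the numbers quoted in `draft` that match no value in `source_values`."""
    flagged = []
    for token in NUMBER.findall(draft):
        value = float(token)
        # A number is traceable only if some source value equals it within `tol`.
        if not any(abs(value - float(s)) <= tol for s in source_values):
            flagged.append(token)
    return flagged


draft = "Median survival was 14.6 months in 212 patients (hazard ratio 0.71)."
print(untraceable_numbers(draft, [14.6, 212, 0.63]))  # ['0.71']
```

No model is involved, so the same inputs always produce the same flags. The check cannot say whether 0.71 is wrong, only that nothing in the source data accounts for it, which is exactly the kind of thing worth surfacing to a human.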

We have seen this film. It was the promise of AutoGPT. It was the promise of every agentic framework since. This is HAL calmly reporting that the AE-35 unit is about to fail when nothing is wrong with it, and the failure mode that matters is the confident wrong answer you have no easy way to cross-examine.

The tell is in Anthropic’s own copy. The UCSF lab independently validated the results. The responsible users are treating the output as a hypothesis to be checked, not an answer to be trusted. That’s the correct posture. The question is whether it survives contact with a deadline.

The risk isn’t the speed. It’s the temptation.

Here’s the part the reflexive sceptics get wrong. The danger isn’t that Claude Science is bad at science. It’s that it’s good — and a real 10× is exactly what tempts a stretched, underfunded lab to stop doing the slow part.

Validation is the expensive, unglamorous, un-fundable bit. It’s the bit a grant committee doesn’t reward and a preprint server doesn’t require. When the machine is fast and reports that it already checked its own work, the marginal cost of skipping human verification drops to near zero, and anything that cheap to skip gets skipped more and more. Automating the labour of science is a straightforward gift. Automating the scrutiny of science is how the replication crisis meets its most enthusiastic new participant.

A lab that runs ten times faster and checks its own work ten times less isn’t doing better science. It’s just wrong sooner.

What to actually do

If you run a lab: adopt it now for the tedious plumbing — data wrangling, environment setup, first-pass figures. That’s pure upside. Keep humans on validation, and treat the reviewer agent as a triage layer that surfaces things to check, not a referee whose word is final.
If you’re a PI or you sit on a review panel: the audit trail is the product. The code, the environment, the message history attached to every figure — that’s the thing to demand, open, and interrogate. The moment it becomes a rubber stamp you glance at and wave through, you’ve automated away the one thing science genuinely cannot outsource.
If you work outside biology — physics, materials, earth science, anything else: temper the excitement, because there’s no domain tooling here for you yet. Treat it as a general coding-and-compute environment with a biology front-end, and budget for building the connectors your field needs before it earns a place in your workflow.

The speed was always going to arrive. It’s the easy part, and every lab will have it soon enough. The hard part, the part that decides whether this accelerates discovery or just accelerates the production of confident nonsense, is whether humans keep checking the work instead of letting the machine mark its own homework, and whether we notice how narrow the map has quietly become. A tool that runs ten times faster, checks its own work ten times less, and quietly redefines science to mean biology isn’t accelerating discovery across the board. It’s accelerating one corner of it, and hoping we don’t read the fine print. Tools don’t settle those questions for us. We do.

RogueLoop. Where AI meets real-world innovation.