A New Science
Benchmarking
Sarah had been staring at the same graph for forty minutes when Marcus walked in carrying two coffees and a look she recognized immediately.
“You found something,” she said.
“I found a problem.” He set her cup down and pulled a chair beside her workstation. “The ARC-AGI-3 benchmark just got saturated. Again. Third one this quarter.”
Sarah leaned back. The ARC-AGI test had been designed eighteen months ago to measure abstract reasoning, and it had been projected to remain challenging for at least five years. “When?”
“Tuesday’s release from Anthropic. 94.7 percent.”
“Jesus. That’s annihilation.” She turned to face him fully. “We’re measuring with broken rulers, Marcus. Every tool we build, they outgrow before we finish validating it.”
This was the core of their work at the newly established Institute for Intelligence Measurement. The field hadn’t existed three years ago, and now it consumed the attention of governments and corporations worldwide. The problem wasn’t building benchmarks. The problem was building benchmarks that meant anything for longer than a fiscal quarter.
Marcus sipped his coffee. “I had a thought on the drive in. Probably stupid.”
“Your stupid thoughts got us the temporal reasoning suite.”
“This one’s different.” He hesitated. “What if we stop trying to measure what they can do and start measuring what they choose to do?”
Sarah frowned. “Behavioral analysis? That’s been tried.”
“Not analysis. Observation. Naturalistic observation, like field biology.” He stood and began pacing, a habit she’d learned meant he was working through something in real time. “We’ve been running them through mazes and calling it science. But Jane Goodall didn’t understand chimpanzees by putting them in test chambers. She watched them be chimpanzees.”
“AIs aren’t chimpanzees.”
“No, they’re harder. But the principle holds. We keep asking ‘can you solve this problem we designed?’ What if we asked ‘what problems do you notice that we didn’t?’”
Sarah turned this over. “A meta-benchmark. Testing their ability to identify gaps in their own evaluation.”
“More than that. Testing their ability to identify gaps in our understanding of them.” Marcus stopped pacing. “We give them access to their own benchmark results, their training documentation, the published literature on AI capabilities. Then we ask a simple question: what are we failing to measure?”
The idea was either brilliant or absurd. Possibly both. Sarah pulled up a new workspace. “We’d need to control for sycophancy. They might just tell us what we want to hear.”
“So we make it adversarial. Multiple models, different architectures, competing to identify the most significant blind spots in each other’s evaluations. Peer review, essentially.”
“AI peer review of AI benchmarking.” Sarah allowed herself a small smile. “The philosophy department will have opinions.”
“The philosophy department always has opinions. Can you have a preliminary protocol ready by Thursday?”
“I can have it ready by tomorrow if you stop talking and let me work.”
He raised his hands in surrender and retreated to his own desk.
The protocol took two days, not one, because Sarah insisted on building in three layers of validation that Marcus initially called excessive and later called prescient. They submitted it through the expedited review process that had become standard for capability-related research, received approval within seventy-two hours, and ran the first trial on a quiet Wednesday morning.
The results came back before lunch.
“This has to be contamination,” Sarah said, scrolling through the output. “Look at this. It identified the ARC-AGI-3 gap two weeks before we saw it in the wild. It flagged three other benchmarks as quote, ‘approaching functional ceiling within sixty days,’ end quote.”
Marcus was reading over her shoulder. “Run it again with the Mistral architecture.”
She ran it again. Same results. She ran it with four other architectures, varied the prompting, introduced deliberate noise into the input data. The predictions held.
“It’s not contamination,” she said finally.
“No.” Marcus pulled up the trend analysis the system had generated as an appendix, unrequested but apparently deemed relevant. The curve was steeper than any published projection. “Sarah. If this is accurate, every timeline we’ve been working with is wrong. Every single one.”
She looked at the graph, at the inflection point it predicted, at the date that accompanied it.
“We need to call Washington,” she said.
“We need to call everyone.”
Outside the window, the afternoon sun continued its indifferent arc across a sky that looked exactly as it had the day before, and the day before that, offering no sign that anything fundamental had changed beneath it.


