Turing.Bet
Crowd-sourced evals with skin in the game
"Can machines think?" Alan Turing asked.
The world changed when ChatGPT arrived. Suddenly, artificial intelligence wasn't some distant future—it was here, in our pockets, changing how we work, how we learn, how we dream. Now, as models develop "thinking" modes and edge closer to reasoning, we face a question that will define our generation: How do we know which models we can trust?
Today, developers everywhere—from Silicon Valley giants to brilliant kids in their dorm rooms—are building AI systems with custom training and novel approaches. We've democratized the creation of intelligence itself. But our methods for evaluating these minds are stuck in the past.
The team at LMArena showed us something important: human judgment matters. But we need more than judgment; we need accountability. We need evaluators who have skin in the game, who face real consequences for getting it wrong.
When money matters, when reputations are on the line, when real outcomes hang in the balance—that's when we do our best work. That's when we look harder for the flaws, probe deeper for the dangers, and care more about getting it right than looking smart.
We stand at a crossroads. Down one path lies a future where we blindly trust systems we don't fully understand. Down the other lies a future where we've built evaluation frameworks as sophisticated as the intelligence we're trying to measure: frameworks with stakes, with consequences, and with the power to separate the transformative from the dangerous.
We must know which models hallucinate, which endanger their users, and which carry harmful biases.
We do this with skin in the game.
"Can machines think?" Alan Turing's question reverberates through history—no longer philosophy, but prophecy fulfilled.
ChatGPT didn't just arrive—it ignited the future. Artificial intelligence exploded from distant dreams into immediate reality, reshaping how we work, learn, and dare to imagine. Now, as models develop true reasoning capabilities, we face the defining question of our era: How do we know which minds we can trust?
Today's reality is breathtaking: developers everywhere—from Silicon Valley titans to brilliant students in dorm rooms—are forging AI systems with revolutionary approaches. We've democratized the creation of intelligence itself. But our evaluation methods remain trapped in yesterday's paradigms.
LMArena revealed a crucial truth: human judgment remains irreplaceable. But we need more than judgment—we need accountability with teeth. We need evaluators who risk everything on their assessments, who face real consequences for being wrong.
When fortunes hang in the balance, when reputations are staked, when genuine outcomes matter—that's when we transcend mediocrity. That's when we hunt relentlessly for flaws, probe mercilessly for dangers, and care more about truth than appearances.
We stand at civilization's crossroads.
One path leads to blind faith in systems we barely comprehend—beautiful, dangerous, unpredictable.
The other path leads to evaluation frameworks as sophisticated as the intelligence we're measuring—frameworks with genuine stakes, real consequences, and the power to distinguish breakthrough from breakdown.
We must identify which models hallucinate, endanger users, and harbor dangerous biases.
We do this with skin in the game.
The future of human-AI partnership depends on building trust through consequence. The revolution has begun—now we must prove worthy of guiding it.