Why Classical ML Still Belongs in an AI-First World
There's a quiet temptation in every team building with AI right now. When you need to judge whether an AI-generated response is good, you have two options:
- Ask a domain expert to rate it. That sounds expensive and unscalable.
- Ask another large language model to score it.
Sounds like a trick? Yes. Does it work? You bet. These so-called "LLM judges" read nuance, follow instructions in plain English, are easy to set up, and need only a handful of training examples. An LLM judge, a.k.a. an autorater, is simply an AI whose whole job is to grade another AI's work. It's less strange than it sounds, and I'll unpack it properly in a future piece.
And because it's so easy, it has quietly become the default. The first tool we reach for, and often the only one.
But "easiest to start with" and "right for the job" are not the same thing.
An LLM judge carries two quiet costs. One, every judgment burns real compute. Two, it's overconfident: it rarely gives you a real sense of how uncertain it is. Meanwhile, the unglamorous machine-learning models many of us were happily using a few years ago (the boring, dependable ones) are the opposite: a bit harder to train, but incredibly cheap to run at scale and honest about their own uncertainty, all on exactly the kind of structured, numerical signals a language model struggles to weigh.
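"Overconfident" is something you can measure rather than take on faith. One common check is expected calibration error (ECE): bucket a judge's confidence scores, then compare the average confidence in each bucket with how often the judge was actually right. The sketch below is a minimal pure-Python version, assuming you have a small set of human-labeled examples and that each judge outputs a probability that the response is good. The function name and the toy numbers are illustrative, not results from a real system.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Weighted average gap between stated confidence and observed accuracy.

    probs  -- predicted probability that each response is "good" (0.0 to 1.0)
    labels -- human ground truth for each response (1 = good, 0 = bad)
    """
    if len(probs) != len(labels) or not probs:
        raise ValueError("probs and labels must be non-empty and the same length")

    # Group predictions into equal-width confidence buckets.
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bucket
        bins[idx].append((p, y))

    total = len(probs)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_confidence = sum(p for p, _ in bucket) / len(bucket)
        observed_rate = sum(y for _, y in bucket) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_confidence - observed_rate)
    return ece


# Toy, made-up data: a judge that says 0.99 on everything but is right half the time,
# versus one that says 0.5 on the same coin-flip cases.
overconfident = expected_calibration_error([0.99, 0.99, 0.99, 0.99], [1, 1, 0, 0])
well_behaved = expected_calibration_error([0.5, 0.5, 0.5, 0.5], [1, 1, 0, 0])
```

A lower number means the stated confidence tracks reality more closely. Running the same check on both the LLM judge and the classical model, on the same labeled sample, turns the comparison into a number instead of an impression.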
So here's the tension I keep seeing. Teams reach for the powerful, costly, overconfident tool by default, and quietly retire the cheap, well-behaved one — right when it would help most.
I don't think the answer is to pick a side.
Picture evaluation as a small panel of judges, each with a very different running cost. One is a flywheel: spin it up once, and it keeps turning on almost no energy. One is an engine: far more powerful, but it burns fuel every time it fires. And one is a human: the only true source of ground truth, and the one you can least afford to call on at scale.
Seen that way, the question stops being "which judge is best?" and becomes "who should weigh in on what?" The cheap model can quietly handle the unglamorous work — separating the easy, obvious cases from the genuinely hard ones, so the expensive judges spend their attention only where it counts. And the two systems can share what they know, each covering for the other's blind spots: one fluent in language, the other fluent in numbers.
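Here is one way the "who should weigh in on what?" idea could look in code. This is a minimal sketch, assuming the cheap model produces a calibrated probability that a response is good. The thresholds (0.1 and 0.9), the `route` function, and the example scores are all hypothetical choices for illustration. In practice you would tune the thresholds against human-labeled data.

```python
from dataclasses import dataclass


@dataclass
class Routed:
    item_id: str
    decision: str  # "accept", "reject", or "escalate"


def route(item_id: str, p_good: float, low: float = 0.1, high: float = 0.9) -> Routed:
    """Let the cheap model settle the obvious cases and escalate the rest.

    p_good is the cheap model's probability that the response is good.
    low and high are hypothetical thresholds, not recommended values.
    """
    if not 0.0 <= p_good <= 1.0:
        raise ValueError("p_good must be a probability between 0 and 1")
    if p_good >= high:
        return Routed(item_id, "accept")    # confidently good: no expensive judge needed
    if p_good <= low:
        return Routed(item_id, "reject")    # confidently bad: no expensive judge needed
    return Routed(item_id, "escalate")      # uncertain: send to the LLM judge or a human


# Made-up scores from the cheap model for four responses.
scores = {"a": 0.97, "b": 0.55, "c": 0.03, "d": 0.80}
routed = [route(item_id, p) for item_id, p in scores.items()]
escalated = [r.item_id for r in routed if r.decision == "escalate"]
print(escalated)  # ['b', 'd']
```

In this toy run, only two of the four responses ever reach an expensive judge. How much you save in a real system depends on how many cases the cheap model can settle with confidence, and on how well calibrated that confidence is.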
There are genuinely intelligent ways to make the structured-data model and the LLM-based judge work in concert rather than in competition. I've been spending a lot of time on exactly this problem, and I'll share the fuller playbook (the experiments, the surprises, the places it breaks) down the road.
None of this is magic. Combining systems takes discipline: you need real data, you need to keep your models honest as the world shifts underneath them, and you need to be careful about which hard cases you actually act on. But the upside is large, and a surprising amount of it is nearly free.
We're entering an era where it's tempting to answer every question with the biggest, newest model. I'd resist that pull.
The skill that's about to matter most isn't picking the best model. It's staffing the panel — knowing which tool to trust for which question, and how to let them make each other better.
Sometimes the most valuable member of your AI panel is the boring model you made wait outside. Give it a seat at the table.
If you've felt this pull — swapping the unglamorous old model for the glamorous new one — I'd love to hear how you think about it.
#Evals #LLMs-as-a-judge #Autorater #Humans-as-a-Judge #Model-as-a-Judge