You might trust a cab to see you home. You would not do it blindfolded.

That is the mistake people make with AI. They point to benchmark scores, professional exams, citation rates, and low hallucination figures, then assume the answer in front of them is therefore safe to rely on. But averages over many outputs do not tell you whether this output, on this question, right now, is sound. A system may be right most of the time and still give you no view of whether this is one of the times it has gone wrong.

For lawyers, that gap matters more than raw accuracy. A client is not paying simply for an answer. The client is paying for a reasoned judgment that can be defended, explained, and acted on. If a system gives you the conclusion while hiding the basis on which it moved from one step to the next, it withholds the very thing that makes the work professional rather than mechanical.

That is why the standard reassurance is weaker than it sounds. We are told AI will soon sit inside better workflows: it will retrieve sources, divide the task into sub-tasks, pass them to specialist agents, compare outputs, add citations, and send the result for human sign-off. This sounds like control. But breaking a task into stages does not answer the problem if each stage still depends on an uninspected step in the one before. A longer workflow is not yet a warranted one.

The real question was never simply whether the final output looks right. The question is whether, at each point in the iteration cycle that produced it, the system was entitled to continue. That phrase matters. It marks the problem at the level of process, not prose: not whether the last sentence is plausible, but whether each transition that generated it was licensed in a form that could be inspected before the next step was taken.

Once that is the test, the weakness of output-checking becomes clearer. By the time a later stage reviews an earlier one, the relevant act of generation is already over. You can inspect the product that emerged from the stage. What you cannot recover, merely from the product, is whether the system was justified in taking that step at all. The opacity is not removed by being repeated. It is serialized.

That is also why fabricated cases are a distraction if treated as the whole disease. A fake citation is easy to spot because it collides with an external register. But an answer can cite only real authorities and still be hopeless: the reasoning may misread the cases, miss the controlling distinction, neglect the strongest contrary point, or draw an inference that does not follow. In those cases the visible error has disappeared, but the underlying defect has not.

So the limitation of current AI is not just that it sometimes makes mistakes. Every tool does that. The deeper limitation is that it often cannot show, in a form a professional can inspect, why each step in its reasoning was warranted before the next step was taken. Where that cannot be shown, trust becomes substitution. The machine is no longer merely assisting judgment; it is silently replacing it.

This is the point people are most likely to miss. More workflow, more agents, more citations, and more checks do not by themselves remove opacity. They can make a system look more disciplined while leaving the central defect untouched. If the grounds for continuing are still hidden, the process is still blind.

There is an answer to this. It is architectural, not semantic. It is not a better checker, a second model, or a sharper version of the verification I’ve just ruled out — and it is not what you’ll reach for by running with the obvious next thought. It is counter-intuitive, and it is worth keeping back