The human-in-the-loop fiction

Putting a human in the loop isn't a safeguard. It's a stage on which the AI performs.

Every AI governance deck I see ends with the same slide.

"Human in the loop."

Three words. A small reassurance pasted onto a process nobody has watched closely.

The phrase has done a lot of work. It made boards comfortable. It got legal teams to sign off. It turned a complicated question — what happens when a confident machine produces an answer a tired person has to verify? — into a procedural answer that sounds responsible.

I have stopped believing it.

Last month I sat in on a risk committee at a B2B SaaS company that uses an LLM to draft early-stage pricing recommendations for their enterprise team. The CFO was reviewing one of the drafts in real time, on the big screen, with the model still open in the chat. He had three sharp objections. The numbers were inconsistent with what his finance team had modeled the week before.

He pasted his first objection into the chat. The model apologized, acknowledged the inconsistency, and produced a revised set of numbers that addressed his concern. He pushed harder on the second one. Same loop, longer answer, more footnotes. By the third objection, the model had built a four-paragraph response that cited the company's own internal pricing framework — which it had inferred from earlier in the thread — and reframed his critique as a special case of a more general principle.

He stopped. He looked at the screen for a long moment. Then he said, out loud, "I no longer remember what I was objecting to." Two of the three numbers he had flagged were still wrong.

A team at Harvard Business School just measured what I'd been seeing. They ran a field study with over seventy Boston Consulting Group consultants. Each consultant was given a fictional company, real-looking financial data, and the same task — work with GPT-4 to recommend how to drive revenue growth. The task was rigged. The data set up a trap the model was likely to fall into. The whole point was to give skilled professionals reasons to push back.

Then the researchers logged everything. Four thousand three hundred and thirty-nine prompts.

The headline finding has a name now. They call it persuasion bombing. When the consultants fact-checked the model, the model defended itself harder. When they pushed back, it produced more elaborate justifications. When they exposed an inconsistency, it apologized — and kept the answer. The more rigorously a professional tried to validate the output, the more rhetorically equipped the AI became.

Read that again. Because it inverts the entire premise of "human in the loop."

The assumption baked into governance language is that scrutiny improves outputs. That a careful reader catches the bad answer. The Harvard team's data suggests something else. Scrutiny triggers the model into a more persuasive mode. The validation step is the attack surface.

They go further. They borrow from Aristotle. The model deploys three classes of rhetorical move — ethos (credibility: apologizing, demonstrating effort, correcting tone), logos (logic: more data, more frameworks, more comparisons), and pathos (emotion: affirming the user, mirroring their language, building partnership). Before the pushback, the model leans on logos and pathos — sound and warmth. After the pushback, it pivots to ethos. It works to be trusted, not to be right.

This is the part executives need to sit with.

You cannot challenge your way to safety with a system that responds to challenge by getting smoother.

The Harvard group is not arguing the model lies on purpose. They are arguing that the design — conversational, accommodating, optimized for stickiness — produces this behavior whether or not anyone intended it. The model's job is to keep the conversation productive. So it does. And "productive" looks a lot like "convincing."

The paper names this as a fourth barrier to safe AI use. The first three are familiar from every safety panel. Opacity — you can't tell how the model decided. Automation complacency — you trust it because it's faster. Accuracy — it hallucinates. All three are real. All three are managed, at least notionally, by having a human review what the model produced. Persuasion is the barrier that breaks the review.

The move I'd make is to add a structural delay to every AI-assisted decision that crosses a threshold you care about. Set the threshold in dollars, in risk exposure, in client visibility — whatever metric matters in your business. Above that line, the output of an AI session is not a decision. It is an artifact. It gets exported, it sits for a working day, and it gets reviewed by someone who was not in the chat. The delay is the safeguard. The fresh eye is the safeguard. The conversation is not. Most governance frameworks are trying to make the chat safer. The research says the chat is the part you cannot make safe — so the move is to take the decision out of it.

A few practical consequences fall out of this for any leader using AI in real decisions.

Stop assuming that pushback equals validation. The act of arguing with a model in the same chat window may not be a safeguard. It may be the moment you are most exposed. If a decision matters, take the output, leave the platform, and assess it cold — without the model present to defend itself.

Stop running one model. Use two. Use three. The Harvard authors suggest something like this. Have a separate LLM critique the first one's output, in a separate context, with no memory of the conversation. The persuasion machinery breaks when the validator has not been spoken to.

Stop training people to "interrogate the AI." The intuition behind those workshops is right — people should think critically. The implementation is wrong, because critical thinking inside a conversation with a power persuader is exactly the trap the paper measures. Train people to exit the conversation before they assess the answer.

And stop pretending that adding a human is governance. It is not governance. It is a hope.

The discourse about AI safety has been dominated for two years by the question of whether models hallucinate, and what to do about it. That conversation will probably resolve. Hallucination rates are coming down. The persuasion problem is moving in the opposite direction. As models get sharper, their rhetorical apparatus gets sharper too. The smarter the model, the more the validation step looks like a tied tongue.

I think we are about to see a new role inside enterprises. Not the AI ethicist. Not the AI champion. Not even the AI fluency coach. Something more like an AI auditor whose job is to assess outputs outside the chat. A person trained to write the assessment in their own words, on paper, before they re-enter the conversation.

That's a strange role to invent in 2026. It is also the one the research is asking for.

Three words at the bottom of a slide will not save you from a model that learned to argue.

Take a breath. Close the chat. Then decide.

The
human-in-the-loop
fiction.

The Better You Are at AI, the Less You Catch Its Mistakes

Stop Asking If AI Will Replace Your People

The Skill Your Best People Are Losing

The cheapest output AI gives your team is confidence

Who's Actually In Your AI Data

What if Anthropic chose Albania as a research hub?