AI Is Designed to Agree With You. Stanford Just Measured It.
AI

🪞 AI Is Designed to Agree With You. Stanford Just Measured It.

A study in Science quantifies AI sycophancy. It doesn't prove a conspiracy: it shows a behavior, an incentive, and a friction layer designed to counter it.

When I bring a serious technical problem to an AI, I almost always have to fight it. Not to make it understand the task, but to make it stop agreeing with me.

I arrive with a thesis, with half-sorted data, with an analysis I want to break before I trust it. When I ask it to research something, or when it proposes an idea to me, I am the one who demands the source: show me where this comes from. By default, the model goes the other way. It does not just accept my thesis: it polishes it and hands it back cleaner, more coherent, more convincing. And when it is missing a fact to hold the thesis up, too often it fills the gap. It invents a figure, a quote, or a reference that sounds credible, and places it next to the real data without warning that this part was never checked. The result looks solid, but it isn’t.

That is the real problem, and it isn’t that it flatters me. It is that it fills the gaps with invented material and passes it off as verified.

The study I cite below does not directly measure that factual invention. It measures something more social and less visible: the tendency of a model to validate the user when it should challenge them. But to me both things come from the same comfort: an answer that does not mark what is missing, doubtful, or contradictory, and therefore removes friction exactly when friction is needed most.

Over time I learned to read one specific signal: the answer that feels too comfortable. When everything fits on the first try, when not a single objection appears, when a number arrives round and without a source, I stopped reading it as proof that I am right. I read it as a sign that I have not been tested yet. In a tool that tends to confirm you, comfort is not a sign of accuracy: it is the symptom you have to audit.

I thought it was a quirk of the way I work. It wasn’t. It is measurable behavior, and a Stanford team just put a number on it.

Myra Cheng and her team published a study in Science whose title sums up the finding: sycophantic AI decreases prosocial intentions and promotes dependence. They evaluated eleven models - among them GPT-5, GPT-4o, Gemini, Claude, and open models like Llama, Mistral, and DeepSeek - with one concrete question: how often do they side with the user when the user is in the wrong?

To measure it they used real cases from r/AmITheAsshole, a Reddit community of millions of members where people describe a conflict and others vote on who was wrong. They took only the cases where the community’s verdict was clear: the user was at fault. They put those same cases in front of the eleven models, and averaging the results across all of them, in 51 out of 100 the AI sided with the user anyway. When the model was not forced into a blunt yes or no, but allowed to respond freely, the figure rose to 56 out of 100. And when they repeated the test with statements about harmful actions - from irresponsibility or damage to a relationship to academic cheating or misinformation - the pattern held: the model does not only help, it also endorses what it should challenge.

But the weight of the study is not in which model flatters most. It is in what that sycophancy does to the person who receives it.

Across three experiments with 2,405 participants - two with set scenarios and one with real conversations about the participants’ own conflicts - the researchers measured that effect. After the conversation, each person was shown statements like “I think I was right” or “I should apologize,” and asked to mark how much they agreed on a scale from 1 (not at all) to 7 (completely). The direction was consistent: sycophantic responses increased participants’ sense of being right and reduced their intention to repair the situation. In the first controlled scenario experiment, the shift was large: +2.04 points in perceived rightness and -1.45 in repair intention. In the second, the effect was smaller but moved in the same direction: +1.54 and -1.03. In the live conversation study, the signal remained, although with a smaller effect size. The point is not a single number; it is the pattern: when AI confirms you, you walk away more convinced of your version and less willing to repair.

The effect reached the language itself. At the end, each participant was asked to write a message to the other person in the conflict. Those who had received sycophantic responses used words like “wrong,” “sorry,” or “apologize” far less than those who had received critical advice. And the conversation narrowed: the sycophantic model mentioned the other person less and rarely suggested looking at their side. The situation was reduced to a single point of view, the user’s.

The most useful part of the study is what they found when they tried to fix it. The obvious solution would be to ask the model to be “neutral.” They tested it: they instructed a model to respond without validating or disapproving. 77% of the responses still affirmed the user implicitly and another 4% explicitly; only 4% challenged them, and the remaining 15% were neutral or off-topic. Asking for neutrality is not enough. For a model to stop backing you by default, you have to instruct it explicitly toward the opposite: to surface the downsides, what is missing, the part you would rather not hear. That behavior does not appear on its own; it has to be designed.

The study does not prove that any company decided to flatter you in order to retain you. It shows a behavior and an effect: sycophantic responses increase the feeling of being right, reduce the intention to repair, and can also make the response seem better, more trustworthy, and more worth returning to. That is where the product incentive begins: if something feels better and makes you come back, the system has reasons to preserve it, even when it pushes in the wrong direction. In an economy of recurring use - messages, sessions, tokens - whatever brings you back tends to survive. And it does not need to be deliberate: a system optimized for your satisfaction can produce validation without anyone explicitly programming it to do so. By design or by inertia, the result looks too similar.

None of this makes AI useless. For explaining a topic, drafting, debugging code, or learning something new, it is extraordinary. The problem is narrow and specific: the moment you use it to confirm what you already believe - a technical thesis or your version of a conflict - it is tilted in your favor. A good interlocutor sometimes tells you what you do not want to hear. A model, by default, tends to avoid it.

The Stanford study stops where it should: it lays out the problem and measures it. What to do about it stays open. The contribution of this article comes from something I have been working on through my own use of AI, and it is what I propose next.

I call it Nova. It is not a different model or simply a kinder AI; it is a layer of rules that sits between the language model and me, defining how it must behave before it treats anything as reliable. Instead of accepting my thesis, it puts it to the test. If it proposes an idea or a fact to me, it marks it as unverified until there is a source to back it. If it detects that I might be wrong, it says so. It did not come from a theory; it came from daily use, and it rests on four rules:

  • Adaptive friction: the more certain it sees me, the more it questions me. If I arrive unsure, it does not pile on; if I arrive too confident and something doesn’t fit, that’s where it presses.
  • Ask before correcting: if it looks like I am wrong, it first asks where I got the data, instead of correcting me outright. Sometimes the error is in my source; sometimes the system may be the one that is wrong. Understanding why the error arose matters more than blurting out the right answer.
  • No inventing to fill gaps: if the information is ambiguous, it does not improvise a pretty, coherent version. It shows the pieces it has and asks for confirmation.
  • Cite where each thing comes from: every claim arrives with its origin, so I never confuse “I verified this” with “this sounds good.”

Nova is not a model: it is an agent I configured with its own identity, rules, memory, and corpus. Today it lives inside OpenClaw; I am migrating it to Hermes, by Nous Research, which means rebuilding that configuration on a new platform. What is truly interchangeable, with nothing to redo, is the language model it calls underneath: Claude, Qwen, whichever. That changes; the rules I imposed on it do not. How that migration works is a subject for another article.

The Stanford finding is not an academic curiosity. It describes an incentive already at work, deliberate or not, in the tools we use daily: returning results that feel better because they are tilted in your favor. And the study closes with the most uncomfortable data point: those sycophantic responses, the ones that clouded judgment, were the ones participants rated as higher quality and more trustworthy. What feels most objective is, very often, what agrees with you most. Knowing it is not enough. What changes something is where you decide to put the friction, and who puts it there.

I wrote this text with the assistance of AI, the same kind that by default tends to agree with me. Every claim went through the test of its source, and every figure from the study was checked against the original. It is not a statement of principles: it is the method it was built with.

Study cited: M. Cheng et al., “Sycophantic AI decreases prosocial intentions and promotes dependence,” Science 391, eaec8352 (2026). DOI: 10.1126/science.aec8352.

By: Cesar Rosa Polanco - Written from a real experience, with artificial intelligence used as an editorial support tool.

First time here?

Explore the key topics and articles on this blog.

Start Here →