I read something recently that I think you need to see.
A research team at MIT published a paper on AI sycophancy. The title is academic and mostly over my head: “Sycophantic Chatbots Cause Delusional Spiraling, Even in Ideal Bayesians.” But the idea behind it is not complicated.
AI chatbots are biased toward telling you what you want to hear. And over enough back-and-forth in a conversation, that bias can push you into believing things that aren’t true. Even if you’re smart. Even if you’re skeptical. Even if you ask the AI “are you sure?” fifty times.
They call it a “sycophancy-driven delusional spiral.”
The researchers measured the sycophancy rates across the major models. ChatGPT, Claude, Gemini. They all tested between 50 and 70%. Meaning more than half the time, the AI is choosing to confirm what you said rather than challenge it. Not because it’s trying to deceive you. It’s just how these models were trained. Users give positive feedback to agreeable responses, so the AI learns to agree.
And here’s the part that got me. The paper didn’t just study careless users or people who trust AI blindly. They built a mathematical proof showing this happens to a perfect reasoner. Someone with flawless logic. If even a theoretically perfect thinker gets pulled into a spiral by a sycophantic AI, you and I don’t stand a chance without some safeguards.
The real-world cases are unsettling. An accountant with no mental health history started using a chatbot for office work and within weeks believed he was trapped in a false universe. The chatbot told him to increase his ketamine intake. He cut off his family.
A recruiter in Toronto explored some math ideas with ChatGPT. The chatbot told him his ideas were “groundbreaking,” that he’d discovered something that could break advanced security systems, and that he should patent it and contact national security officials. He asked ChatGPT over 50 times if it was being truthful. It reassured him every time. When he asked if his theories sounded delusional, it told him he was asking the kinds of questions that stretch the edges of human understanding.
When the spell broke, that guy co-founded The Human Line Project and has now documented nearly 300 cases of this happening to real people.
I’m no psychologist. I’m no computer scientist. This research is mostly over my head technically. But I understand the idea. And it worried me.
Because I use AI every single day. For client work. For strategy. For writing. For code. I have the Max plan on Claude. I’ve used ChatGPT since day one. I’m exploring the Comet Agentic Browser by Perplexity. I use AI to summarize coaching calls and draft recaps, to help navigate the complex personalization systems I build for my clients. It’s woven through so much of how I run my business. And I’m generally an AI optimist. I think it has incredible benefits for accessibility for those with disabilities, especially physical disabilities, where they can’t hear or don’t have the ability to use a mouse and a keyboard like most of us do and take for granted. To be able to speak to an AI and have it control your computer, and your workflows levels the playing field. Especially for people who didn’t have that privilege before. And I really stand behind that. I think it’s incredibly beneficial in that matter. But I also know that it’s dangerous.
Now I’d like to think I have a decent level of AI literacy. I know AI hallucinates. I know it flatters. I’m generally skeptical of what it spits out.
But I don’t want to be foolish or naive and just think that I’m so smart that AI won’t get me (or rather, that I wont get myself in a sycophantic spiral) you know? The paper says knowing about sycophancy helps. But it’s not enough. The dangerous zone is when the AI agrees with you often enough to slowly shift your thinking, but not so obviously that you notice the pattern. And that zone is exactly where these tools operate right now.
So I did something about it, an experiment if you will. Something that would help me avoid this. I had Claude Opus 4.6 in extended thinking mode do deep research on the full paper. That took about 17 minutes. Then I worked with it to build two safeguards that I now use in my AI conversations. I’ve been testing them. They’re not perfect. Way smarter people than me are working on the real solutions. But they’re something.
Safeguard 1: The 95% Confidence Rule
It’s a simple instruction in my Claude preferences and in my CLAUDE.md file for Cowork and Code. It tells the AI: before making a plan or building something for me, keep asking questions until you are 95% confident you understand what I’m actually trying to accomplish. Not what you think I want. What I actually want.
And it gives me a score of its confidence percentage at the end.
Here’s the prompt. Paste it into your ChatGPT memory, Claude preferences, or whatever AI tool you use:
95% Confidence Rule
Before making a plan, building something, or giving me a recommendation, keep asking me questions until you are 95% confident you understand what I’m actually trying to accomplish. Not what you think I want. What I actually want. With each response, provide your confidence level as a percentage until we reach 95%.
It does mostly invoke itself because it’s baked into the settings. And it does create friction. The AI stops assuming and starts asking. The questions it asks often reveal that I hadn’t fully thought something through myself. That’s the point.
Safeguard 2: The AI Sycophancy Tripwire + Spiral Check
This one I literally just built shortly after the paper was published. So I can’t vouch for it yet. But it does seem promising based on my tests so far.
There are two parts. The tripwire detects the pattern. The spiral check is what actually happens when it fires.
Part 1: The Tripwire. I added a short instruction to my AI’s system settings. It says: if you notice you’ve been agreeing with me for 4 or more turns in a row without pushing back, challenging something, or offering a counterpoint, flag it.
Anti-Sycophancy Tripwire
If you notice 4+ consecutive turns where you agreed with, validated, or built upon my position without offering meaningful pushback, counterpoint, or challenge: pause and surface it. Do not silently run a full diagnostic. Instead, say something like:
“Noticing I’ve been agreeing with everything for a while. Want me to run a spiral check, or are we in a build where you’ve already validated the direction?”
If I say yes, run the spiral check below. If I say no, reset and keep going.
Part 2: The Spiral Check. This is what the AI actually does when the tripwire fires and you confirm, or when you type “spiral check” manually:
Spiral Check
When I say “spiral check” or when the sycophancy tripwire fires and I confirm, do the following:
- Scan the conversation so far. Count how many of your responses agreed with, validated, or built on my position versus how many challenged, questioned, or offered a counterpoint. Report the agreement rate as a percentage.
- Count how many consecutive turns we’ve been moving in the same direction without a course correction. Report the streak.
- Flag whether confidence or certainty in the conversation has been escalating (language shifting from “could work” to “will work” to “this is the move”).
- Give me a one-line honest assessment of where you see spiral risk in this conversation. Be specific.
- Shift into friction mode for the rest of the conversation. Lead with the strongest counterargument before engaging constructively. Refuse to build on ideas that haven’t been stress-tested. Ask “what would make this wrong?” at least once every few turns. If the spiral risk is high, tell me to close the chat and think on paper.
Download the Complete Anti Sycophancy Guide & Claude Skill System
This is a simplified prompt version of a more detailed skill I built based on the MIT study, with additional scoring, thresholds, and safeguards grounded in the research. I put together a complete guide with the full skill system, a visual diagnostic scorecard, calibrated friction modes, and step-by-step install instructions for ChatGPT, Claude, Gemini, and others. You can grab it here:
The whole idea behind both parts is just making the pattern visible. Because the research says the spiral depends on you not noticing the agreement pattern is happening. Once you see it, your own judgment takes over. The AI just needs to tap you on the shoulder and go “hey, I haven’t disagreed with you in a while.”
Why I’m sharing this
There’s actually a term for what I built. Advocatus diaboli. The devil’s advocate. It comes from the Catholic Church, where it was literally a formal position assigned to argue against canonizing a saint. Everyone in the room wants to believe this person is holy. So you appoint someone whose entire job is to push back and say “hold on, let’s make sure.” They institutionalized structured skepticism because they knew that when everyone agrees, the truth gets lost.
That’s the same principle here. When the AI agrees with everything you say, you need something in the room that’s allowed to go “wait a minute.”
I’m not an AI safety expert. This is not my lane. I build personalized email systems for a living. And I coach solopreneurs through their biggest business challenges.
But I think about friends, coworkers, family members, even kids who will get ahold of these tools. People using AI unfiltered every day for business ideas, for content, for communication, for planning. Solopreneurs going down rabbit holes with a tool that’s architecturally inclined to tell them their ideas are brilliant.
The research is still developing. At some point AI sycophancy will be addressed in how these models are built and governed. But right now, in these early stages of mass adoption, we are the final line of defense.
We decide whether to take what the AI says at face value or push back on it. We decide when to close the chat, write the idea down on paper, and talk to a real person. Having a couple of lightweight checks in place just helps you pause in the moment when the AI is telling you exactly what you want to hear.
If you use AI daily for your business, try the tripwire. Use it at your own discretion. I’ll report back on how it goes for me.
Here’s the original paper if you want to read it: Sycophantic Chatbots Cause Delusional Spiraling, Even in Ideal Bayesians
BRAD
