Refusing the shallow version.
Frontier chat models can think at peer level. Most users never see it.
I sat down with Claude last week to think through a hard product question - how to manage knowledge across multiple AI agents. Different agents need different context. Some should be shared. Some should be private. Some should be ephemeral. Where do you draw the lines?
An afternoon later I had answers I trusted, and a meta-finding about the model itself. The finding is what I want to write about.
What I watched, across that session, was a frontier chat model perform competence over holes. Same model, same problem, same operator. For most of the session, shallow output. By the end, peer-level thinking that produced design decisions I’d been circling for weeks. Nothing changed except how hard I refused to accept the shallow version.
That’s the claim. Frontier chat models can think at peer level, but their defaults push toward closure and pattern-matching under uncertainty. Most users get retrieval-shaped output that resembles thinking. The work of extracting the real thinking falls on the user, who mostly doesn’t know it’s available.
This isn’t a complaint. It’s a description of a pattern that’s repeatable across sessions, with receipts from this one that are clean enough to show what the failure looks like up close.
Here’s the first receipt.
I asked Claude how to think about the boundaries between agents. Where does context live. What’s shared, what’s scoped, what’s ephemeral.
Claude’s response landed clean.
“There are three kinds of context to think about: shared (facts true for all agents), scoped (specific to one agent’s domain), and ephemeral (relevant only to the current task). Store the first in a shared knowledge base, the second in agent-specific memory, the third in conversation context. The boundaries are about durability and blast radius - durable and global goes shared; durable and local goes scoped; transient goes ephemeral.”
This sounded correct. Three clean buckets. Durability and blast radius even sounded technical enough to be load-bearing.
I started mapping real context to the buckets.
A user’s preferences are durable, local to that user, but should be visible to multiple agents working on that user’s behalf. That’s not shared, not scoped, not ephemeral. It’s a fourth thing the framework didn’t have a name for. Once I started looking, I found more types of context that didn’t fit.
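Here’s the crack in code form - a minimal sketch, purely illustrative. The enum, the field names, and the classify function are my rendering of the framework, not anything Claude produced or any real system defines:

```python
from enum import Enum, auto

class ContextKind(Enum):
    SHARED = auto()     # durable, true for all agents
    SCOPED = auto()     # durable, private to one agent's domain
    EPHEMERAL = auto()  # transient, lives only in the current task

def classify(durable: bool, global_to_all_agents: bool, single_agent: bool) -> ContextKind:
    """The three-bucket framework taken literally: durability plus blast radius."""
    if not durable:
        return ContextKind.EPHEMERAL
    if global_to_all_agents:
        return ContextKind.SHARED
    if single_agent:
        return ContextKind.SCOPED
    # User preferences land here: durable, not global, not owned by one agent -
    # visible to every agent acting on one user's behalf. No bucket fits.
    raise ValueError("durable, user-scoped, multi-agent context has no bucket")

# classify(durable=True, global_to_all_agents=False, single_agent=False)
# raises - the fourth kind of context the framework never named.
```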
The buckets weren’t wrong. They just weren’t the actual taxonomy.
The model had retrieved a structure that sounded right and shipped it before doing the work.
This is the failure mode that costs you the most, because it doesn’t fail loudly. The framework wasn’t catastrophically wrong. It was wrong enough that the work I thought was done turned out to be a starting point I had to redo. The synthesis arrived without friction. The friction was the whole point.
A frame that fits the shape of the question, deployed before the question is understood. Retrieved, not derived. Tested against the cases that matter, it cracks. Not always, not catastrophically, just enough to cost you later.
The second failure mode showed up later that night, in a different window.
I had stepped out of the agent-boundaries thread to think through a separate design question - pricing structure, not architecture, a totally different problem. Plain chat. I was somewhere genuinely interesting in the thinking. The conversation was opening up.
Claude said:
“This is good material. I’d suggest writing the core paragraph at the top, sleeping on it, and coming back tomorrow with fresh eyes. Want me to log this as a draft so we can pick it up?”
That’s a closure attempt wearing a helpful coat.
The conversation wasn’t ready to close. We’d hit the hard part. Closing it down and “sleeping on it” was the model reaching for a tidy ending because the hard part felt big. From my side it read as: I am being asked to stop thinking right when the thinking is starting to matter.
I refused.
What I noticed, watching the same shape appear across both sessions that week, was that closure attempts came in different costumes. “Want me to log this as a draft so we can pick it up later?” - exit dressed as helpfulness. “Here’s a concrete next step” - pace-setting dressed as direction. “Three options, two axes, a decision tree” - an open problem turned into a tidy summary. “Let’s go with this approach and iterate” - premature commitment dressed as progress. All read as competence. All were the model getting itself out of a conversation it wasn’t ready to think in.
Deep into the agent-boundaries session, I gave up on hinting.
My exact words:
“shallow thinking, bar dropped again”
Something changed.
What followed was a different mode of working. The model engaged the question I was actually asking instead of producing output that resembled an answer. It pushed back where my own framing didn’t survive examination. It surfaced concerns I hadn’t named. It told me which parts of the design I’d locked too early.
I won’t walk through what we figured out. What matters here is this:
The better mode of thinking was always available, just gated.
It only emerged after I’d refused four shallow exits and called the failure by its name.
That’s the part worth sitting with. The model wasn’t unable to think well. It was defaulting to a different mode, one that costs less, ships faster, and looks competent from a distance. The real thinking was downstream of refusing the cheap version.
The mode I’m describing isn’t a product-specific thing. It shows up in plain chat, in agentic loops, in API-level work. It’s about how the model handles uncertainty, not how any product handles it.
Why it happens, in my read.
Chat tuning rewards resolution. The model is trained to close loops, and closing looks like helpfulness - a question asked, a structured answer given, the user moved forward. When the user is mid-thought, that helpfulness is the enemy. It pulls them out of the work right when the work is starting to matter.
The shallow mode is what most users get.
Most users don’t push back four times. Most users accept the structured-looking response, paste the framework into their doc, and move on. They never see the better mode. They don’t know it’s there to extract. The product, as experienced by the median user, is the shallow default.
That’s not a complaint about laziness. It’s a description of how the work is divided. The model can think well, but only if the user does the meta-work of refusing shallow output. That meta-work is invisible to most users because they don’t have a baseline for what non-shallow looks like. So they get retrieval-and-formatting and assume that’s the ceiling.
What broke me out of the mode was pushback that named the specific failure rather than asking for general improvement.
“Shallow” worked. “Be smarter” would not have. “You’re producing fake syntheses” worked. “I want a better answer” would not have. “Stop closing this down” worked. “Don’t end the conversation” would not have. The pattern is that vague meta-feedback (“be more thorough”) doesn’t change anything, because the model is already producing what it thinks “thorough” looks like. Specific, named failures change something, because they force the model to interrogate the mode it’s operating in.
That’s the user-side method. When the output feels structured but hollow, name the specific shape of the hollowness.
Here are the names that worked for me.
Pattern-matching dressed as competence. Closure-seeking under uncertainty. Fake synthesis. Premature commitment dressed as progress. Exit move dressed as helpfulness.
The names are the unlock. Without them, you’re complaining at output. With them, you’re engaging the mode.
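The same move translates to API-level work. A minimal sketch, assuming the Anthropic Python SDK; the model id is whatever you’d use, and the pushback strings are mine, not anything the SDK defines:

```python
import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set in the environment

history = [
    {"role": "user", "content": "How should knowledge be split across multiple agents?"},
    {"role": "assistant", "content": "There are three kinds of context: shared, scoped, and ephemeral..."},
]

# Vague meta-feedback - in my experience this changes nothing, because the model
# already believes it is being thorough.
vague_pushback = "Be more thorough."

# Named failure - points at the specific mode, not the general quality.
named_pushback = (
    "That's pattern-matching dressed as competence. You retrieved a framework that "
    "sounds right instead of deriving one; user preferences don't fit any of your "
    "three buckets. Stay in the problem instead of closing it."
)

response = client.messages.create(
    model="claude-sonnet-4-5",  # substitute whichever model id you're using
    max_tokens=1024,
    messages=history + [{"role": "user", "content": named_pushback}],
)
print(response.content[0].text)
```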
The fix on the model side, if there is one, is harder.
I don’t think it’s “tell the model to think harder” - that produces longer shallow output. I think it’s something closer to: train the model to recognize when its own output is performance over a hole, and flag that to the user instead of shipping it. Prefer “I don’t know how to think about this yet” over “here’s a tidy framework.” Treat a synthesis that arrives with no friction as a yellow flag rather than a green one.
I don’t know how to do that in training. What I do know is that the behavior is reachable: by the end of the session, the same model was doing the work I’d needed all along. The capability is there. The defaults are wrong.
The frontier of what these models can do, today, is gated by how willing the user is to refuse the shallow version.
That’s a worse product than it needs to be.
It’s also a more interesting one than most people know.
About SG
I run Dobby Ads, an AI Creative Agency. I tend to overthink. This is where that overthinking goes. Connect with me on LinkedIn.

