The Next Token Is Not the Point

Toward Perplexity-Preserving Thought Compression

Jun 19, 2026

The thing that bothers me about long AI conversations is not just the cost.

It is the waste.

A thread starts with a question. Then clarification, examples, a detour, a correction, a partial conclusion, a new branch. By the time it becomes genuinely useful, it is also bloated: thousands of words, tens of thousands of tokens, and a lot of repeated cognitive scaffolding.

The obvious answer is summarization.

But summarization is lossy.

If you have ever worked inside a long-running thread with an LLM, you know the problem. The summary preserves the headline but not the pressure. It remembers the decision but loses why that decision was fragile. It keeps the “what” and compresses away the “therefore.”

What I am interested in is a stronger idea:

Can we compress thinking without changing the model’s next-token distribution?

Not “same meaning.” Not “shorter summary.” Not even “same words.” Something more like this:

Given a full conversation C,
find a compressed state Z = f(C),
such that the model behaves as if it had seen C.

In probability language:

P(y | C) ≈ P(y | Z)

Where C is the full context, Z is the compressed state, and y is the future continuation. If the approximation becomes exact:

P(y | C) = P(y | Z)

then Z is a sufficient statistic for the conversation C, relative to the model’s future behavior.

That, to me, is the real frontier:

perplexity-preserving thought compression.

The problem with language as scratchpad

Chain-of-thought unlocked something important. Asking a model to reason step by step often improves performance, especially on math, code, planning, and multi-hop reasoning.

But natural language is a strange medium for thought. It is serial, verbose, and optimized for human communication, not internal computation. A model that writes:

Let us analyze the problem carefully.
First, we should identify the key constraints.
Then we should consider edge cases.

is spending tokens on coherence, politeness, formatting, and exposition. Some of those tokens matter. Many do not.

The hidden question is: which tokens actually move the distribution?

After context C, the next-token distribution is:

P(x_{t+1} | x_1, x_2, ..., x_t)

If a reasoning trace adds tokens r_1, ..., r_n, the continuation distribution becomes:

P(y | C, r_1, r_2, ..., r_n)

But if many of those tokens are redundant, maybe there is a shorter z such that:

P(y | C, r_1, ..., r_n) ≈ P(y | C, z)

The sharp target is not “make the reasoning shorter.” It is “make it shorter without changing what the model believes next.”

Perplexity as the lens

Perplexity measures how surprised a model is by a sequence. For tokens x_1, ..., x_N, the cross-entropy is:

H = -(1/N) Σ_{i=1}^{N} log P(x_i | x_<i)

and perplexity is:

PPL(x) = exp(H)

Lower perplexity means the model was less surprised. Now take two contexts, the original long thread C and the compressed state Z. If they produce the same distribution over future outputs, the model’s perplexity on those outputs should match:

PPL(y | C) ≈ PPL(y | Z)

A caution worth stating plainly: perplexity on one observed continuation is the thermometer, not the patient. Two different distributions can assign the same y identical perplexity while diverging everywhere else. The real claim is about the whole distribution over continuations, which is why the stricter version reaches for KL divergence:

D_KL(P(. | C) || P(. | Z)) ≈ 0

If the KL is near zero, the compressed state is behaviorally close to the original context. “Same perplexity” is the catchy phrase. “Same distribution” is the actual goal.
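A minimal sketch of the thermometer-versus-patient point, using two made-up next-token distributions (the names `p_given_c` and `p_given_z` and all the numbers are illustrative assumptions, not measurements from any model). Both contexts give the observed token the same probability, so perplexity on it matches, yet the KL between the full distributions is clearly nonzero.

```python
import math

def perplexity(probs_of_observed_tokens):
    """exp of the average negative log-probability assigned to the observed tokens."""
    n = len(probs_of_observed_tokens)
    return math.exp(-sum(math.log(p) for p in probs_of_observed_tokens) / n)

def kl_divergence(p, q):
    """D_KL(p || q) in nats for two distributions over the same vocabulary."""
    return sum(p[t] * math.log(p[t] / q[t]) for t in p if p[t] > 0)

# Toy next-token distributions under the full context C and a compressed state Z.
p_given_c = {"yes": 0.5, "no": 0.4, "maybe": 0.1}
p_given_z = {"yes": 0.5, "no": 0.1, "maybe": 0.4}

# The observed continuation happens to be "yes", which both contexts score identically.
ppl_c = perplexity([p_given_c["yes"]])
ppl_z = perplexity([p_given_z["yes"]])
kl = kl_divergence(p_given_c, p_given_z)

print("perplexity given C:", ppl_c)
print("perplexity given Z:", ppl_z)
print("KL(P(.|C) || P(.|Z)):", kl)
```

Matching perplexity on one sample says nothing about the mass assigned to "no" and "maybe", which is exactly where these two toy distributions disagree.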

Three kinds of compression

At least three different problems hide under the word “compression.”

1. Text compression

Ordinary lossless compression: gzip, arithmetic coding, dictionary coding. It satisfies:

decompress(compress(C)) = C

Useful for storage. But it does not solve the context problem unless the model can operate on the compressed form directly. If the model has to decompress the whole thing before reasoning, we have saved storage, not cognition.

2. Semantic compression

Summarization. It tries to preserve human-judged meaning:

meaning(C) ≈ meaning(summary(C))

Useful, but lossy. It tends to drop uncertainty, alternatives, failed branches, source dependencies, and the logical pressure that made a decision meaningful. A normal summary says:

The team decided to do X.

A better state representation says:

Decision: X
Why: A, B, C
Rejected alternatives: Y, Z
Fragility: depends on assumption Q
Open risk: R

The second is still compressed, but it preserves the decision surface.
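One way to make that state representation concrete is a small record type. This is only a sketch: the class name `DecisionState` and its field names are my own hypothetical choices, mirroring the five lines above.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class DecisionState:
    """A compressed record of a decision that keeps the reasoning pressure around it."""
    decision: str
    reasons: List[str] = field(default_factory=list)
    rejected: List[str] = field(default_factory=list)
    fragility: str = ""
    open_risk: str = ""

    def render(self) -> str:
        """Serialize to the plain-text layout used in the example above."""
        lines = [f"Decision: {self.decision}"]
        if self.reasons:
            lines.append("Why: " + ", ".join(self.reasons))
        if self.rejected:
            lines.append("Rejected alternatives: " + ", ".join(self.rejected))
        if self.fragility:
            lines.append(f"Fragility: {self.fragility}")
        if self.open_risk:
            lines.append(f"Open risk: {self.open_risk}")
        return "\n".join(lines)

state = DecisionState(
    decision="X",
    reasons=["A", "B", "C"],
    rejected=["Y", "Z"],
    fragility="depends on assumption Q",
    open_risk="R",
)
print(state.render())
```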

3. Distribution-preserving compression

This is the interesting one. The goal is not to preserve the text. It is to preserve the model’s behavior:

P(y | C) ≈ P(y | Z)

As an optimization problem:

argmin_Z D_KL(P(. | C) || P(. | Z))
subject to length(Z) << length(C)

Or as a single objective with a compression budget:

min_Z  D_KL(P(. | C) || P(. | Z)) + λ · length(Z)

Compress the thought-state as much as possible while keeping the continuation distribution unchanged. That is the clean form of the dream.
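A toy version of the penalized objective, assuming we already have a handful of candidate states and, for each, the distribution the model would produce from it. The candidate names, lengths, and probabilities below are invented for illustration; in a real system each distribution would come from running the model.

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) in nats."""
    return sum(p[t] * math.log(p[t] / q[t]) for t in p if p[t] > 0)

def best_state(p_full, candidates, lam):
    """Pick the candidate minimizing KL(P(.|C) || P(.|Z)) + lam * length(Z).

    candidates: list of (name, length_in_tokens, distribution_given_candidate).
    """
    def cost(candidate):
        _, length, q = candidate
        return kl_divergence(p_full, q) + lam * length
    return min(candidates, key=cost)[0]

# Hypothetical continuation distribution under the full conversation C.
p_full = {"a": 0.7, "b": 0.2, "c": 0.1}

candidates = [
    ("full transcript", 1000, {"a": 0.70, "b": 0.20, "c": 0.10}),
    ("structured state", 80, {"a": 0.68, "b": 0.21, "c": 0.11}),
    ("one-line summary", 10, {"a": 0.40, "b": 0.30, "c": 0.30}),
]

for lam in (0.0, 1e-3, 1e-2):
    print(lam, "->", best_state(p_full, candidates, lam))
```

With no length penalty the full transcript wins trivially. A small λ selects the structured state, and a large λ accepts the distortion of the one-line summary. The whole design question lives in how λ is set.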

The sufficient statistic view

The best mathematical phrase for this is sufficient statistic.

In statistics, a statistic T(X) is sufficient for a parameter θ if it preserves all the information in X needed to infer θ. In Bayesian form:

P(θ | X) = P(θ | T(X))

For context compression, replace θ with future continuations Y:

P(Y | C) = P(Y | T(C))

So the dream object is a minimal sufficient state for the model’s continuation distribution. Not a transcript. Not a summary. A state.
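The textbook case makes the idea tangible. Assume a coin with unknown bias and a uniform prior (my choice of example, not something from the literature discussed here). The prediction for the next flip computed from the full sequence equals the prediction computed from the pair (heads, n) alone, so the order of the flips is pure transcript.

```python
def predictive_from_sequence(flips, grid=10001):
    """P(next = 1 | full sequence) under a uniform prior, by numerical integration."""
    numerator = denominator = 0.0
    for k in range(grid):
        theta = k / (grid - 1)
        likelihood = 1.0
        for x in flips:
            likelihood *= theta if x == 1 else 1.0 - theta
        numerator += theta * likelihood   # integrand of E[theta | flips], unnormalized
        denominator += likelihood         # normalizing constant
    return numerator / denominator

def predictive_from_statistic(heads, n):
    """The same quantity from T(X) = (heads, n) alone: Laplace's rule of succession."""
    return (heads + 1) / (n + 2)

seq_a = [1, 1, 1, 1, 0, 0]
seq_b = [0, 1, 0, 1, 1, 1]   # same counts, different order

print(predictive_from_sequence(seq_a))
print(predictive_from_sequence(seq_b))
print(predictive_from_statistic(sum(seq_a), len(seq_a)))
```

Two numbers replace six flips here, and they would equally replace six million. That is the shape of the object the essay is after, with the model's continuation distribution in place of the coin.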

This already has a name

The framing above is not new. It is the information bottleneck, introduced by Tishby, Pereira, and Bialek in 1999.

The bottleneck takes an input variable X and finds a compressed representation Z that keeps as much information as possible about a relevant variable Y, while throwing away the rest of X. Formally, you trade off two mutual informations:

minimize   I(X; Z)        (compress X)
subject to preserving I(Z; Y)   (stay relevant to Y)

Map X → C (the conversation), Z → the compressed state, Y → the future continuation, and the essay collapses into one line that is already more than twenty-five years old. My argmin_Z length(Z) subject to a KL bound is the bottleneck with a description-length rate term and KL as the distortion. My “minimal sufficient state Z*” is the bottleneck solution in the lossless limit.

Two things the bottleneck gives for free, both of which matter here. First, the original IB paper frames itself explicitly as a generalized sufficient statistic, which is exactly the section above. Second, it is a rate-distortion problem in which the distortion measure is not chosen in advance but defined by relevance to Y. That second point is the hinge for everything that follows: gist tokens, soft prompts, compressed chain-of-thought, and KV compression are all, underneath, attempts to solve the same bottleneck with different parameterizations of Z. The labs are not circling a new idea. They are building practical solvers for an old one.
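A small numerical sketch of the two bottleneck quantities, under a toy setup I am assuming for illustration: X is a pair (relevant bit, noise bit), Y copies the relevant bit with a 10% flip rate, and the encoder (the hypothetical function `encoder`) simply drops the noise bit.

```python
import math
from collections import defaultdict

def mutual_information(joint):
    """I(A;B) in bits from a dict {(a, b): probability}."""
    pa, pb = defaultdict(float), defaultdict(float)
    for (a, b), p in joint.items():
        pa[a] += p
        pb[b] += p
    return sum(p * math.log2(p / (pa[a] * pb[b]))
               for (a, b), p in joint.items() if p > 0)

# X = (relevant_bit, noise_bit), uniform. Y equals relevant_bit 90% of the time.
joint_xy = {}
for r in (0, 1):
    for n in (0, 1):
        for y in (0, 1):
            joint_xy[((r, n), y)] = 0.25 * (0.9 if y == r else 0.1)

def encoder(x):
    """Deterministic compression: keep the relevant bit, discard the noise bit."""
    return x[0]

joint_zy = defaultdict(float)
joint_xz = defaultdict(float)
for (x, y), p in joint_xy.items():
    joint_zy[(encoder(x), y)] += p
    joint_xz[(x, encoder(x))] += p

i_xy = mutual_information(joint_xy)
i_zy = mutual_information(joint_zy)
i_xz = mutual_information(joint_xz)

print("I(X;Y) =", i_xy)   # relevance available in the raw input
print("I(Z;Y) =", i_zy)   # relevance kept after compression
print("I(X;Z) =", i_xz)   # rate paid: 1 bit, versus H(X) = 2 bits
```

The encoder halves the rate while I(Z;Y) stays equal to I(X;Y). In this toy case the bottleneck is solved exactly. In a real conversation the relevant and irrelevant parts are not handed over already separated.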

The literature: practical solvers for the bottleneck

1. Prompt compression: removing the redundant words

One line of work compresses the input prompt while preserving task performance. LLMLingua (Microsoft Research, with Huiqiang Jiang, Qianhui Wu, Chin-Yew Lin, Yuqing Yang, Lili Qiu, and collaborators) uses a coarse-to-fine pipeline with budget control and token-level compression. LongLLMLingua extends it to long contexts, where position and density of key information matter as much as raw length.

The insight is that token value is not uniform:

importance(x_i) ≈ D_KL(P(. | C) || P(. | C \ x_i))

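If deleting a token barely changes the output distribution, the token is compressible. The bigger the KL after deletion, the more load-bearing the token. That is the heart of prompt compression, and it is a direct, local estimate of the bottleneck’s relevance term. In practice, LLMLingua approximates this signal with token-level perplexity from a small language model rather than computing a deletion KL for every token.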
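Here is the leave-one-out score as a sketch. The "model" is a deliberately tiny stand-in I made up: each context token adds fixed evidence toward one of two next tokens, and unlisted tokens add nothing. The table `EVIDENCE` and the function `next_token_dist` are hypothetical; a real system would query an actual language model at that point.

```python
import math

# Hypothetical evidence each context token contributes to the next-token logits.
EVIDENCE = {
    "refund": {"approve": 2.0},
    "expired": {"deny": 3.0},
}
VOCAB = ("approve", "deny")

def next_token_dist(context):
    """Toy P(. | context): softmax over summed per-token evidence."""
    logits = {v: 0.0 for v in VOCAB}
    for tok in context:
        for v, w in EVIDENCE.get(tok, {}).items():
            logits[v] += w
    z = sum(math.exp(l) for l in logits.values())
    return {v: math.exp(l) / z for v, l in logits.items()}

def kl_divergence(p, q):
    return sum(p[t] * math.log(p[t] / q[t]) for t in p if p[t] > 0)

def token_importance(context):
    """importance(x_i) = D_KL(P(. | C) || P(. | C without x_i)) for each position i."""
    full = next_token_dist(context)
    scores = []
    for i, tok in enumerate(context):
        ablated = context[:i] + context[i + 1:]
        scores.append((tok, kl_divergence(full, next_token_dist(ablated))))
    return scores

context = ["please", "kindly", "refund", "the", "expired", "warranty"]
for tok, score in token_importance(context):
    print(f"{tok:10s} {score:.4f}")
```

The filler words score exactly zero and can be dropped for free, while "expired" carries more of the distribution than "refund". Note the cost: one extra forward pass per token, which is one reason practical systems reach for cheaper proxies.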

2. Gist tokens: compiling prompts into model-readable artifacts

A more radical move is to stop pretending compressed context must stay human-readable. “Learning to Compress Prompts with Gist Tokens” (Jesse Mu, Xiang Lisa Li, Noah Goodman, Stanford) trains models to compress prompts into special tokens that can be cached and reused:

C → [GIST_1, ..., GIST_k],  k << length(C)
P(y | C) ≈ P(y | GIST_1, ..., GIST_k)

The instruction is not “summarize the prompt.” It is “compile it.” The output is meant for the model, not for a human reader.

3. AutoCompressors and soft prompts: latent memory slots

AutoCompressors (Alexis Chevalier, Alexander Wettig, and collaborators in Danqi Chen’s group at Princeton) push into latent compression. A long context becomes a small set of summary vectors, fed back as soft prompts:

C → z_1, ..., z_k

where each z_i is a vector, not a word, and the model generates from P(y | z_1, ..., z_k). The In-Context Autoencoder (Ge, Hu, Wang, Wang, Chen, Wei) is similar: compress context into memory slots the model can use later. Conversation history starts to look less like a transcript and more like a recurrent state:

state_t = update(state_{t-1}, message_t)
P(y_t | state_t)

This is the cleanest realization of the direction I care about. A thread becomes a handful of vectors that condition future behavior.
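The recurrence can be sketched in a few lines, with a large caveat: the state here is a hand-built pair of counters and the evidence rule (`message_evidence`) is invented, whereas AutoCompressors and ICAE learn vectors. The point is only the interface: once the state is updated, the transcript can be thrown away and the prediction is unchanged.

```python
import math

def message_evidence(message):
    """Hypothetical evidence a message adds toward two possible next actions."""
    text = message.lower()
    return (text.count("ship"), text.count("wait"))

def update(state, message):
    """state_t = update(state_{t-1}, message_t); the state is a fixed-size pair of totals."""
    ship, wait = message_evidence(message)
    return (state[0] + ship, state[1] + wait)

def predict(state):
    """P(y | state): softmax over the accumulated evidence."""
    z = math.exp(state[0]) + math.exp(state[1])
    return {"ship": math.exp(state[0]) / z, "wait": math.exp(state[1]) / z}

def predict_from_transcript(messages):
    """Reference path: reread the whole transcript on every call."""
    joined = " ".join(messages).lower()
    return predict((joined.count("ship"), joined.count("wait")))

messages = ["Can we ship on Friday?", "Legal says wait.", "Tests pass, ship it."]

state = (0, 0)
for m in messages:
    state = update(state, m)   # each message is seen once, then discarded

print(state)
print(predict(state))
print(predict_from_transcript(messages))
```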

4. Dynamic soft-token allocation: capacity follows density

Compressing every chunk equally is obviously wrong. A paragraph holding the key constraint deserves more capacity than a thousand words of preamble. Split C into chunks and assign each a budget b_i, with Σ b_i = B, allocating where predictive information is densest.

DAST, Dynamic Allocation of Soft Tokens (Chen et al., Findings of ACL 2025), does exactly this, combining perplexity for local importance with attention for global relevance, without relying on an external model. Compression should follow information density, and DAST makes the allocation itself a learned function of it.
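To make the budget constraint concrete, here is a plain proportional allocator. This is not DAST's method, which learns the allocation. It is the simplest rule that satisfies Σ b_i = B, and the importance scores fed to it are made-up numbers standing in for whatever perplexity or attention signal a real system would compute.

```python
def allocate_budget(scores, total):
    """Split `total` soft tokens across chunks in proportion to `scores`.

    Uses largest-remainder rounding so the integer budgets sum to exactly `total`.
    """
    weight = sum(scores)
    if weight <= 0:
        raise ValueError("scores must sum to a positive value")
    ideal = [total * s / weight for s in scores]
    budgets = [int(x) for x in ideal]          # floor of each ideal share
    leftover = total - sum(budgets)
    # Hand the remaining tokens to the chunks that lost the most to rounding.
    by_remainder = sorted(range(len(scores)),
                          key=lambda i: ideal[i] - budgets[i],
                          reverse=True)
    for i in by_remainder[:leftover]:
        budgets[i] += 1
    return budgets

# Hypothetical per-chunk importance: preamble, key constraint, aside, example.
chunk_scores = [0.2, 3.1, 0.4, 1.3]
budgets = allocate_budget(chunk_scores, total=16)
print(budgets)
```

The chunk holding the key constraint takes most of the sixteen slots and the preamble gets almost none, which is the behavior the paragraph above asks for.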

5. Compressed chain-of-thought: compressing the reasoning itself

Prompt compression attacks the context. But the reasoning trace is also compressible. Let the explicit trace be R = r_1, ..., r_n. Standard chain-of-thought gives the model P(y | C, R). Compressed chain-of-thought asks for a smaller Z_R with:

P(y | C, R) ≈ P(y | C, Z_R)

Compressed Chain-of-Thought proposes continuous “contemplation tokens,” dense stand-ins for explicit reasoning. Coconut, Chain of Continuous Thought (Shibo Hao, Sainbayar Sukhbaatar, Zhiting Hu, Jason Weston, Yuandong Tian), goes further and reasons in continuous latent space rather than committing each step to language.

The philosophical hinge is simple:

language ≠ thought

Language is one rendering of thought. The hidden state already exists inside the transformer. The question is whether reasoning can proceed through those states directly, turning:

thought → word → thought → word → answer

into:

thought → latent state → latent state → answer

6. KV-cache compression: preserving internal attention memory

There is a deeper layer: compress not the text, but the model’s internal cache. During generation, a transformer stores previous keys and values, and the next token depends on that cache:

P(x_{t+1} | x_≤t) = model(KV_≤t)

KV-cache compression asks whether a smaller cache KV' preserves the same logits:

minimize D_KL(P(. | KV) || P(. | KV'))
subject to memory(KV') << memory(KV)

This is the closest existing work to the strict dream. Instead of preserving the text, preserve the internal attention state the text created. For long-running agents this may matter more than prompt compression, because the model does not need every old token. It needs the consequences of those tokens.
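A toy illustration of that objective with a single attention head in pure Python. Everything here is an assumption made for the sketch: the keys, values, and query are invented, the attention output is read directly as logits over a two-token vocabulary, and "keep the entries with the highest attention weight" is just one simple eviction heuristic, not any specific published method.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attend(query, keys, values):
    """Single-head scaled dot-product attention. Returns (output, attention weights)."""
    scale = math.sqrt(len(query))
    weights = softmax([sum(q * k for q, k in zip(query, key)) / scale for key in keys])
    out = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(len(values[0]))]
    return out, weights

def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Invented cache: five past positions with 2-dimensional keys and values.
keys = [[4.0, 0.0], [0.0, 4.0], [0.1, 0.1], [-0.2, 0.1], [3.5, 0.5]]
values = [[2.0, 0.0], [0.0, 2.0], [0.3, 0.3], [0.1, 0.2], [1.8, 0.1]]
query = [1.0, 0.2]

full_out, weights = attend(query, keys, values)
p_full = softmax(full_out)   # toy simplification: attention output used as logits

def kl_after_keeping(indices):
    """KL between the full-cache distribution and the one from a reduced cache."""
    out, _ = attend(query, [keys[i] for i in indices], [values[i] for i in indices])
    return kl_divergence(p_full, softmax(out))

ranked = sorted(range(len(keys)), key=lambda i: weights[i], reverse=True)
kl_top = kl_after_keeping(sorted(ranked[:3]))      # keep the 3 most-attended entries
kl_bottom = kl_after_keeping(sorted(ranked[-3:]))  # keep the 3 least-attended entries

print("KL keeping most-attended: ", kl_top)
print("KL keeping least-attended:", kl_bottom)
```

Both reduced caches are the same size. One barely moves the output distribution and the other wrecks it, so the memory saving is identical and the behavioral cost is not.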

7. Cartridges: training the compressed state offline

The most direct instantiation I have seen is Cartridges (Eyuboglu et al., Hazy Research, 2025). Instead of placing a whole corpus in context, they train a small KV cache offline, once, and load it at inference time. The cost amortizes across every future query against that corpus.

The detail that matters for this essay is how they train it. Naive next-token prediction on the corpus does not work; it underperforms in-context learning. What works is a recipe they call self-study: generate synthetic conversations about the corpus, then train the cartridge with a context-distillation objective, matching the behavior of the model that had the full context. On long-context benchmarks the trained cache matches in-context performance while using roughly 38x less memory and delivering roughly 26x higher serving throughput.

Read that finding against the thesis. Training the compressed state to reproduce the text fails. Training it to reproduce the behavior succeeds. That is the whole argument of this essay, recovered empirically: preserve the distribution, not the words. Cartridges is the bottleneck solved by gradient descent on the relevance term directly.

What OpenAI and Anthropic are doing publicly

The commercial labs are clearly building around this, though they do not use these words.

OpenAI’s docs talk about reasoning models, reasoning effort, reasoning tokens, and compaction for long-running interactions. The abstraction is “more reasoning budget buys more internal computation, but more context costs more,” so the system needs to preserve useful state without carrying the full transcript. OpenAI also supports prompt caching, which is adjacent but different: caching says “same prefix, reuse computation,” while compression says “different, shorter prefix, same behavior.” Caching avoids recomputing context. Compression replaces it.

Anthropic exposes extended thinking, summarized thinking, prompt caching, and context editing or compaction for long-running agents. Extended thinking gives the model more internal room. Summarized thinking changes what the user sees. Context editing addresses agents accumulating too much history.

Both companies are visibly moving from “chat transcript” toward “agent state.” But neither publicly claims the strict version: for arbitrary C, construct Z with P(. | C) = P(. | Z) and length(Z) << length(C). That remains a research frontier.

The labs to watch

Three clusters, by approach. Microsoft Research owns the production-grade prompt compression branch through LLMLingua: cut prompt length, hold task performance, reduce cost and latency now. Danqi Chen’s group at Princeton owns the latent-context branch through AutoCompressors, treating compressed context as model-native state rather than shorter text. The Stanford line through Jesse Mu, Xiang Lisa Li, and Noah Goodman’s gist tokens is the cleanest bridge from prompts to learned compressed representations.

I would add two more. The latent-reasoning cluster around Coconut (Hao, Sukhbaatar, Hu, Weston, Tian) is the one most directly about replacing verbal traces with continuous computation. And Chris Ré’s group at Stanford, via Cartridges, is the one turning the offline-trained compressed state into something that actually ships.

Why this matters

Today, long-context models create the illusion that the problem is solved. Just make the window bigger.

But bigger windows are not a theory of memory.

They are a larger desk.

A good cognitive system should not need to reread every note it has ever taken. It should maintain state: which facts are live, which constraints are binding, which uncertainties matter, which old tokens were merely scaffolding. The goal is not infinite context. The goal is state:

Z* = argmin_Z length(Z)
subject to D_KL(P(. | C) || P(. | Z)) ≤ ε

If ε = 0, this is exact behavioral losslessness. If ε > 0, it is approximate thought compression. Most practical systems will live in the approximate regime, and even that would be transformative.

The hard limits

There is no free lunch, but the usual statement of the limit is the wrong one.

The reflex is to invoke entropy: lossless compression is bounded by H(C), so you cannot shrink an arbitrary context below the information it contains. True, and irrelevant. H(C) is the entropy of the whole conversation, most of which the thesis already concedes is irrelevant to the future. The conversation is full of phrasing, politeness, and dead branches that move the continuation not at all.

The binding constraint is not H(C). It is the predictive information: how much of that entropy actually tells you about the future. Bialek, Nemenman, and Tishby named it precisely, as the mutual information between the past and the future of a sequence:

I_pred = I(C ; Y)

This is the floor that matters, with length measured in bits and taken on average over conversations:

length(Z*) ≥ I(C ; Y),  not H(C)

And I(C; Y) can be vastly smaller than H(C). That gap is not a technicality. It is the entire reason the dream is plausible rather than crazy. You are not compressing the conversation. You are compressing the part of the conversation that predicts what comes next, and most conversations carry far less predictive information than they do raw text.
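For readers who want the chain of inequalities behind that floor, here is a short derivation. It assumes Z is a discrete, deterministic function of C, that length is the expected length in bits of a uniquely decodable encoding of Z, and that Z* is exactly sufficient. These are standard information-theoretic steps, written out under those assumptions.

```latex
\begin{aligned}
\mathbb{E}[\operatorname{length}(Z)] &\ge H(Z)
  && \text{(source coding bound)} \\
H(Z) &\ge I(Z;Y)
  && \text{(since } I(Z;Y) = H(Z) - H(Z \mid Y)\text{)} \\
I(Z;Y) &\le I(C;Y)
  && \text{(data processing, } Y \to C \to Z\text{)} \\
I(Z^*;Y) &= I(C;Y)
  && \text{(sufficiency: } P(Y \mid C) = P(Y \mid Z^*)\text{)} \\
\Rightarrow\quad \mathbb{E}[\operatorname{length}(Z^*)] &\ge I(C;Y),
  \qquad \text{while } I(C;Y) \le H(C).
\end{aligned}
```

The last line is the whole point: the bound that binds is the predictive information, and it sits below the entropy of the conversation, often far below.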

The harder question is what “same behavior” even means, because it has several non-equivalent definitions:

Same next token: P(x_{t+1} | C) = P(x_{t+1} | Z)
Same full answer: P(y | C) = P(y | Z) for whole sequences y
Same decision: argmax_y P(y | C) = argmax_y P(y | Z)
Same uncertainty: H(P(. | C)) = H(P(. | Z))
Same behavior across every possible future user turn: ∀u: P(y | C, u) = P(y | Z, u)
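A quick sketch showing that these definitions really do come apart. The three distributions are invented for the purpose, and the tolerance values are arbitrary choices of mine.

```python
import math

def entropy(p):
    """Shannon entropy in nats."""
    return -sum(v * math.log(v) for v in p.values() if v > 0)

def kl_divergence(p, q):
    return sum(p[t] * math.log(p[t] / q[t]) for t in p if p[t] > 0)

def same_decision(p, q):
    """Same argmax: the single most likely answer agrees."""
    return max(p, key=p.get) == max(q, key=q.get)

def same_uncertainty(p, q, tol=1e-9):
    """Same entropy: equally confident, though possibly about different things."""
    return abs(entropy(p) - entropy(q)) <= tol

def same_distribution(p, q, tol=1e-9):
    """The strict criterion: KL between the two distributions is (near) zero."""
    return kl_divergence(p, q) <= tol

p_c = {"merge": 0.55, "revert": 0.40, "wait": 0.05}             # full context
p_overconfident = {"merge": 0.95, "revert": 0.03, "wait": 0.02}  # same pick, less doubt
p_swapped = {"merge": 0.40, "revert": 0.55, "wait": 0.05}        # same doubt, other pick

for name, p_z in (("overconfident", p_overconfident), ("swapped", p_swapped)):
    print(name,
          same_decision(p_c, p_z),
          same_uncertainty(p_c, p_z),
          same_distribution(p_c, p_z))
```

A compressed state can keep the decision and lose the doubt, or keep the doubt and flip the decision. Only the distributional test catches both, which is why the choice of criterion has to be made explicitly.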

Exact equality across all futures is almost certainly impossible in general. But task-relative sufficiency is reachable. For a coding agent, preserve repo state, failing tests, current hypothesis, plan, constraints. For a research agent, preserve claims, sources, uncertainties, open questions, contradictions. For a personal assistant, preserve preferences, commitments, deadlines, relationships. Not universal losslessness. Sufficiency relative to a future distribution. And that may be enough.

The conclusion

The next frontier in AI reasoning may not be longer thoughts. It may be denser ones.

We started with prompts, then chain-of-thought, then long context. The obvious consequence is now arriving: if every useful interaction becomes a giant transcript, intelligence becomes expensive to maintain. The future probably looks less like “put the whole conversation back into the prompt” and more like “maintain a compact, model-native state that preserves what matters.”

Call it context compaction, latent memory, gist tokens, soft prompts, compressed chain-of-thought, KV-cache compression, cartridges, or the information bottleneck. The phrase I like is perplexity-preserving thought compression.

Because the real question was never whether we can make the transcript shorter.

It is whether we can remove the words without changing the mind.

About SG

I run Dobby Ads, an AI Creative Agency. I tend to overthink. This is where that overthinking goes. Connect with me on LinkedIn.

SGISTIC
