Transformer · Autoregressive Decoding

Token Prediction Engine

Watch an LLM build text, one token at a time

How Token Prediction Works
1. You provide a starting text. This becomes the context window, the sequence of tokens the model can see. Everything the model knows about what to say next comes from this window.
2. The model reads every token and assigns attention. Not all words matter equally for predicting what comes next. The model highlights the most relevant tokens with a golden glow, showing where it is “looking” most closely.
3. It produces a probability distribution. The model does not just pick one word. It calculates a probability for every possible next token in its vocabulary (tens of thousands of options), then shows you the top candidates. Wider bars mean the model is more confident in that choice.
4. One token is sampled and appended. The chosen token gets added to the context, and the whole process repeats. This is why it is called autoregressive: each prediction feeds back into the next one. Every sentence an AI has ever written was built exactly this way.

In plain English: the model reads what it has so far, predicts a probability distribution over possible next tokens, samples one, appends it, and repeats. One token at a time, every time. There is no grand plan for the whole sentence. Each step is a fresh prediction based solely on what came before.
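The loop above can be sketched in a few lines. This is a minimal illustration, not real model code: `predict_next` is a hypothetical stand-in for the model, taking the context and returning a probability for each candidate token.

```python
import random

def generate(prompt_tokens, predict_next, max_new_tokens=10):
    """Autoregressive decoding: read, predict, append, repeat."""
    context = list(prompt_tokens)          # the context window
    for _ in range(max_new_tokens):
        # predict_next returns {token: probability} for the current context
        dist = predict_next(context)
        tokens, probs = zip(*dist.items())
        # sample one token from the distribution (not always the top choice)
        next_token = random.choices(tokens, weights=probs, k=1)[0]
        context.append(next_token)         # the prediction feeds back in
    return context
```

Notice that nothing persists between steps except the growing context list; that is the model's entire memory.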

Why this matters: this is the exact mechanism behind every response from GPT, Gemini, Claude, and any other large language model. When an AI writes a poem, answers a question, or translates a sentence, it is running this same loop: read, predict, append, repeat. Understanding this loop means understanding the core of how modern AI “thinks.”

What Are Tokens?

A token is not always a full word. It can be a word, part of a word, or even just punctuation. The word “understanding” might be split into “under” + “standing.” Models work with tokens, not words, because it lets them handle any language or text pattern efficiently.
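As a toy illustration of subword splitting, here is a greedy longest-match tokenizer over a tiny made-up vocabulary. Real models use learned schemes such as byte-pair encoding, so this is only a sketch of the idea.

```python
def tokenize(text, vocab):
    """Toy greedy longest-match subword tokenizer (real models use BPE or similar)."""
    tokens = []
    i = 0
    while i < len(text):
        # find the longest vocabulary entry that matches at position i
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # unknown text falls back to single characters
            i += 1
    return tokens

vocab = {"under", "standing", "stand", "ing"}
tokenize("understanding", vocab)  # → ["under", "standing"]
```

Because the fallback is single characters, any input can be tokenized, which is exactly why models work at the token level rather than the word level.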

The context window you see above is the model’s entire memory for this conversation. Everything outside it simply does not exist for the model.

What Is Attention?

Attention is how the model decides which tokens matter most for the current prediction. When predicting the next word after “The capital of France is,” the model pays heavy attention to “France” and “capital” while mostly ignoring “The.”

The golden glow on each token shows its attention weight. Brighter glow means the model considers that token more important for deciding what comes next.
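Under the hood, attention weights come from a softmax: raw relevance scores are turned into positive weights that sum to 1. The scores below are made-up numbers purely for illustration.

```python
import math

def attention_weights(scores):
    """Softmax: turn raw relevance scores into weights that sum to 1."""
    m = max(scores)                              # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical relevance scores for "The capital of France is"
tokens = ["The", "capital", "of", "France", "is"]
scores = [0.1, 2.0, 0.2, 3.0, 0.5]               # invented for this example
weights = attention_weights(scores)
# "France" and "capital" end up with the largest weights (the brightest glow)
```

The key property is relative: a token's glow depends on its score compared with every other token in the window, not on the score alone.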

Why Probabilities, Not Certainties?

Language is inherently ambiguous. After “I went to the,” the next word could be “store,” “park,” “doctor,” or thousands of other options. The model assigns a probability to each one, reflecting how likely it thinks each continuation is.

This is what makes AI creative: it does not always pick the most likely option. By sampling from the distribution, it can surprise you with unexpected but perfectly sensible completions.
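Sampling is often controlled by a temperature parameter: below 1 it sharpens the distribution toward the top choice, above 1 it flattens it and makes surprises more likely. A minimal sketch, with an invented distribution for “I went to the”:

```python
import math
import random

def sample(dist, temperature=1.0):
    """Sample a token; temperature < 1 sharpens the distribution, > 1 flattens it."""
    tokens = list(dist)
    # rescale log-probabilities by temperature, then re-exponentiate
    logits = [math.log(dist[t]) / temperature for t in tokens]
    m = max(logits)                              # subtract max for numerical stability
    weights = [math.exp(l - m) for l in logits]
    return random.choices(tokens, weights=weights, k=1)[0]

# Made-up probabilities for the next word after "I went to the"
dist = {"store": 0.45, "park": 0.25, "doctor": 0.15, "moon": 0.01}
# A greedy decoder would always pick "store"; sampling sometimes picks the others
```

At very low temperature this behaves almost greedily; at high temperature even “moon” gets a real chance.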

The Big Picture

Every AI chatbot, every code assistant, every AI-generated article uses this exact loop. There is no secret intelligence hiding behind the scenes, just a model that has learned statistical patterns from enormous amounts of text, applied one token at a time.

What you see in this demo is real. Each step makes a live API call to Claude. The probabilities, the attention weights, and the chosen tokens all come from the same system that powers real AI applications.