What is a transformer? The engine under modern AI

This is the most technical page in this hub, and you do not need it to use AI well. If you just want to be a clear-eyed, confident user, the practical pages have you covered and you can skip this one. But if you’ve seen the word “transformer” and wondered what it actually means, or you’ve noticed the “T” in GPT and gotten curious, here it is, in plain words, no math required. A transformer is the particular design, the engine, that almost every major language model today is built on. One idea sits at its heart, and it’s a graspable one.

The “T” in GPT

You’ve probably typed the letters “GPT” without anyone telling you what they stand for. It’s Generative Pre-trained Transformer. Generative means it produces things — it writes. Pre-trained means it learned from a huge amount of text before you ever met it. And Transformer is the part this page is about: the kind of network the whole thing is built out of.

Here’s the useful fact to anchor on: the transformer is an architecture, a blueprint for how to wire up the system. It was introduced in a 2017 research paper titled “Attention Is All You Need,” and it worked so well that nearly all the major language models since (the ones behind ChatGPT, Claude, Gemini, and the rest) are built on it. So when people say “these models are all kind of similar under the hood,” this is a big part of what they mean. They share this engine.

The one idea: attention

If you take one thing from this page, take this. The core move a transformer makes is called attention, and the plain version is this: as the model reads, it decides which earlier words matter most for figuring out what comes next.

Think about how you read this sentence: “The trophy didn’t fit in the suitcase because it was too big.” To know what “it” means, you instinctively reach back to “trophy,” not “suitcase.” You weighed the earlier words and leaned on the one that mattered. That weighing — do that, mechanically, across every word — is the essence of attention.

Why is that a big deal? Because the older way of handling text treated it as a flat march — one word, then the next, then the next — which made it easy to lose the thread over a long passage. Attention lets the model look across the whole stretch of text at once and pull the relevant pieces forward, no matter how far back they sit. That’s how it keeps track of who “she” is three paragraphs later, or holds the thread of an argument across a long answer. It’s not reading start-to-finish and hoping to remember; it’s constantly asking, for each word, which of everything I’ve seen so far matters here?

Two caveats, because this is a simplification doing real work. First, “attention” is a technical term that covers a lot of machinery — the everyday-English word is a handle, not the full mechanism. Second, this is not the model “paying attention” the way a person does, with focus or interest. It’s a mathematical weighting of how strongly each word should pull on each other word. The name borrows a human word for something that is, underneath, arithmetic.
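
For the curious, that arithmetic can be sketched in a few lines of Python. This is a toy under loud assumptions: the three-number vectors below are invented for illustration rather than taken from any real model, and a real transformer learns much longer vectors and passes them through extra learned transformations that are left out here. The point is only the shape of the calculation: score each earlier word against the current one, then turn the scores into weights that add up to 1.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the largest score for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(current, earlier):
    """Score every earlier word against the current word, then normalize."""
    scale = math.sqrt(len(current))
    scores = [
        sum(c * e for c, e in zip(current, vector)) / scale
        for vector in earlier
    ]
    return softmax(scores)

# Hypothetical vectors, made up for this example only.
earlier_words = ["trophy", "fit", "suitcase"]
earlier_vectors = [
    [2.0, 0.5, 0.0],  # trophy
    [0.0, 0.0, 1.0],  # fit
    [0.5, 2.0, 0.0],  # suitcase
]
it_vector = [2.0, 0.0, 0.5]  # the word being resolved: "it"

weights = attention_weights(it_vector, earlier_vectors)
for word, weight in zip(earlier_words, weights):
    print(f"{word:>8}: {weight:.2f}")
```

With these made-up numbers, “trophy” gets the largest weight for “it.” Nothing in the code knows what a trophy is; the result falls out of multiplying and adding the numbers.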

Why this design took over

The transformer didn’t become the standard by accident. Two things made it win:

It handles long-range context well. Because attention can reach across a whole passage at once, the model can connect a word near the end to one near the beginning without the thread fraying. That’s a real improvement over the designs that came before.
It runs in parallel. This sounds technical, but the payoff is simple. Older designs had to process text strictly in order, one step waiting on the last. A transformer can chew through much of the work at the same time, across enormous banks of hardware. That meant these models could be made vastly bigger — trained on far more text, with far more internal capacity — than would otherwise have been practical.

And that second point connects to something you may have heard: that a lot of what made recent AI so capable was simply scale — bigger models, more data, more computing power. The transformer is the design that made scaling up actually feasible. So in plain terms: this engine is a large part of why AI got good when it did.

Whether scaling keeps paying off the way it has is a genuinely open question that researchers disagree about. We’re describing why this design mattered up to now — not predicting where it goes.

What it is, and what it isn’t

Knowing the engine is called a transformer, and that its trick is attention, does not change what these systems fundamentally do: they still produce text by predicting a likely next word, over and over. The transformer is the clever machinery that makes those predictions good across long, complicated text — it isn’t a separate mind, an understanding, or a fact-checker bolted on. It’s a better way of doing the same job. So everything you may have read on the plainer pages still holds; this one just opens the hood.
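
The “over and over” loop can be sketched in Python. Everything here is an assumption made for illustration: the probability table is invented, the sketch looks only at the single previous word, and it always picks the top-scoring option. A real model weighs the whole preceding text (that is the job attention does) and usually picks with some randomness.

```python
# Hypothetical next-word probabilities, invented for this example.
NEXT_WORD = {
    "the": {"cat": 0.6, "mat": 0.4},
    "cat": {"sat": 0.9, "the": 0.1},
    "sat": {"on": 1.0},
    "on": {"the": 1.0},
}

def generate(start, steps):
    """Repeatedly append the most likely next word."""
    words = [start]
    for _ in range(steps):
        options = NEXT_WORD.get(words[-1])
        if not options:
            break  # no known continuation, so stop early
        words.append(max(options, key=options.get))
    return words

print(" ".join(generate("the", 4)))  # prints "the cat sat on the"
```

Swapping in a better way of scoring the options changes how good the guesses are, not the loop itself.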

The one-line version: a transformer is the engine that nearly all of today’s major language models are built on (the “T” in GPT), and its core trick, called attention, is to weigh which earlier words matter most for guessing the next one; that’s how it keeps track of context across long text, and because it could be scaled up massively, it’s much of why recent AI got so capable.

Go deeper: tokens, embeddings, layers, and heads

A few more rungs of the ladder, lightly. A transformer doesn’t work on whole words but on tokens: chunks of text, often a word or part of a word. Each token is first turned into an embedding, a long list of numbers that places the token in a kind of meaning-space, so that tokens used in similar ways end up with similar numbers. Those lists of numbers are what the model actually computes on.

The system is then built as a stack of layers: the same kind of processing block repeated many times, the output of one feeding into the next. Modern models stack a great many of them. Inside each layer sits the attention step, and a transformer runs several copies of that step in parallel, called attention heads. The rough intuition is that different heads can learn to track different kinds of relationship (one might follow grammatical links, another might track which earlier noun a pronoun refers to), though in practice what each head does is messy and not cleanly labeled. All of this is governed by billions of internal numbers called parameters, set during training.

Two limits to keep in mind. First, this is still a simplified sketch: the real mathematics of attention (queries, keys, and values) is where the precision lives, and we’ve deliberately left it out. Second, exactly how a trained transformer represents and uses what it “knows” inside those layers is an active area of research, not a solved problem. People are still mapping it.
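
To see how those pieces fit together, here is a minimal Python sketch, with heavy assumptions. The tokenizer just splits on spaces, the embeddings are random numbers rather than learned ones, and each layer contains only a bare attention step with none of the learned parameters or extra processing a real layer has. The names `DIM`, `HEADS`, and `LAYERS` are hypothetical settings chosen to keep the example tiny.

```python
import math
import random

random.seed(0)  # fixed seed so the made-up numbers are repeatable

DIM = 4     # length of each embedding (real models use far more)
HEADS = 2   # attention heads per layer
LAYERS = 3  # how many layers are stacked

def tokenize(text):
    """Stand-in tokenizer: lowercase and split on spaces."""
    return text.lower().split()

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def head(vectors, lo, hi):
    """One attention head, working on positions lo..hi-1 of every vector.
    Each token blends in that slice from itself and from earlier tokens."""
    out = []
    for i, current in enumerate(vectors):
        visible = vectors[: i + 1]  # this token and everything before it
        scores = [
            sum(a * b for a, b in zip(current[lo:hi], v[lo:hi]))
            for v in visible
        ]
        weights = softmax(scores)
        out.append([
            sum(w * v[d] for w, v in zip(weights, visible))
            for d in range(lo, hi)
        ])
    return out

def layer(vectors):
    """Run every head, join their outputs, and add the result to the input."""
    width = DIM // HEADS
    outputs = [head(vectors, h * width, (h + 1) * width) for h in range(HEADS)]
    result = []
    for i, original in enumerate(vectors):
        combined = [x for head_out in outputs for x in head_out[i]]
        result.append([o + c for o, c in zip(original, combined)])
    return result

tokens = tokenize("The trophy did not fit")

# Random stand-ins for learned embeddings: one vector per distinct token.
embedding_table = {}
for tok in tokens:
    if tok not in embedding_table:
        embedding_table[tok] = [random.uniform(-1, 1) for _ in range(DIM)]

vectors = [embedding_table[tok] for tok in tokens]
for _ in range(LAYERS):
    vectors = layer(vectors)  # the output of one layer feeds the next

print(len(tokens), "tokens ->", len(vectors), "vectors of length", len(vectors[0]))
```

The shape is the part worth noticing: text becomes tokens, tokens become lists of numbers, and the same block is applied again and again, with each head looking at its own slice of those numbers.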

Where to go next

The simpler version first

If this page was a lot, start here: what a large language model actually is, in everyday words, no jargon.

What is an LLM? →

How it produces words

The next-word engine the transformer powers — walked through slowly, with examples.

How AI predicts the next word →

Where the patterns come from

What “training” really is — how a transformer learns from a huge amount of text before you ever type to it.

How AI is trained →

Look up a term

Token, parameters, and the rest of the jargon — the words on this page, defined plainly in one place.

The glossary →

This was the deep end on purpose. If you’d rather know how to use these tools well than how they’re built, that’s the more useful skill — head to using AI well or back to the Understand AI hub.