
🧠 How a Language Model Is Actually Built

Three phases, five components, and why architecture matters less than you think


When we think about models like ChatGPT or Claude, we tend to imagine a brilliant neural network architecture, an ingenious algorithm devised by geniuses. It’s the comfortable story: artificial intelligence as an elegant mathematical engineering problem.

Reality, as usual, is more down-to-earth.

A friend pointed me to this talk he’d seen circulating on X, and that led me to the official source: the lecture Yann Dubois gave at Stanford (CS229) on how large language models (LLMs) are built. I listened carefully, and afterward put together a reference and study document to help me understand it better - part of my ongoing attempt to keep learning from a world that moves faster than anyone can read. The lecture was published in August 2024 and now has more than 1.8 million views.[1] This article is, in large part, a summary of what Dubois explains in that lecture, translated into more accessible language for those of us who use these tools without building them.

Dubois opens with an observation worth underlining. According to him, five components matter when building an LLM: architecture, training algorithm, data, evaluation, and systems. And he adds that academia obsesses over the first two, while industry - the people who actually deliver models to the world - wins on the last three.

That inversion of priorities changes the story. What makes an LLM work, Dubois explains, isn’t a secret architectural discovery; it’s data, evaluation, and systems. The Transformer architecture has been public for years. Competitive advantage is usually decided in the rest.

With that framing, Dubois divides the process into three phases.

1. Pretraining: modeling the internet

This is where what the author calls the “base model” is born - think of GPT-3 before it became ChatGPT. You take a Transformer neural network and teach it one task, repeated trillions of times: predict the next token, a small chunk of text roughly the size of a word fragment.
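To make that objective concrete, here is a minimal sketch in PyTorch. It is illustrative, not Dubois’s code: model stands in for any decoder-only Transformer that maps token IDs to a score for every word in the vocabulary.

```python
# Minimal sketch of the pretraining objective: predict the next token.
# "model" is a placeholder for any decoder-only Transformer.
import torch
import torch.nn.functional as F

def next_token_loss(model, token_ids):
    # token_ids: (batch, seq_len) integer IDs from a tokenizer
    inputs = token_ids[:, :-1]   # the sequence minus its last token
    targets = token_ids[:, 1:]   # the same sequence shifted one step left
    logits = model(inputs)       # (batch, seq_len - 1, vocab_size)
    # Cross-entropy between the predicted distribution and the actual next token
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
    )
```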

To do this, you process an absurd amount of data. Dubois uses Llama 3 as a scale reference - more than 15 trillion tokens - a magnitude consistent with the later public documentation for Llama 3.1.[2] One of the main sources is Common Crawl, which publishes enormous monthly web crawls.[3] But the internet is messy. A random page from Common Crawl, Dubois says, is almost always unusable: broken HTML, forum headers repeated a thousand times, sentences cut off halfway.

That’s why the real work of pretraining, he explains, isn’t in the architecture. It’s in cleaning up the data: extracting useful text, filtering unwanted content, removing duplicates, applying rules to separate good from garbage, and training classifiers that distinguish “probably useful” from “probably noise”. At frontier teams, a substantial share of the engineering effort goes into this step alone.
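What do those rules look like? Here is a toy version of the kind of heuristic filter Dubois describes. The thresholds are invented for illustration; production pipelines combine many more signals with trained quality classifiers.

```python
# Toy heuristics for filtering a crawled page; thresholds are made up
# for illustration, not taken from any real pipeline.
def looks_usable(page_text: str) -> bool:
    lines = [line for line in page_text.splitlines() if line.strip()]
    if len(lines) < 3 or len(page_text.split()) < 50:
        return False                  # near-empty, or too short to be prose
    # Pages dominated by repeated boilerplate (menus, forum headers)
    if len(set(lines)) / len(lines) < 0.5:
        return False
    # Mostly-symbol content usually means broken HTML extraction
    alpha_ratio = sum(c.isalpha() for c in page_text) / max(len(page_text), 1)
    return alpha_ratio > 0.6
```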

At the end of this phase the model can continue almost any text coherently. What it still doesn’t know is how to behave as an assistant. Dubois gives an example that illustrates the problem: if you ask a base model “explain the moon landing to a six-year-old”, it will probably reply with another similar question, because on the internet a question is often followed by more questions, not by answers.

2. Post-training: turning it into an assistant

This is where the base model gets transformed into something like ChatGPT or Claude. Dubois describes it in two steps.

Supervised fine-tuning (SFT). You take the pretrained model and show it carefully crafted examples of instruction and ideal response. The most counterintuitive finding Dubois points to is that you do not need that much data: a few thousand carefully chosen examples can matter more than a much larger pile of mediocre ones.[4] The model is not learning new knowledge from scratch; most of that already comes from pretraining. What it is learning is which type of internet user to imitate.
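In code, SFT reuses the same next-token loss from pretraining, usually with one twist: the loss is computed only on the response tokens, so the model learns to answer rather than to re-predict the prompt. The sketch below shows one common masking convention, not any specific lab’s recipe.

```python
# Sketch of supervised fine-tuning with loss masking.
import torch
import torch.nn.functional as F

def sft_loss(model, token_ids, response_mask):
    # token_ids: (batch, seq_len), prompt and ideal response concatenated
    # response_mask: 1 where a token belongs to the response, 0 for the prompt
    inputs, targets = token_ids[:, :-1], token_ids[:, 1:]
    mask = response_mask[:, 1:].float()
    logits = model(inputs)
    per_token = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        reduction="none",
    )
    # Average the loss over response tokens only
    return (per_token * mask.reshape(-1)).sum() / mask.sum()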

Reinforcement learning from human feedback (RLHF). Dubois points out a problem with SFT: it clones written human behavior. If a human cites a reference the model never saw during pretraining, the system can internalize the wrong pattern: “when they ask me this, make up a reference that sounds plausible.” That is, hallucination by design.

RLHF flips the approach. Instead of cloning answers, the model generates two candidate responses, a human (or a model acting as judge) picks the preferred one, and algorithms like PPO (Proximal Policy Optimization) or DPO (Direct Preference Optimization) adjust the model to produce more of what people prefer. This was, according to Dubois, the big difference between GPT-3 and ChatGPT.
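Of the two algorithms, DPO is the easier one to show compactly. The sketch below follows the loss from the DPO paper (Rafailov et al., 2023); the log-probabilities are assumed to be summed over each whole response, and computing them from a model is omitted for brevity.

```python
# Sketch of the DPO objective: raise the policy's preference margin for the
# chosen response over the rejected one, relative to a frozen reference model.
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    chosen_margin = logp_chosen - ref_logp_chosen       # vs. reference model
    rejected_margin = logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```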

There is a curious consequence to this. Dubois explains that after RLHF, the system behaves less like a pure language model and more like a policy optimized to generate preferred responses. That is why perplexity (a metric that measures how much uncertainty the model has about the next word) becomes much less informative for evaluating assistants like Claude.
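For reference, perplexity is just the exponential of the average next-token loss - which is why it tracks the pretraining objective so directly, and why it drifts away from what we actually care about once the model is optimized for preferences instead. A one-liner, reusing the next_token_loss sketch from the pretraining section:

```python
import torch

def perplexity(model, token_ids):
    # exp of the average cross-entropy; lower means less surprise per token
    return torch.exp(next_token_loss(model, token_ids))
```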

3. Systems: making it work in the real world

If compute is the bottleneck - and Dubois insists it is - then making graphics processing units (GPUs) run at full throttle stops being an infrastructure detail and becomes a competitive advantage. The author highlights three optimization levers.

Low-precision arithmetic. Instead of computing in 32 bits, you compute in 16. Fewer bits mean less data movement, less memory use, and more speed. In deep learning, the extra decimals usually do not matter: there is already so much noise in training that going from 0.0134 to 0.013 rarely changes the result.
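In PyTorch, one standard way to do this is automatic mixed precision: the heavy matrix math runs in bfloat16 while the master weights stay in float32. A minimal sketch, assuming a CUDA GPU is available:

```python
import torch

model = torch.nn.Linear(4096, 4096).cuda()
x = torch.randn(8, 4096, device="cuda")

# Inside the autocast region, matmuls run in bfloat16: half the bytes
# per value compared with float32, so less memory traffic.
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    y = model(x)

loss = y.float().pow(2).mean()   # a stand-in loss, just for the example
loss.backward()                  # weight gradients stay in full precision
```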

Operator fusion. Each PyTorch operation, by default, moves data between the GPU’s general memory and its compute cores. Two operations in a row mean two round trips. Tools like torch.compile can fuse operations and reduce that memory traffic, often with noticeable gains without rewriting the model.[5]
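Using it is close to a one-liner. The toy function below is mine, just to show the shape of the API; as note [5] says, actual speedups depend on the model and hardware.

```python
import torch

def activation_block(x):
    # Two elementwise ops that, run eagerly, each make their own
    # round trip to memory
    return torch.nn.functional.gelu(x) * 2.0

compiled = torch.compile(activation_block)  # fuses kernels where it can
x = torch.randn(1024, 1024)
out = compiled(x)   # first call compiles; later calls reuse the fused kernel
```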

Inference optimizations. Inference - or, in plainer language, the moment we use the trained model - is what happens every time it answers a prompt. Every question we ask ChatGPT or Claude goes through a round of inference. Dubois underlines something many people forget: in mass-market products, serving the model to millions of users can end up costing more than training it. That is why techniques like quantization (using smaller numbers at response time), key-value caching (KV caching, reusing earlier computations), and speculative decoding matter as much as training tricks. They are, he says, the reason a smaller, well-tuned model can sometimes beat a frontier model in production.
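KV caching is the easiest of the three to picture in code. At each generation step, only the new token’s keys and values are computed; everything earlier is reused from the cache. A toy single-head sketch (shapes and names are mine, purely for illustration):

```python
import torch

def attend_step(q_new, k_cache, v_cache, k_new, v_new):
    # q_new, k_new, v_new: (batch, 1, dim) for the token being generated
    # k_cache, v_cache: (batch, steps_so_far, dim) from previous steps
    k_cache = torch.cat([k_cache, k_new], dim=1)   # append, don't recompute
    v_cache = torch.cat([v_cache, v_new], dim=1)
    scores = q_new @ k_cache.transpose(-1, -2) / k_cache.size(-1) ** 0.5
    out = torch.softmax(scores, dim=-1) @ v_cache  # (batch, 1, dim)
    return out, k_cache, v_cache
```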

What this means for those of us who use these tools

Up to this point I have summarized Dubois. Let me close with three observations I take away for my own use, as someone who relies on these tools every day without building them.

First: the model in front of us is a distillation of filtered internet. It is not an oracle. It reflects the biases and quality of its training data. When it gives an answer about something niche, there is a real chance it is guessing something that sounds convincing instead of recalling solid facts.

Second: RLHF introduces a personality. What we call “Claude’s tone” or “ChatGPT’s style” is not spontaneous emergence; it is human labeler preferences baked into the model. Understanding that makes it easier to read its answers critically - and to write better instructions to get what we want from it.

Third: the bottleneck is not the model’s intelligence; it is the cost of inference. The companies that win the next decade will not necessarily be the ones with the biggest model; they will be the ones that manage to serve it more cheaply. That has direct implications for which tools we choose for our organizations.

AI still looks like magic. But when you look inside the process, what you see is disciplined engineering: clean data, rigorous evaluation, optimized systems. Far from dimming the shine, that gives us back something important: the responsibility to use it with judgment.


Base source

Lecture: Stanford CS229 | Machine Learning | Building Large Language Models (LLMs)

Speaker: Yann Dubois - yanndubs.github.io

Published: August 28, 2024

YouTube video: Watch the lecture

Stanford Online AI programs: stanford.io/ai

Notes on figures and sources

[1] Video, date, and views. The base source for the article is Yann Dubois’s lecture on YouTube, published on August 28, 2024. The view count refers to the video page as consulted on April 18, 2026: YouTube.

[2] Llama pretraining scale. In the lecture, Dubois uses Llama 3 as a reference point for pretraining scale. As a verifiable public comparison, Meta’s official Llama 3.1 405B model card says the model was pretrained on ~15 trillion tokens from publicly available sources: meta-llama/Llama-3.1-405B-Instruct.

[3] Common Crawl. In the lecture, Dubois uses Common Crawl as an example of the scale of web data available for pretraining. Rather than freezing a monthly number that can change, I point here to two official sources: Common Crawl’s monthly archive statistics dashboard and the March 2026 release note, which reports 1.97 billion web pages: official statistics and March 2026 Crawl Archive Now Available.

[4] LIMA and SFT size. The claim that a small but carefully curated instruction set can matter more than a much larger mediocre one is grounded in LIMA: Less Is More for Alignment. The paper abstract reports a model fine-tuned on 1,000 carefully curated prompts and responses, without RLHF: arXiv.

[5] torch.compile and performance. In the lecture, Dubois presents it as an example of strong performance gains; for updated context, PyTorch’s official documentation says torch.compile can speed up code with minimal code changes, but results depend on the model, hardware, and workload. The PyTorch 2.0 page reported an average 43% training speedup across 163 open-source models on A100, and the tuning guide explains the logic behind kernel fusion: PyTorch 2.x, Introduction to torch.compile, and Performance Tuning Guide.


By Cesar Rosa Polanco - Written from a real experience, with AI’s help in shaping the draft.
