Wednesday, January 21, 2026

How Large Language Models (LLMs) Are Built

 This structured guide breaks down the five-step journey of building a Large Language Model (LLM), moving from raw internet data to a sophisticated, safe AI assistant.


1. Data Curation: "Garbage In, Garbage Out"

Before any training starts, developers must gather and clean a humongous volume of data. The motivation comes from scaling laws, which show that a model's performance improves predictably as the amount of training data and compute increases.

  • Data Collection: Scraping the internet, GitHub (code), Wikipedia, and books. Frontier GPT-class models are reportedly trained on tens of trillions of tokens.

  • Cleaning & Filtering: Removing HTML tags, illegal content, and low-quality text.

  • Deduplication: Removing repetitive information using algorithms like SHA-1 (exact matches) or MinHash/LSH (near duplicates).

  • De-identification: Removing Personally Identifiable Information (PII) to ensure privacy.

  • Human Annotation: Companies like Scale AI hire experts (lawyers, doctors) to create high-quality Q&A pairs for specialized knowledge.
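The exact-match branch of deduplication can be sketched in a few lines: hash every document and keep only the first copy of each hash. This is an illustrative toy (the function name and corpus are made up), not a production pipeline, and it covers only exact duplicates; near-duplicate detection with MinHash/LSH is more involved.

```python
import hashlib

def dedupe_exact(documents):
    """Keep only the first occurrence of each exact-duplicate document."""
    seen = set()
    unique = []
    for doc in documents:
        # SHA-1 gives a compact fingerprint; identical text -> identical digest.
        digest = hashlib.sha1(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

corpus = ["the cat sat", "the dog ran", "the cat sat"]
print(dedupe_exact(corpus))  # ['the cat sat', 'the dog ran']
```

Note that even a one-character difference defeats this check, which is exactly why near-duplicate methods like MinHash are used alongside it.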


2. Tokenization: Translating Text to Numbers

Computers do not understand words; they understand numbers. Tokenization is the process of converting raw text into "tokens" (chunks of characters).

  • Embeddings: Each token is converted into a numerical vector (embedding) that represents its meaning.

  • Byte Pair Encoding (BPE): A common method used by GPT to break words into the most frequent sub-word patterns (e.g., "eating" becomes "eat" + "ing").

  • Language Independence: Some modern models operate on raw UTF-8 bytes to avoid needing a specific tokenizer for every language.
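The core BPE idea from the bullets above can be shown with a toy example: start from single characters, repeatedly find the most frequent adjacent pair, and merge it into a new token. The helper names and the tiny corpus are invented for illustration; real tokenizers learn thousands of merges over huge corpora.

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Return the most frequent adjacent token pair."""
    return Counter(zip(tokens, tokens[1:])).most_common(1)[0][0]

def merge_pair(tokens, pair):
    """Replace every occurrence of `pair` with a single merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Start from single characters and learn two merges.
tokens = list("eating eating eats")
for _ in range(2):
    tokens = merge_pair(tokens, most_frequent_pair(tokens))
print(tokens)  # "eat" emerges as a single sub-word token
```

After two merges, "eat" has become one token, which is exactly how frequent sub-word patterns like "eat" + "ing" arise.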


3. Model Architecture: The Transformer Backbone

The Transformer architecture is the engine behind all modern LLMs.

  • Attention Mechanism: This allows the model to understand context. For example, in the sentences "bank of a river" and "bank account," the attention mechanism helps the model "attend" to surrounding words to know which "bank" is being discussed.

  • Mixture of Experts (MoE): Used by models like DeepSeek to improve efficiency. Instead of running every token through the whole network, a router activates only a small subset of specialized "expert" sub-networks for each token.

  • Optimizers & Activations: New mathematical techniques (like the Muon optimizer or SwiGLU activation) are used to make training faster and cheaper.
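The attention mechanism described above can be sketched as scaled dot-product attention for a single query. This is a minimal pure-Python illustration; the query, key, and value vectors are made-up toy numbers standing in for learned embeddings.

```python
import math

def attention(query, keys, values):
    """Scaled dot-product attention for one query vector (pure Python)."""
    d = len(query)
    # Similarity of the query with every key, scaled by sqrt(d).
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    # Softmax turns the scores into weights that sum to 1.
    exps = [math.exp(s - max(scores)) for s in scores]
    weights = [e / sum(exps) for e in exps]
    # The output is a weighted mix of the value vectors.
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]

# Hypothetical setup: the query for "bank" aligns with the key for "river",
# so the output leans toward the "river" value vector.
q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 1.0], [0.0, 0.0]]
print(attention(q, keys, values))
```

The weighting is soft: every context word contributes a little, but the most relevant one dominates, which is how "bank" gets disambiguated by its neighbors.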


4. Model Training at Scale

Training is the most expensive part of the process, costing hundreds of millions of dollars in compute power.

The Hardware

  • GPUs: Nvidia's data-center chips (such as the H100) are the industry standard; consumer RTX cards are typically used only for small-scale experiments.

  • Data Centers: Massive "Mega-factories" (like OpenAI’s Stargate project) house thousands of GPUs working in parallel.

  • CUDA: Nvidia's parallel-computing platform, programmed through C/C++ extensions, which engineers use to squeeze every bit of performance out of the GPU.

The Training Stages

  • Pre-training: Predict the next token using raw internet data. (Analogy: general reading and learning.)

  • Supervised Fine-Tuning (SFT): Teach the model to follow instructions and answer questions. (Analogy: preparing for a specific exam.)

  • Preference Tuning (RLHF/DPO): Align the model with human values such as safety and politeness. (Analogy: learning social manners and ethics.)

  • Verifiable Rewards: Use automated tests (like code compilers) to reward correct answers. (Analogy: checking math homework with a calculator.)
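The pre-training objective can be made concrete: the model outputs a probability for every possible next token, and the loss is the negative log-probability it assigned to the token that actually came next. The probability numbers below are invented for illustration.

```python
import math

def next_token_loss(predicted_probs, target_token):
    """Cross-entropy loss for one next-token prediction step.

    predicted_probs: dict mapping token -> model probability (sums to 1).
    target_token: the token that actually came next in the training text.
    """
    return -math.log(predicted_probs[target_token])

# Hypothetical model output after seeing "the cat sat on the".
probs = {"mat": 0.7, "dog": 0.2, "sky": 0.1}
print(round(next_token_loss(probs, "mat"), 3))  # 0.357
```

Training nudges the weights so this loss falls: the more probability the model places on the true next token, the closer the loss gets to zero.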

5. Evaluation: Measuring Intelligence

Testing an LLM is difficult because its output is probabilistic: the same prompt can produce a different answer on each run.

  • Semantic Matching: Using Cosine Similarity to check if the model's answer means the same thing as the expected answer, even if the words are different.

  • LLM as a Judge: Using a stronger model (like GPT-4o) to grade the answers of a smaller model.

  • Technical Benchmarks:

    • MMLU: Tests general knowledge and reasoning.

    • HumanEval / SWE-bench: Tests coding ability.

    • MATH: Evaluates complex mathematical problem-solving.
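Semantic matching via cosine similarity is easy to sketch: embed both answers as vectors and measure the cosine of the angle between them. The three-dimensional vectors below are made-up stand-ins for real embeddings, which typically have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings: a model answer, a paraphrased reference answer,
# and an unrelated sentence.
answer = [0.9, 0.1, 0.3]
expected = [0.8, 0.2, 0.35]
unrelated = [-0.1, 0.9, -0.4]
print(cosine_similarity(answer, expected))   # close to 1.0
print(cosine_similarity(answer, unrelated))  # much lower
```

A score near 1.0 marks the answer as semantically equivalent even when the wording differs, which is what makes this more robust than exact string matching.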
