Let's build GPT: from scratch, in code, spelled out.

Andrej Karpathy

Overview

This video explains the fundamental concepts behind large language models like ChatGPT by building a simplified version from scratch. It details the Transformer architecture, focusing on the self-attention mechanism. The explanation progresses from basic language modeling principles, character-level tokenization, and data preparation to implementing a rudimentary language model and then introducing the core components of the Transformer, including positional embeddings and the self-attention mechanism itself. The goal is to demystify how these powerful AI models process and generate text by walking through the code and underlying mathematical ideas.

Chapters

ChatGPT is a language model that generates text sequentially, predicting the next word or token based on previous ones.
Language models are probabilistic, meaning they can produce different outputs for the same input prompt.
The Transformer architecture, introduced in the paper 'Attention Is All You Need,' is the foundation of modern language models like GPT.
GPT stands for Generative Pre-trained Transformer, highlighting its core components.

Understanding what language models are and their probabilistic nature is crucial for appreciating their capabilities and limitations when interacting with AI.

Asking ChatGPT to write a haiku about AI and observing its sequential, probabilistic text generation.
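The probabilistic behaviour described above can be sketched in a few lines. This is a minimal illustration, not how ChatGPT is implemented: the four-word vocabulary and the scores in `logits` are made-up values, and `softmax` and `sample_next` are hypothetical helper names. The point is only that sampling from a probability distribution, rather than always taking the top score, gives different continuations for the same prompt.

```python
import math
import random

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)                          # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next(logits, rng):
    """Draw one token index at random, in proportion to its probability."""
    probs = softmax(logits)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Hypothetical scores for a 4-token vocabulary (made-up numbers).
vocab = ["the", "a", "cat", "dog"]
logits = [2.0, 1.0, 0.5, 0.1]

rng = random.Random()                        # unseeded, so results vary between runs
draws = [vocab[sample_next(logits, rng)] for _ in range(10)]
print(draws)
```

Running the script twice will usually print two different lists, with "the" appearing most often because it has the highest score.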

To understand the internals, we'll build a simplified character-level Transformer model.
We'll use a small dataset, 'tiny Shakespeare,' to train the model on character sequences.
Tokenization involves converting text into a sequence of integers; here, we use character-level tokenization.
The dataset is split into training and validation sets to monitor for overfitting.

Working with a small dataset and character-level tokenization simplifies the complex process of training a large language model, making the core mechanics more accessible.

Representing the text 'hello' as a sequence of integers based on a character-to-integer mapping derived from the 'tiny Shakespeare' dataset.
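A character-level tokenizer of the kind described above can be sketched as follows. The short string in `text` is a stand-in for the tiny Shakespeare file, so the integer ids differ from those in the video, and the 90/10 split ratio is an assumption made for illustration. The names `stoi`, `itos`, `encode`, and `decode` are chosen here for readability.

```python
# Stand-in corpus; the video trains on the full tiny Shakespeare text instead.
text = "hello world, hello there"

chars = sorted(set(text))                      # unique characters form the vocabulary
stoi = {ch: i for i, ch in enumerate(chars)}   # character -> integer
itos = {i: ch for ch, i in stoi.items()}       # integer -> character

def encode(s):
    """Map a string to a list of integer token ids."""
    return [stoi[c] for c in s]

def decode(ids):
    """Map a list of integer token ids back to a string."""
    return "".join(itos[i] for i in ids)

ids = encode("hello")
print(ids)
print(decode(ids))

# Hold out the tail of the data for validation (the 90/10 ratio is an assumption).
data = encode(text)
n = int(0.9 * len(data))
train_data, val_data = data[:n], data[n:]
```

Because the mapping is built from the corpus itself, encoding a character that never appears in `text` raises a `KeyError`; a real tokenizer would need a policy for that case.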

Training involves feeding the model chunks of text (blocks) rather than the entire dataset at once.
Each block of text contains multiple training examples, where the model predicts the next character at each position.
Data is batched for efficiency, processing multiple independent chunks of text simultaneously on GPUs.
The inputs (X) are windows of 'block size' tokens taken from the tokenized text, and the targets (Y) are the same windows shifted one position to the right.

Understanding how data is chunked, batched, and prepared for training is essential for efficient model training and for grasping how the model learns from sequential data.

A block of 9 characters contains 8 training examples, where the model learns to predict the 2nd character given the 1st, the 3rd given the 1st and 2nd, and so on, up to predicting the 9th given the first 8.
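The chunking and batching described above might look like the sketch below. It uses plain Python lists instead of GPU tensors, the token ids in `data` are placeholder integers, and `get_batch` is a name chosen for illustration. Sampling the window offsets at random is one common choice and is assumed here.

```python
import random

def get_batch(data, block_size, batch_size, rng):
    """Sample batch_size windows; each target row is its input row shifted by one."""
    starts = [rng.randrange(len(data) - block_size) for _ in range(batch_size)]
    x = [data[i : i + block_size] for i in starts]
    y = [data[i + 1 : i + block_size + 1] for i in starts]
    return x, y

data = list(range(100, 150))     # placeholder token ids, not real text
rng = random.Random(0)
xb, yb = get_batch(data, block_size=8, batch_size=4, rng=rng)

# One row of 8 inputs plus its shifted targets (9 tokens in total)
# unpacks into 8 separate (context -> next token) training examples.
for t in range(8):
    context, target = xb[0][: t + 1], yb[0][t]
    print(context, "->", target)
```

Each row of the batch is independent of the others; stacking them only lets the hardware process several windows at once.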

A bigram model predicts the next character based solely on the identity of the current character.
It uses an embedding table to represent each character and outputs logits (scores) for the next character.
The loss function (cross-entropy) measures how well the model's predictions match the actual next characters.
Text generated by a bigram model is largely incoherent even after training, because each prediction sees only a single character of context.

The bigram model serves as a simple baseline to understand basic language modeling concepts like embeddings, loss calculation, and generation before introducing more complex architectures.

Given the character 'h', the bigram model predicts the next character based only on the learned probabilities associated with 'h'.
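The bigram baseline can be sketched without any deep learning library. In this illustration the embedding table is a plain V x V list of logits, the corpus is a toy string, and training uses hand-written gradient descent with the standard softmax cross-entropy gradient (probabilities minus the one-hot target). The learning rate and step count are arbitrary assumptions, and `loss_fn` and `generate` are names chosen here.

```python
import math
import random

text = "hello hello hello"                 # toy corpus standing in for real data
chars = sorted(set(text))
V = len(chars)
stoi = {ch: i for i, ch in enumerate(chars)}
data = [stoi[c] for c in text]

# table[i][j] is the logit (score) for "token j follows token i".
table = [[0.0] * V for _ in range(V)]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def loss_fn(table, data):
    """Mean cross-entropy over all (current, next) character pairs."""
    pairs = list(zip(data[:-1], data[1:]))
    return -sum(math.log(softmax(table[a])[b]) for a, b in pairs) / len(pairs)

initial_loss = loss_fn(table, data)        # all-zero logits give a uniform guess, ln(V)

lr = 0.5                                   # arbitrary learning rate
for _ in range(200):
    for a, b in zip(data[:-1], data[1:]):
        probs = softmax(table[a])
        for j in range(V):
            grad = probs[j] - (1.0 if j == b else 0.0)
            table[a][j] -= lr * grad
final_loss = loss_fn(table, data)

def generate(table, start, n, rng):
    """Sample n more tokens, each conditioned only on the previous one."""
    out = [start]
    for _ in range(n):
        probs = softmax(table[out[-1]])
        out.append(rng.choices(range(V), weights=probs, k=1)[0])
    return out

sample = generate(table, stoi["h"], 12, random.Random(0))
print(initial_loss, final_loss)
print("".join(chars[i] for i in sample))
```

The loss falls during training, but it cannot reach zero on this corpus: 'l' is followed sometimes by 'l' and sometimes by 'o', and a model that sees only the current character has no way to tell the two cases apart.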

The Transformer overcomes the limitations of simple models by incorporating positional information and self-attention.
Positional embeddings are added to token embeddings to inform the model about the position of each token in the sequence.
The model architecture is modified to include a language modeling head that maps embeddings to vocabulary-sized logits.
The Transformer processes tokens not just by their identity but also by their position within the sequence.

Understanding positional embeddings is key to realizing how Transformers maintain sequence order, which is critical for understanding language context.

The embedding for the word 'the' at the beginning of a sentence is combined with an embedding representing 'position 1' to create a unique representation for that specific instance of 'the'.
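The combination of token and positional embeddings can be sketched as below. All sizes are tiny illustrative values, the weights are random rather than trained, and the language modeling head is written as a bare matrix product with no bias term, which is a simplification. The names `tok_emb`, `pos_emb`, `lm_head`, and `forward` are chosen for this sketch.

```python
import random

rng = random.Random(0)
vocab_size, block_size, n_embd = 5, 4, 3     # tiny illustrative sizes

def rand_matrix(rows, cols):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

tok_emb = rand_matrix(vocab_size, n_embd)    # one row per token id
pos_emb = rand_matrix(block_size, n_embd)    # one row per position in the block
lm_head = rand_matrix(n_embd, vocab_size)    # maps an embedding to vocabulary logits

def forward(ids):
    """Return one row of vocab_size logits for every position in ids."""
    logits = []
    for t, tok in enumerate(ids):
        # Token identity and position are combined by element-wise addition.
        x = [a + b for a, b in zip(tok_emb[tok], pos_emb[t])]
        row = [sum(x[k] * lm_head[k][j] for k in range(n_embd))
               for j in range(vocab_size)]
        logits.append(row)
    return logits

ids = [2, 0, 2, 1]                           # token 2 appears at positions 0 and 2
out = forward(ids)
print(out[0])
print(out[2])
```

The two printed rows differ even though both come from token 2, because the position vector added to the token embedding is different at each location.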

Self-attention lets each token weigh the importance of the tokens in the sequence (including itself) when computing its representation; in a decoder such as GPT, this is restricted to the token itself and the tokens before it.
Tokens emit 'query' and 'key' vectors; the dot product of a query with all keys determines attention scores (affinities).
These scores are masked to prevent future information leakage and then normalized (softmax) to create attention weights.
The weighted sum of 'value' vectors (also emitted by tokens) forms the output of the self-attention layer, incorporating context.

Self-attention is the core mechanism of the Transformer. Because every token can attend directly to any other token in its context, the model can capture long-range dependencies and contextual relationships that a fixed-context model such as the bigram baseline cannot.

When processing the word 'it' in 'The animal didn't cross the street because it was too tired,' self-attention helps the model determine that 'it' refers to 'the animal' by calculating high attention scores between the query of 'it' and the key of 'animal'.
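The masking and normalization steps can be isolated in a small sketch. The 3 x 3 `scores` matrix holds made-up affinities (row t is the query at position t, column s is the key at position s), and `causal_attention_weights` is a hypothetical helper name. Entries for future positions are set to negative infinity so that softmax assigns them a weight of exactly zero.

```python
import math

def causal_attention_weights(scores):
    """Mask future positions, then softmax each row into attention weights."""
    T = len(scores)
    weights = []
    for t in range(T):
        # Columns s > t are future tokens: replace their scores with -inf.
        row = [scores[t][s] if s <= t else float("-inf") for s in range(T)]
        m = max(row)                           # finite, since s == t is always kept
        exps = [math.exp(v - m) for v in row]  # exp(-inf) evaluates to 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Made-up affinities between three positions.
scores = [
    [0.5, 1.0, -1.0],
    [2.0, 0.0, 0.3],
    [1.0, 1.0, 1.0],
]
w = causal_attention_weights(scores)
for row in w:
    print(row)
```

The first row puts all of its weight on position 0, since the first token has nothing earlier to look at, and the last row spreads its weight evenly because its three scores are equal.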

A self-attention block consists of linear layers to generate queries, keys, and values from input embeddings.
The attention scores are calculated via dot products between queries and keys.
Masking (e.g., lower triangular) prevents attention to future tokens, crucial for autoregressive generation.
The final output is a weighted sum of value vectors, where weights are derived from the masked and softmaxed attention scores.

Implementing the self-attention block step-by-step reveals the intricate calculations involved in how a model dynamically focuses on different parts of the input sequence.

Calculating the dot product between the 'query' vector for the word 'was' and the 'key' vector for the word 'animal' to determine how much attention 'was' should pay to 'animal'.
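Putting the pieces together, a single causal attention head might be sketched as follows. The inputs and projection weights are random, the sizes are arbitrary, and biases are omitted. Dividing the scores by the square root of the head size follows the scaled dot-product formulation in 'Attention Is All You Need' and is an addition not spelled out in the notes above. Masking is done here by looping only over positions up to t, which has the same effect as a lower-triangular mask.

```python
import math
import random

rng = random.Random(0)
T, n_embd, head_size = 4, 6, 3               # arbitrary illustrative sizes

def rand_matrix(rows, cols):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

x = rand_matrix(T, n_embd)                   # stand-in for the input embeddings
Wq = rand_matrix(n_embd, head_size)          # linear layer producing queries
Wk = rand_matrix(n_embd, head_size)          # linear layer producing keys
Wv = rand_matrix(n_embd, head_size)          # linear layer producing values

q, k, v = matmul(x, Wq), matmul(x, Wk), matmul(x, Wv)

out = []
for t in range(T):
    # Affinity of query t with every key at positions 0..t (no future keys).
    scores = [sum(q[t][d] * k[s][d] for d in range(head_size)) / math.sqrt(head_size)
              for s in range(t + 1)]
    m = max(scores)
    exps = [math.exp(sc - m) for sc in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Output at position t is the attention-weighted sum of the visible values.
    out.append([sum(weights[s] * v[s][d] for s in range(t + 1))
                for d in range(head_size)])

print(out[0])
print(v[0])
```

The two printed rows match: the first position can only attend to itself, so its output is simply its own value vector.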

Key takeaways

1. Language models like ChatGPT are probabilistic systems that generate text by predicting the next token in a sequence.
2. The Transformer architecture, particularly its self-attention mechanism, is the driving force behind modern LLMs.
3. Tokenization converts text into numerical representations that models can process, with character-level being a simple starting point.
4. Training involves feeding the model data in batches of fixed-size blocks, where each block contains multiple prediction tasks.
5. Positional embeddings are crucial for Transformers to understand the order of tokens in a sequence.
6. Self-attention allows tokens to dynamically weigh the importance of other tokens, enabling context-aware representations.
7. The core idea of self-attention is computing affinities between 'queries' and 'keys' to create attention weights for aggregating 'values'.

Key terms

Language Model
Transformer Architecture
Generative Pre-trained Transformer (GPT)
Tokenization
Character-level Tokenizer
Block Size
Batching
Bigram Model
Embedding Table
Logits
Cross-Entropy Loss
Positional Embeddings
Self-Attention
Query
Key
Value
Attention Scores
Softmax

Test your understanding

1. How does the probabilistic nature of language models like ChatGPT influence the output for a given prompt?
2. What is the role of tokenization in preparing text data for a language model, and what are the trade-offs of character-level tokenization?
3. Explain why data is processed in batches and fixed-size blocks during the training of Transformer models.
4. How does the self-attention mechanism allow a Transformer model to understand the context of a word within a sentence?
5. What are the 'query,' 'key,' and 'value' vectors in self-attention, and how are they used to compute the final output of an attention head?