
How We Scaled Kimi K2.5 | Zhilin Yang's full GTC 2026 Keynote
Kimi AI
Overview
This presentation covers the strategies and innovations behind scaling large language models, focusing on the Kimi K2.5 model. It examines three dimensions of scaling: token efficiency, context length, and agent swarms. The Muon optimizer and QK-Clip address training efficiency and stability. The Kimi Linear architecture improves long-context understanding, and the agent swarm paradigm enables parallel task execution. The talk closes by introducing Attention Residuals as a next-generation architecture and stresses the importance of open-source collaboration in advancing AI.
Chapters
- Open models offer greater accessibility and control compared to proprietary black-box models.
- Scaling is a primary driver of AI progress, but it needs to be approached across multiple dimensions.
- Three key scaling dimensions are token efficiency (achieving lower loss with fewer tokens), context length (increasing the model's understanding of longer sequences), and the number of agents (using swarms of agents for parallel processing).
- Improving token efficiency is crucial because high-quality data is a limited resource, and better efficiency unlocks higher intelligence from existing data.
- The Muon optimizer, described in the talk as a second-order optimizer, achieves up to a 2x improvement in token efficiency by transforming (orthogonalizing) each gradient update before it is applied.
- Two techniques make Muon scale to larger models: weight decay, and adjustable per-matrix coefficients that keep the update RMS consistent across parameters.
- A distributed implementation of Muon was developed for memory efficiency across GPU clusters.
- Scaling to larger models (e.g., 1 trillion parameters) exposed training instabilities, such as exploding maximum attention logits followed by divergence.
- The QK-Clip technique stabilizes training by capping the maximum attention logit: when it exceeds a threshold, the query and key projection weights are scaled down, which prevents explosions without hurting convergence.
- Transformers capture long contexts better than older architectures such as LSTMs, as shown by loss that keeps decreasing at later token positions.
- Longer context is essential for complex tasks such as understanding entire codebases or managing long agent trajectories.
- The Kimi Linear architecture is designed to scale efficiently to longer context lengths.
- It uses a new attention variant, Kimi Delta Attention, whose decay factor is fine-grained and channel-wise (a diagonal matrix) rather than a single global scalar, so each channel can retain or forget information selectively.
- A chunkwise formulation, combined with a mathematical reformulation, allows an exact and efficient implementation on modern GPUs without sacrificing performance.
- The agent swarms paradigm orchestrates multiple sub-agents, managed by a main agent or orchestrator, to accomplish complex tasks in parallel.
- This approach is analogous to human organizations, where different roles collaborate towards a common goal.
- Agent swarms significantly reduce execution time for complex tasks compared to single agents.
- New objective functions are introduced: instantiation reward (to encourage parallel execution), finish reward (to ensure sub-tasks are completed meaningfully), and outcome reward (for overall task completion).
- The infrastructure supports parallel execution and multiple reward functions to optimize the agent swarm system.
- Kimi K2.5 is an open model featuring native joint vision-text capabilities, achieved through "early fusion" from day one of training.
- This contrasts with "late fusion" where vision capabilities are added onto a pre-trained text model.
- Early fusion enables emergent capabilities such as vision-to-code generation, which require the modalities to be merged into a single representation.
- Vision and text modalities can mutually enhance each other; vision tasks can improve text reasoning, and a strong text base allows for state-of-the-art vision performance with zero vision-specific fine-tuning data.
- Training for K2.5 was exceptionally stable, even across 30 trillion tokens. The talk attributes this to innovations such as the Muon optimizer, and the result is a robust base model.
- The Attention Residuals architecture takes a technique proven along the temporal (sequence) dimension, attention, and applies it to the depth dimension, where ResNet-style residual connections are the standard.
- It generalizes residual connections by using attention over all previous hidden states to compute the current layer's output, rather than relying only on the immediately preceding state.
- A Block Attention Residuals variant reduces computational overhead by applying attention residuals only between blocks of layers, while using standard residuals within each block.
- The new architecture improves token efficiency by up to 24% and lowers validation loss.
- It achieves state-of-the-art performance on coding, math, and reasoning tasks.
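The generalization described in the last chapter can be illustrated with a minimal sketch in plain Python. It assumes each layer owns a learned query vector and scores every earlier hidden state with a dot product. The names `attention_residual`, `toy_layer`, and `forward` are hypothetical, and the tanh sublayer is only a stand-in, so this is an illustration under those assumptions rather than the architecture presented in the talk.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention_residual(prev_states, query):
    """Mix all earlier hidden states with softmax weights.

    A standard residual stream adds every earlier contribution with a
    fixed weight; here the weights depend on a per-layer query (assumed).
    """
    weights = softmax([dot(query, h) for h in prev_states])
    dim = len(prev_states[0])
    mixed = [
        sum(w * h[d] for w, h in zip(weights, prev_states))
        for d in range(dim)
    ]
    return mixed, weights


def toy_layer(x, scale):
    # Hypothetical stand-in for an attention or MLP sublayer.
    return [math.tanh(scale * v) for v in x]


def forward(x0, layer_scales, queries):
    """Run a stack of toy layers, each reading from all earlier states."""
    states = [x0]
    for scale, query in zip(layer_scales, queries):
        layer_input, _ = attention_residual(states, query)
        states.append(toy_layer(layer_input, scale))
    return states
```

With an all-zero query the weights are uniform and the layer simply sees the mean of all earlier states. A non-zero query lets a layer emphasize whichever earlier states it scores highest, which is the flexibility a fixed residual sum lacks.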
Key takeaways
- Open models are crucial for democratizing AI, offering transparency and flexibility.
- Scaling AI models requires innovation across multiple dimensions: data efficiency, context handling, and parallel processing.
- Advanced optimizers such as Muon and stabilization techniques such as QK-Clip are essential for stable and efficient training of trillion-parameter models.
- Long-context understanding is a critical capability for AI, enabled by architectures like Kimi Linear that selectively retain and forget information over extended sequences.
- Agent swarms represent a paradigm shift towards parallel AI problem-solving, significantly increasing task capacity and efficiency.
- Native integration of modalities (like vision and text) from the start of training leads to emergent capabilities and synergistic performance improvements.
- New architectures like Attention Residuals build on established principles (residual connections, attention) to unlock further gains in model depth, efficiency, and performance.
- The open-source community plays a vital role in rapidly advancing AI by iterating on and improving foundational techniques.
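To make the takeaway about QK-Clip concrete, here is a minimal single-head sketch in plain Python. It assumes the rescaling form of the technique: when the largest attention logit `s_max` exceeds a threshold `tau`, the query and key weight matrices are each multiplied by `sqrt(tau / s_max)`, so the largest logit lands on the threshold. The function names, the threshold values, and the toy matrices are illustrative assumptions. Causal masking and multi-head layouts are left out.

```python
import math


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def max_attention_logit(Wq, Wk, tokens):
    """Largest scaled q.k logit over all token pairs for one head."""
    head_dim = len(Wq)
    queries = [matvec(Wq, t) for t in tokens]
    keys = [matvec(Wk, t) for t in tokens]
    return max(
        sum(a * b for a, b in zip(q, k)) / math.sqrt(head_dim)
        for q in queries
        for k in keys
    )


def qk_clip(Wq, Wk, tokens, tau):
    """Rescale Wq and Wk when the max logit exceeds tau (assumed form).

    Logits are bilinear in (Wq, Wk), so scaling each matrix by
    sqrt(gamma) scales every logit by gamma.
    """
    s_max = max_attention_logit(Wq, Wk, tokens)
    if s_max <= tau:
        return Wq, Wk, 1.0  # below threshold: weights are left untouched
    gamma = tau / s_max
    root = math.sqrt(gamma)
    new_Wq = [[w * root for w in row] for row in Wq]
    new_Wk = [[w * root for w in row] for row in Wk]
    return new_Wq, new_Wk, gamma
```

Because the correction acts on the weights and only fires above the threshold, a head whose logits stay in range is trained exactly as before, which is consistent with the claim that convergence is not harmed.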
Test your understanding
- How does improving token efficiency contribute to advancing AI intelligence, especially given limited high-quality data?
- What technical challenges arise when scaling models to trillions of parameters, and how do techniques like QK clip address them?
- Why is increasing the context length of language models important for tackling complex tasks, and how does Kimi Linear achieve this?
- In what ways does the agent swarms paradigm differ from single-agent approaches, and what are its key benefits?
- What is the significance of "early fusion" in Kimi K2.5's vision-text capabilities, and how does it lead to emergent properties?
- How does the Attention Residuals architecture generalize or improve upon standard residual connections in deep learning?
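As a study aid for the question on long context, the toy recurrence below contrasts a single scalar decay with a channel-wise decay vector. It is a simplified gated linear-attention update, S ← diag(decay)·S + k vᵀ, and it deliberately omits the delta-rule correction and the chunkwise GPU formulation mentioned in the talk. The function names and numbers are hypothetical.

```python
def decayed_state_update(state, key, value, decay):
    """One recurrent step: S <- diag(decay) @ S + outer(key, value).

    state[i][j] is indexed by key channel i and value channel j.
    decay[i] scales row i, so each key channel forgets at its own rate.
    A global scalar decay is the special case where all entries match.
    """
    return [
        [decay[i] * state[i][j] + key[i] * value[j] for j in range(len(value))]
        for i in range(len(key))
    ]


def read(state, query):
    """Read out o = S^T q for a query vector."""
    return [
        sum(query[i] * state[i][j] for i in range(len(query)))
        for j in range(len(state[0]))
    ]
```

Starting from a state of `[[1.0], [1.0]]`, a channel-wise decay of `[1.0, 0.0]` keeps the first channel intact and erases the second, while a scalar decay of `0.5` can only shrink both channels by the same amount. That per-channel control is what allows selective retention and forgetting.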