How We Scaled Kimi K2.5 | Zhilin Yang's full GTC 2026 Keynote

Kimi AI

Overview

This presentation details the strategies and innovations behind scaling large language models, focusing on the Kimi K2.5 model. It covers three dimensions of scaling: token efficiency, context length, and agent swarms. The Muon optimizer and QK clip address training efficiency and stability, respectively. The Kimi Linear architecture improves long-context understanding, and the agent swarms paradigm enables parallel task execution. Finally, the talk introduces "Attention Residue" as a next-generation architecture and highlights the importance of open-source collaboration in advancing AI.

Chapters

Open models offer greater accessibility and control compared to proprietary black-box models.
Scaling is a primary driver of AI progress, but it needs to be approached across multiple dimensions.
Three key scaling dimensions are token efficiency (achieving lower loss with fewer tokens), context length (increasing the model's understanding of longer sequences), and the number of agents (using swarms of agents for parallel processing).
Improving token efficiency is crucial because high-quality data is a limited resource, and better efficiency unlocks higher intelligence from existing data.

Understanding these scaling dimensions is fundamental to comprehending how AI models are becoming more capable and accessible, driving innovation in the field.

The presentation contrasts scaling only the number of training tokens (which leads to lower loss) with improving token efficiency, which shifts the loss curve leftward, meaning less data is needed for the same performance.
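
The leftward shift can be made concrete with a toy loss curve. The sketch below assumes a power-law form, L(D) = E + B / (k * D)^b, where D is the number of training tokens and k is a token-efficiency multiplier. The functional form and every constant are illustrative assumptions, not figures from the talk.

```python
def loss(tokens, efficiency=1.0, irreducible=1.7, scale=400.0, exponent=0.28):
    """Toy power-law loss curve; all constants are made up for illustration."""
    return irreducible + scale / (efficiency * tokens) ** exponent


def tokens_for_loss(target, efficiency=1.0, irreducible=1.7, scale=400.0, exponent=0.28):
    """Invert the toy curve: how many tokens are needed to reach `target` loss."""
    return (scale / (target - irreducible)) ** (1.0 / exponent) / efficiency


if __name__ == "__main__":
    baseline = tokens_for_loss(2.0)
    improved = tokens_for_loss(2.0, efficiency=2.0)
    # Doubling token efficiency halves the tokens needed for the same loss.
    print(f"baseline: {baseline:.3e} tokens, 2x efficiency: {improved:.3e} tokens")
```

Under this model, training on more tokens moves along the curve, while a higher `efficiency` moves the whole curve left: the same loss is reached with proportionally fewer tokens.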

The Muon optimizer is a second-order optimizer that achieves up to a 2x improvement in token efficiency by transforming gradient updates.
Key techniques for scaling Muon include weight decay for larger models and adjustable coefficients that keep the RMS of updates consistent.
A distributed implementation of Muon was developed for memory efficiency across GPU clusters.
Scaling to larger models (e.g., 1 trillion parameters) revealed training instabilities, such as exploding maximum attention logits and loss divergence.
The QK clip technique was introduced to stabilize training by clipping the maximum values of query and key projections, preventing explosions without negatively impacting convergence.
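
A minimal sketch of the "transforming gradient updates" idea, assuming the optimizer described above is the publicly documented Muon optimizer, which orthogonalizes the momentum matrix before applying it. For clarity this uses the simple cubic Newton-Schulz iteration rather than the tuned polynomial that production implementations use. The function names, the hyperparameters, and the 0.2 RMS-matching factor are assumptions for illustration, not values confirmed by the talk.

```python
import math


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def orthogonalize(g, steps=15):
    """Push all singular values of g toward 1 with cubic Newton-Schulz."""
    norm = math.sqrt(sum(v * v for row in g for v in row)) + 1e-12
    x = [[v / norm for v in row] for row in g]  # singular values are now <= 1
    for _ in range(steps):
        xxt_x = matmul(matmul(x, transpose(x)), x)
        # X <- 1.5 * X - 0.5 * (X X^T) X
        x = [[1.5 * xv - 0.5 * yv for xv, yv in zip(xr, yr)]
             for xr, yr in zip(x, xxt_x)]
    return x


def muon_step(weight, grad, momentum, lr=0.02, beta=0.95,
              weight_decay=0.1, rms_scale=0.2):
    """One hypothetical Muon-style update for a 2D weight matrix."""
    rows, cols = len(weight), len(weight[0])
    momentum = [[beta * m + g for m, g in zip(mr, gr)]
                for mr, gr in zip(momentum, grad)]
    update = orthogonalize(momentum)
    # A semi-orthogonal matrix has RMS 1/sqrt(max(rows, cols)); this factor
    # rescales the update so its RMS is about rms_scale for any matrix shape.
    scale = rms_scale * math.sqrt(max(rows, cols))
    new_weight = [[w - lr * (scale * u + weight_decay * w)
                   for w, u in zip(wr, ur)]
                  for wr, ur in zip(weight, update)]
    return new_weight, momentum
```

The shape-dependent `scale` and the `weight_decay` term correspond to the two scaling techniques listed above: without them, matrices of different shapes would receive updates of different magnitudes and weights could grow without bound.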

These innovations are critical for efficiently training massive models, pushing the boundaries of AI performance by maximizing the utility of every training token and ensuring stable, scalable training processes.

When training a 1-trillion-parameter model, the max logits quickly exceeded 1,000 and the training loss diverged. Applying QK clip stabilized training by holding the max logit at a roughly constant value of about 100, allowing training to proceed smoothly.
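
One way to realize this kind of clipping is to rescale the query and key projection weights whenever the largest observed attention logit exceeds a threshold. The sketch below assumes that rule, for a single attention head; the helper names, the square-root split of the scaling factor between queries and keys, and the toy weights are illustrative assumptions. Only the threshold of about 100 comes from the talk.

```python
import math


def project(x, w):
    """Multiply a row vector x by a weight matrix w of shape (d_in, d_head)."""
    return [sum(xi * wij for xi, wij in zip(x, col)) for col in zip(*w)]


def max_logit(tokens, w_q, w_k):
    """Largest scaled dot-product logit over all query/key pairs."""
    d_head = len(w_q[0])
    queries = [project(t, w_q) for t in tokens]
    keys = [project(t, w_k) for t in tokens]
    return max(sum(a * b for a, b in zip(q, k)) / math.sqrt(d_head)
               for q in queries for k in keys)


def qk_clip(w_q, w_k, observed_max, threshold=100.0):
    """Shrink W_q and W_k so the observed max logit falls back to threshold."""
    if observed_max <= threshold:
        return w_q, w_k  # nothing to do; training is unaffected
    factor = math.sqrt(threshold / observed_max)  # split evenly between q and k
    shrink = lambda w: [[v * factor for v in row] for row in w]
    return shrink(w_q), shrink(w_k)


if __name__ == "__main__":
    tokens = [[1.0, 2.0], [3.0, -1.0]]
    w_q = [[20.0, 0.0], [0.0, 20.0]]
    w_k = [[20.0, 0.0], [0.0, 20.0]]
    before = max_logit(tokens, w_q, w_k)
    w_q, w_k = qk_clip(w_q, w_k, before)
    print(before, max_logit(tokens, w_q, w_k))
```

Because the logit is bilinear in the two weight matrices, scaling each by the square root of the ratio brings the maximum exactly back to the threshold on the observed batch, and the rule is a no-op whenever logits stay below it.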

Transformers inherently possess a better capability for capturing longer contexts than older architectures like LSTMs, as evidenced by decreasing loss with increasing token index.
Longer context is essential for complex tasks such as understanding entire codebases or managing long agent trajectories.
The "Kimi Linear" architecture is designed to scale efficiently to longer context lengths.
It uses a novel "Kimi Delta Attention" variant with a fine-grained, channel-wise decay factor (a diagonal matrix) instead of a single global scalar, allowing the model to retain some channels of its memory while forgetting others.
A chunkwise formulation and mathematical reformulation enable efficient, exact implementation on modern GPUs without sacrificing performance.
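
The effect of a channel-wise decay can be seen in a naive, token-by-token version of a gated delta-rule memory. The sketch below assumes the recurrence S_t = (I - beta * k k^T) Diag(alpha) S_{t-1} + beta * k v^T with output o_t = S_t^T q; this form, the function name, and the toy values are assumptions for illustration, and the code is the slow recurrent form, not the chunkwise GPU formulation mentioned above.

```python
def kda_step(state, q, k, v, alpha, beta):
    """One recurrent step of a delta-rule memory with per-channel decay.

    state: d_k x d_v matrix (list of rows); alpha: one decay value per key channel.
    """
    d_k, d_v = len(k), len(v)
    # 1. Channel-wise decay: row i of the memory is scaled by alpha[i].
    decayed = [[a * s for s in row] for a, row in zip(alpha, state)]
    # 2. What the decayed memory currently returns for key k.
    predicted = [sum(k[i] * decayed[i][j] for i in range(d_k)) for j in range(d_v)]
    # 3. Delta-rule write: correct the memory toward the new value v.
    new_state = [[decayed[i][j] + beta * k[i] * (v[j] - predicted[j])
                  for j in range(d_v)] for i in range(d_k)]
    # 4. Read with the query.
    output = [sum(q[i] * new_state[i][j] for i in range(d_k)) for j in range(d_v)]
    return new_state, output


if __name__ == "__main__":
    state = [[0.0], [0.0]]
    # Write value 5 under key channel 0.
    state, out = kda_step(state, q=[1.0, 0.0], k=[1.0, 0.0], v=[5.0],
                          alpha=[1.0, 1.0], beta=1.0)
    # Write value 7 under channel 1 while decaying only channel 0.
    state, out = kda_step(state, q=[1.0, 0.0], k=[0.0, 1.0], v=[7.0],
                          alpha=[0.5, 1.0], beta=1.0)
    print(state, out)
```

In the second step only channel 0 is halved while channel 1 is written at full strength; a single global scalar decay would have had to shrink every channel by the same amount.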

Enabling models to process and understand much longer sequences of text is vital for tackling more complex, real-world problems that require comprehensive context.

Kimi Linear can efficiently handle context lengths of 1 million tokens or more, outperforming full-attention baselines and other variants on long-context tasks while remaining efficient.

The agent swarms paradigm orchestrates multiple sub-agents, managed by a main agent or orchestrator, to accomplish complex tasks in parallel.
This approach is analogous to human organizations, where different roles collaborate towards a common goal.
Agent swarms significantly reduce execution time for complex tasks compared to single agents.
New objective functions are introduced: instantiation reward (to encourage parallel execution), finish reward (to ensure sub-tasks are completed meaningfully), and outcome reward (for overall task completion).
The infrastructure supports parallel execution and multiple reward functions to optimize the agent swarm system.

This paradigm shift from single-agent to multi-agent systems allows AI to tackle significantly more complex problems by distributing work and leveraging parallel computation, mirroring sophisticated human problem-solving.

An example scenario involves AI researchers, web developers, and fact-checkers collaborating within an agent swarm to research a topic, assemble findings, and produce a comprehensive report.
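
The orchestration pattern and the three rewards can be sketched with Python's `asyncio`. Everything here is a hypothetical stand-in: `run_subagent` simulates work instead of calling a model, and the reward weights and the way each reward is computed are assumptions chosen only to show how the three signals might be combined.

```python
import asyncio


async def run_subagent(role, task):
    """Stand-in for a sub-agent; a real one would call a model and tools."""
    await asyncio.sleep(0.01)  # simulated work
    return {"role": role, "task": task, "finished": True}


async def orchestrate(assignments):
    """Main agent: launch every sub-agent concurrently and collect the reports."""
    return await asyncio.gather(*(run_subagent(r, t) for r, t in assignments))


def swarm_reward(reports, outcome_score, w_inst=0.1, w_finish=0.2, w_outcome=1.0):
    """Combine the three reward signals with hypothetical weights."""
    instantiation = 1.0 if len(reports) > 1 else 0.0  # did it parallelize at all?
    finish = (sum(r["finished"] for r in reports) / len(reports)) if reports else 0.0
    return w_inst * instantiation + w_finish * finish + w_outcome * outcome_score


if __name__ == "__main__":
    plan = [("researcher", "survey the topic"),
            ("web developer", "build the report page"),
            ("fact-checker", "verify the claims")]
    reports = asyncio.run(orchestrate(plan))
    print(swarm_reward(reports, outcome_score=1.0))
```

Because the sub-agents run concurrently, wall-clock time is close to that of the slowest sub-task rather than the sum of all of them. The finish term is there so that spawning sub-agents that never complete their work does not earn the instantiation reward for free.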

Kimi K2.5 is an open model featuring native joint vision-text capabilities, achieved through "early fusion" from day one of training.
This contrasts with "late fusion" where vision capabilities are added onto a pre-trained text model.
Early fusion enables emergent capabilities such as vision-to-code generation, which require the modalities to be merged into a single representation.
Vision and text modalities can mutually enhance each other; vision tasks can improve text reasoning, and a strong text base allows for state-of-the-art vision performance with zero vision-specific fine-tuning data.
The training process for K2.5 was exceptionally stable, even across 30 trillion tokens, which is attributed to innovations like the Muon optimizer and resulted in a robust base model.

Integrating vision and text natively from the start unlocks new emergent abilities and demonstrates a powerful synergy where modalities enhance each other, leading to more versatile and capable AI.

Kimi K2.5 can read a video and generate a website that replicates its style. This capability emerges from joint vision-text training and is not possible when the modalities are trained separately.

The "Attention Residue" architecture takes attention, a technique from the sequence (temporal) dimension, and applies it to the depth dimension, where ResNet-style residual connections are the standard way to pass information between layers.
It generalizes residual connections by using attention over all previous hidden states to compute the current layer's output, rather than just the immediately preceding state.
A "block attention residue" variant reduces computational overhead by applying attention residue only between blocks of layers, while using standard residuals within blocks.
This new architecture demonstrates significant improvements in token efficiency (up to 24%) and validation loss.
It achieves state-of-the-art performance on coding, math, and reasoning tasks.
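
A standard residual stream adds up every earlier layer's output with a fixed weight of 1. The sketch below replaces that fixed sum with softmax weights over all previous hidden states, which is the generalization described above. The per-layer query vectors, the dot-product scoring, and the function names are assumptions for illustration, and the block variant is not shown.

```python
import math


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def depth_attention(history, query):
    """Mix all earlier hidden states using softmax weights from one query."""
    scores = [sum(q * h for q, h in zip(query, state)) for state in history]
    weights = softmax(scores)
    dim = len(history[0])
    return [sum(w * state[d] for w, state in zip(weights, history))
            for d in range(dim)]


def forward(x, layers, queries, output_query):
    """Each layer reads an attention-weighted mix of everything before it."""
    history = [x]  # the embedding counts as depth position 0
    for layer, query in zip(layers, queries):
        history.append(layer(depth_attention(history, query)))
    return depth_attention(history, output_query)


if __name__ == "__main__":
    double = lambda h: [2.0 * v for v in h]
    # Zero queries give uniform weights, i.e. a plain average over depth.
    print(forward([1.0, 3.0], [double], [[0.0, 0.0]], [0.0, 0.0]))
```

With learned queries, a deep layer can draw mostly on one particular earlier layer instead of receiving an undifferentiated sum of all of them.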

This architectural innovation represents a potential leap forward in training deeper, more efficient neural networks, building upon the success of residual connections and attention mechanisms.

Attention Residue can improve token efficiency by 24%, meaning that 50 trillion high-quality tokens are effectively equivalent to over 60 trillion tokens, leading to better performance on benchmarks like GPQA, math, and HumanEval.

Key takeaways

1. Open models are crucial for democratizing AI, offering transparency and flexibility.
2. Scaling AI models requires innovation across multiple dimensions: data efficiency, context handling, and parallel processing.
3. Advanced optimizers like Muon and techniques like QK clip are essential for stable and efficient training of trillion-parameter models.
4. Long-context understanding is a critical capability for AI, enabled by architectures like Kimi Linear that selectively manage information over extended sequences.
5. Agent swarms represent a paradigm shift towards parallel AI problem-solving, significantly increasing task capacity and efficiency.
6. Native integration of modalities (like vision and text) from the start of training leads to emergent capabilities and synergistic performance improvements.
7. New architectures like Attention Residue build upon established principles (residual connections, attention) to unlock further gains in model depth, efficiency, and performance.
8. The open-source community plays a vital role in rapidly advancing AI by iterating on and improving foundational techniques.

Key terms

Open Models, Scaling Laws, Token Efficiency, Context Length, Agent Swarms, Muon Optimizer, QK Clip, Kimi Linear, Kimi Delta Attention, Attention Residue, Early Fusion, Late Fusion, Residual Connections

Test your understanding

1. How does improving token efficiency contribute to advancing AI intelligence, especially given limited high-quality data?
2. What technical challenges arise when scaling models to trillions of parameters, and how do techniques like QK clip address them?
3. Why is increasing the context length of language models important for tackling complex tasks, and how does Kimi linear achieve this?
4. In what ways does the agent swarms paradigm differ from single-agent approaches, and what are its key benefits?
5. What is the significance of "early fusion" in Kimi K2.5's vision-text capabilities, and how does it lead to emergent properties?
6. How does the Attention Residue architecture generalize or improve upon standard residual connections in deep learning?