
CSE/DSC 234 Spring 2026 Guest Lecture: Lakshya Agrawal (UC Berkeley)
Arun Kumar
Overview
This lecture introduces GEPA, a framework for "reflective optimization" in which AI systems improve by learning from their own rollouts and from textual feedback. Unlike traditional methods that rely on massive datasets and gradient descent, GEPA optimizes an AI system by refining its prompts and system specifications. The lecture presents this approach as highly sample-efficient, able to improve performance on complex tasks with minimal data, and applicable to a range of targets, including code generation, agent design, and even the training of model weights. The core idea is to use the rich information in text-based feedback to guide the system toward better performance, automating work that previously required extensive human engineering.
Chapters
- Traditional AI training methods (pre-training, fine-tuning, RL) require vast amounts of data (trillions of tokens, thousands of examples).
- As AI tackles more complex problems, sample efficiency (learning from fewer examples) becomes a critical bottleneck, especially in domains with limited data.
- Real-world applications and tool integrations can be slow or expensive, further exacerbating the sample inefficiency problem.
- Current reinforcement learning methods lose valuable information by only using binary reward signals, ignoring detailed traces of thought, tool calls, and error messages.
- GEPA proposes 'reflective optimization', in which the AI reflects on its own past actions and the feedback it received, not just on numerical rewards.
- The AI analyzes detailed traces of its rollouts (thoughts, tool calls, errors) to diagnose failures and learn from them.
- Instead of only updating model weights, GEPA can update the AI's system prompt, where natural-language instructions can induce significant behavioral changes.
- This allows learning from as few as one rollout by correcting mistakes and refining the prompt.
- GEPA uses a genetic algorithm in which prompts are treated as 'genes' that are mutated and selected.
- It employs a multi-objective selection strategy using a Pareto frontier to balance exploration and exploitation.
- A scoring matrix tracks prompt performance across validation items, identifying the best prompts for each task.
- The system iteratively selects prompts from the Pareto frontier, runs them on dev examples, reflects on feedback, and updates the prompt pool.
- GEPA achieves significant performance improvements with far fewer rollouts than state-of-the-art methods such as GRPO.
- It automates prompt engineering, a process that can take weeks for human teams, by discovering latent task specifications and edge cases.
- GEPA can optimize systems built on proprietary, black-box models, improving performance beyond the unoptimized baseline.
- It demonstrates remarkable sample efficiency, optimizing LLMs for novel hardware accelerators with minimal initial training data.
- GEPA's 'Optimize Anything' API extends reflective optimization to any text artifact, not just prompts.
- This includes optimizing code, agent architectures, numerical parameters, and even policy optimization for data centers.
- The core idea is to use actionable side information (like compiler traces, gradients, SLA violations) as textual feedback to guide optimization.
- It offers modes for generalization, single-task optimization, and multi-task optimization, adapting to different goals.
- GEPA can automatically design and optimize agent architectures, including control flow, prompts, and multi-agent interactions.
- It automates the discovery of complex agent pipelines that significantly outperform simpler designs.
- The 'fast slow training' paradigm combines GEPA's prompt/context optimization (the fast loop) with traditional RL weight updates (the slow loop) for more robust learning.
- This hybrid approach mitigates catastrophic forgetting in weight updates and the performance plateaus seen with prompt optimization alone.
- GEPA's principles apply beyond text-only models to multimodal models and VLMs, improving tasks such as OCR and medical diagnosis.
- It works across a wide range of model scales, from small 1B-parameter models to large frontier models, often with significant cost reductions.
- GEPA can optimize subjective tasks by using LLM judges trained on human annotations, creating a data flywheel for continuous improvement.
- The core insight is that as models improve their instruction-following capabilities, precise textual specifications become increasingly critical for unlocking their full potential.
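The selection step described in the chapters above (a scoring matrix over validation items, with parents drawn from the Pareto frontier) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the framework's actual implementation: the function names, the example scores, and the rule of sampling in proportion to the number of items a prompt is best on are all chosen here for illustration.

```python
import random

def dominates(a, b):
    """True if score row a is at least as good as b on every item and better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(scores):
    """Indices of prompts whose score row is not dominated by any other row."""
    return [
        p for p, row in enumerate(scores)
        if not any(dominates(other, row) for q, other in enumerate(scores) if q != p)
    ]

def win_counts(scores, front):
    """For each frontier prompt, count the validation items where it ties the best score."""
    counts = {p: 0 for p in front}
    for i in range(len(scores[0])):
        best = max(scores[p][i] for p in front)
        for p in front:
            if scores[p][i] == best:
                counts[p] += 1
    return counts

def sample_parent(scores, rng):
    """Pick a frontier prompt to mutate, favoring prompts that win on more items."""
    front = pareto_front(scores)
    counts = win_counts(scores, front)
    return rng.choices(front, weights=[counts[p] for p in front], k=1)[0]

# Hypothetical scoring matrix: scores[p][i] = score of prompt p on validation item i.
scores = [
    [1.0, 0.0, 0.0],  # prompt 0: only prompt that solves item 0
    [0.0, 1.0, 1.0],  # prompt 1: solves items 1 and 2
    [0.0, 1.0, 0.0],  # prompt 2: dominated by prompt 1
]

parent = sample_parent(scores, random.Random(0))
print("frontier:", pareto_front(scores), "chosen parent:", parent)
```

Keeping prompt 0 on the frontier even though its average score is lowest is the point of the design: it holds the only known solution to item 0, so discarding it in favor of the best average would lose that strategy.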
Key takeaways
- AI training is increasingly bottlenecked by sample efficiency, necessitating methods that learn effectively from limited data.
- Reflective optimization, as implemented by GEPA, uses rich textual feedback (errors, traces) so that the AI can diagnose its own failures and improve autonomously.
- Updating system prompts can be a highly effective way to induce significant behavioral changes in LLMs, often more efficiently than weight updates.
- GEPA's Pareto frontier approach keeps a diverse set of candidate strategies in play, which helps the search avoid local optima and leads to more robust solutions.
- The 'Optimize Anything' framework extends reflective optimization to various text-based artifacts, enabling AI to tackle complex, non-differentiable problems.
- Combining prompt/context optimization with weight updates (fast slow training) offers a powerful paradigm for overcoming the limitations of each individual method.
- As AI models improve instruction following, precise textual specifications and prompt optimization become even more critical for maximizing performance.
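As a toy illustration of the takeaway that textual feedback carries more usable signal than a bare score, the sketch below runs a reflect-and-revise loop on a text artifact. Everything in it is hypothetical: `evaluate` stands in for running a rollout and collecting side information, `reflect` is a hard-coded stub for the LLM reflection step, and the `REQUIRED` instructions are invented for the example.

```python
# Hypothetical latent task requirements the seed prompt does not yet state.
REQUIRED = ["cite sources", "answer in JSON"]

def evaluate(prompt):
    """Toy evaluator: returns a score plus textual feedback naming one failure."""
    missing = [r for r in REQUIRED if r not in prompt]
    score = 1.0 - len(missing) / len(REQUIRED)
    feedback = "ok" if not missing else f"missing instruction: {missing[0]}"
    return score, feedback

def reflect(prompt, feedback):
    """Stub for the LLM reflection step: turn feedback into a revised prompt."""
    prefix = "missing instruction: "
    if feedback.startswith(prefix):
        return prompt + " Always " + feedback[len(prefix):] + "."
    return prompt

def optimize(seed, budget=5):
    """Revise the artifact using feedback, keeping a candidate only if it scores higher."""
    best = seed
    best_score, feedback = evaluate(best)
    for _ in range(budget):
        if best_score == 1.0:
            break
        candidate = reflect(best, feedback)
        score, candidate_feedback = evaluate(candidate)
        if score > best_score:
            best, best_score, feedback = candidate, score, candidate_feedback
    return best, best_score

best_prompt, best_score = optimize("You are a helpful assistant.")
print(best_score, best_prompt)
```

With only the numeric score, the loop would know that the seed prompt fails but not why; the feedback string is what lets each revision target a specific gap, so the toy task is solved in two evaluations of new candidates.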
Test your understanding
- How does GEPA's approach to learning from AI rollouts differ from traditional reinforcement learning?
- Explain the role of the Pareto frontier in GEPA's optimization process and why it is important for exploration.
- What does the 'Optimize Anything' framework allow AI to optimize beyond just system prompts?
- How does the 'fast slow training' paradigm combine different learning mechanisms to improve AI training?
- Why is prompt optimization expected to remain crucial even as AI models become more capable?