Running a 35B AI Model on 6GB VRAM, FAST (llama.cpp Guide)

Codacus

Overview

This video demonstrates how to run a large 35-billion-parameter AI model, Qwen 3.6 35B A3B, on a low-end, 8-year-old GPU with only 6GB of VRAM at a usable speed. The presenter details five specific optimization flags for the llama.cpp software that drastically improve performance and context window size without requiring hardware upgrades or model quantization. The video contrasts a naive approach with a series of advanced techniques, highlighting the importance of understanding model architecture (Mixture of Experts) and memory management for efficient inference on constrained hardware. It also explores a failed attempt with speculative decoding, explaining why it's not suitable for this model type.

Chapters

It's possible to run a 35B parameter AI model on a 6GB VRAM GPU at a practical speed.
The video focuses on optimizing performance using llama.cpp, a flexible inference engine.
The specific model is Qwen 3.6 35B A3B, a Mixture of Experts (MoE) model.
The test hardware is an old GTX 1060 (6GB VRAM, PCIe Gen 3) with a basic CPU and 24GB RAM, representing a minimum viable setup.

This chapter sets the stage by defining the ambitious goal and introducing the key software and hardware components, establishing that the demonstration aims to push the limits of older, accessible hardware.

A GTX 1060 GPU with 6GB of VRAM, an i3 8100 CPU, and 24GB of DDR4 RAM.

The basic approach is to split the model layers between GPU and CPU using the `-ngl` flag.
Placing only 20 layers on the GPU (`-ngl 20`) results in very slow performance (around 3 tokens/sec).
This slowness is due to constant data transfer across the PCIe bus for layers and their experts residing on the CPU.
Mixture of Experts (MoE) models have many inactive 'expert' weights that are only loaded when needed.

Understanding this baseline highlights the inefficiency of simply offloading layers without considering the model's architecture, demonstrating why a more sophisticated approach is necessary.

Using the `-ngl 20` flag to put 20 layers on the GPU, resulting in 3 tokens/sec.
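
As a rough illustration of this baseline, the sketch below assembles the command line in Python. Only `-ngl 20` comes from the text above; the binary name `llama-server` and the model filename are assumptions made for the example.

```python
import shlex

def baseline_command(model_path: str, gpu_layers: int = 20) -> list[str]:
    """Build an argv list for the naive GPU/CPU layer split."""
    return [
        "llama-server",           # assumed binary name
        "-m", model_path,         # hypothetical model path
        "-ngl", str(gpu_layers),  # number of layers offloaded to the GPU
    ]

argv = baseline_command("model.gguf")
print(shlex.join(argv))  # llama-server -m model.gguf -ngl 20
```

Keeping the arguments as a list makes it easy to pass them to `subprocess.run` later without shell quoting problems.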

MoE models are efficient because only a small subset of 'experts' is active per token.
The optimal strategy is to keep the small, fast-executing parts on the GPU and offload the large, mostly inactive expert blocks to the CPU/RAM.
The `--n-cpu-moe` flag allows pinning expert layers to the CPU.
Setting `--n-cpu-moe 41` (pinning all experts to CPU) significantly boosts speed from 3 to 10 tokens/sec.

This chapter introduces a core optimization technique that exploits the specific architecture of the MoE model, showing how intelligent placement of model components dramatically improves inference speed.

Using the `--n-cpu-moe 41` flag to move all model experts to the CPU, increasing speed to 10 tokens/sec.
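
The placement rule can be sketched as follows. llama.cpp describes `--n-cpu-moe N` as keeping the MoE weights of the first N layers on the CPU; the total layer count of 41 used here is an assumption inferred from "41 pins all experts", and the sketch assumes every other tensor is offloaded to the GPU.

```python
def expert_placement(total_layers: int, n_cpu_moe: int) -> list[str]:
    """Return where each layer's expert weights live, indexed by layer.

    Expert weights of the first `n_cpu_moe` layers stay in CPU memory;
    the remaining layers' experts follow the normal GPU offload.
    """
    pinned = min(max(n_cpu_moe, 0), total_layers)
    return ["cpu"] * pinned + ["gpu"] * (total_layers - pinned)

TOTAL_LAYERS = 41  # assumption: not stated in the video summary

all_cpu = expert_placement(TOTAL_LAYERS, 41)
print(all_cpu.count("cpu"), all_cpu.count("gpu"))   # 41 0

partial = expert_placement(TOTAL_LAYERS, 35)
print(partial.count("cpu"), partial.count("gpu"))   # 35 6
```

Lowering the value trades free VRAM for speed, since every layer whose experts stay on the GPU no longer has to be computed from system RAM.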

By default, llama.cpp uses memory mapping (`mmap`), which loads model data from disk lazily; data that is not yet in RAM triggers a disk read in the middle of inference.
Disabling `mmap` (`--no-mmap`) forces the entire model into RAM upfront, eliminating disk I/O during inference.
This change increases speed from 10 to 13.5 tokens/sec.
Lowering the CPU-pinned expert layer count (`--n-cpu-moe 35` instead of 41) moves more experts to VRAM, further boosting speed to 17 tokens/sec, but reduces the context window size.

These optimizations focus on reducing latency by improving how the model accesses its data, demonstrating that even seemingly small changes in memory management can yield substantial performance gains.

Disabling mmap (`--no-mmap`) and keeping more experts in VRAM (`--n-cpu-moe 35`) to achieve 17 tokens/sec.
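
The difference between the two loading strategies can be illustrated with Python's standard `mmap` module. This is a sketch of the general mechanism, not of llama.cpp's loader; the temporary file is a hypothetical stand-in for a model file.

```python
import mmap
import os
import tempfile

def load_eager(path: str) -> bytes:
    """Read the whole file into process memory up front (the --no-mmap idea)."""
    with open(path, "rb") as f:
        return f.read()

def read_lazy(path: str, offset: int, length: int) -> bytes:
    """Map the file and touch one slice; pages come from disk on first access."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            return mapped[offset:offset + length]

# Stand-in for a model file: 1 MiB of repeating bytes (hypothetical data).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(bytes(range(256)) * 4096)
    weights_path = tmp.name

eager = load_eager(weights_path)
lazy_slice = read_lazy(weights_path, offset=512, length=4)
os.remove(weights_path)

print(len(eager), lazy_slice == eager[512:516])
```

With the eager path the full cost is paid once at startup; with the mapped path the cost is spread across whichever accesses happen to miss RAM.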

The KV cache stores past token information and grows linearly with context length, consuming significant VRAM.
Aggressive quantization techniques like Turbo Quant (4-bit keys, 3-bit values) drastically reduce KV cache size with minimal quality loss.
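In llama.cpp, KV cache precision is set with the `--cache-type-k` (`-ctk`) and `--cache-type-v` (`-ctv`) flags; the exact type names the video uses for Turbo Quant are not captured in this summary. Shrinking the cache this way allows for much larger context windows.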
By combining Turbo Quant with adjustments to `--n-cpu-moe` (e.g., 36), a 256,000-token context window can be achieved on 6GB VRAM at the same 17 tokens/sec speed.

This section addresses a critical limitation: context window size. It shows how advanced compression techniques can enable processing of vastly larger amounts of text without sacrificing speed or quality.

Achieving a 256,000-token context window using Turbo Quantization and adjusting `--n-cpu-moe` to 36, while maintaining 17 tokens/sec.
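
A back-of-the-envelope calculation shows why the bit width of the cache matters at this context length. The model dimensions below are hypothetical, chosen only to make the arithmetic concrete, and the formula ignores the per-block scale overhead that real quantized formats carry.

```python
def kv_cache_bytes(ctx_tokens, attn_layers, kv_heads, head_dim,
                   key_bits, value_bits):
    """Approximate KV cache size: one key and one value vector per token,
    per attention layer, per KV head."""
    elements = ctx_tokens * attn_layers * kv_heads * head_dim
    return elements * (key_bits + value_bits) / 8

# Hypothetical dimensions; not the real model's configuration.
dims = dict(ctx_tokens=256_000, attn_layers=10, kv_heads=2, head_dim=256)

f16 = kv_cache_bytes(**dims, key_bits=16, value_bits=16)
compressed = kv_cache_bytes(**dims, key_bits=4, value_bits=3)

GIB = 1024 ** 3
print(f"f16 cache:         {f16 / GIB:.2f} GiB")
print(f"4-bit K / 3-bit V: {compressed / GIB:.2f} GiB")
print(f"reduction:         {f16 / compressed:.2f}x")
```

Whatever the real dimensions are, the ratio between 16-bit and 4-bit/3-bit storage stays at 32/7, so the same VRAM budget holds several times more tokens.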

Without explicit locking, the OS can page out model components from RAM to disk when memory pressure increases, causing slowdowns.
Memory locking (`mlock`) prevents the OS from moving critical model data out of RAM.
This requires configuration at three levels: Docker, the LXC container, and llama.cpp itself (`--mlock`).
Enabling memory locking ensures consistent performance over long periods, preventing the 'day three slowdown'.

This final optimization focuses on production readiness, ensuring that the system remains stable and performant over extended use, not just in short bursts.

Configuring Docker, LXC, and llama.cpp with `mlock` to prevent the OS from paging out model components from RAM.
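
Whether `--mlock` can succeed depends on the process's locked-memory limit, which containers often set low. The sketch below checks that limit from Python on a Unix system; the 20 GiB model size is a hypothetical figure, and the video's exact container settings are not given here (with Docker, one common way to raise the limit is `--ulimit memlock=-1:-1`).

```python
import resource

def fits_memlock(bytes_needed: int, soft_limit: int) -> bool:
    """True if `bytes_needed` bytes may be locked under the given soft limit."""
    return soft_limit == resource.RLIM_INFINITY or bytes_needed <= soft_limit

soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
model_bytes = 20 * 1024 ** 3  # hypothetical: 20 GiB of model data in RAM

if fits_memlock(model_bytes, soft):
    print("memlock limit is high enough for --mlock")
else:
    print(f"memlock soft limit is {soft} bytes; raise it before using --mlock")
```

Running this inside the container, rather than on the host, shows the limit that llama.cpp will actually see.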

Speculative decoding uses a smaller model to predict tokens, which are then verified in batches by the larger model.
This technique failed to improve performance on the Qwen MoE model, resulting in slower speeds (17 to 11 tokens/sec).
The failure is attributed to the MoE architecture (high expert diversity per batch) and to the model's state space model (SSM) layers, which limit parallelization.
Speculative decoding is effective for dense transformer models but not for MoE models with SSM layers.

Discussing failed attempts provides valuable negative examples, reinforcing the understanding of *why* certain optimizations work by explaining why others don't, based on architectural constraints.

Attempting speculative decoding with a smaller Qwen model as a drafter, which resulted in a speed decrease due to MoE and SSM layer complexities.
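
The draft-and-verify loop can be sketched with toy next-token functions. This shows only the greedy mechanism and how the number of verification passes depends on draft accuracy; it does not model the MoE or SSM costs the video blames for the slowdown, and all names here are hypothetical.

```python
def speculative_decode(target_next, draft_next, prompt, n_new, k=4):
    """Greedy speculative decoding on toy next-token functions.

    Returns the generated tokens and the number of target verification passes.
    """
    tokens = list(prompt)
    passes = 0
    while len(tokens) - len(prompt) < n_new:
        # 1. The draft model proposes k tokens autoregressively.
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        # 2. One target pass checks every proposed position (a real engine
        #    evaluates these as a single batch; here we simply loop).
        passes += 1
        accepted = []
        for i in range(k):
            expected = target_next(tokens + draft[:i])
            accepted.append(expected)
            if expected != draft[i]:
                break  # first mismatch: keep the target's token, drop the rest
        tokens.extend(accepted)
    return tokens[len(prompt):len(prompt) + n_new], passes

target = lambda seq: (seq[-1] + 1) % 10      # toy "large" model
good_draft = lambda seq: (seq[-1] + 1) % 10  # always agrees with the target
bad_draft = lambda seq: (seq[-1] + 2) % 10   # never agrees with the target

out_good, passes_good = speculative_decode(target, good_draft, [0], n_new=8)
out_bad, passes_bad = speculative_decode(target, bad_draft, [0], n_new=8)
print(out_good == out_bad, passes_good, passes_bad)  # True 2 8
```

The output is identical in both runs because the target model always has the final say; only the number of target passes changes. The technique pays off only when a batched verification pass costs little more than a single-token pass.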

Key takeaways

1. Large AI models can be run on modest hardware by optimizing software configurations, not just by upgrading hardware.
2. Understanding the architecture of AI models, like Mixture of Experts (MoE), is crucial for effective performance tuning.
3. Intelligent placement of model components (GPU vs. CPU/RAM) based on their computational cost and activation patterns is key.
4. Disabling OS optimizations like memory mapping (mmap) can improve inference speed by ensuring all data is readily available in RAM.
5. Advanced quantization techniques like Turbo Quant allow for massive context windows without significant quality degradation or performance loss.
6. Memory locking (`mlock`) is essential for maintaining stable performance over long inference sessions by preventing the OS from swapping model data.
7. Optimization strategies that work for standard transformer models may not apply to newer architectures like MoE with SSM layers.

Key terms

llama.cpp, Mixture of Experts (MoE), VRAM, Parameter, Inference, Layer Offloading, NGL (Number of GPU Layers), N-CPU-MOE, MMAP (Memory Mapping), KV Cache, Quantization, Turbo Quant, Context Window, MLOCK (Memory Lock)

Test your understanding

1. How does the Mixture of Experts (MoE) architecture differ from dense models, and why is this difference important for optimizing performance on limited VRAM?
2. Explain the trade-offs involved when adjusting the number of GPU layers (`-ngl`) and how it impacts both speed and context window size.
3. What is the purpose of memory locking (`mlock`), and why is it necessary for stable, long-term operation of large models on consumer hardware?
4. Why did speculative decoding fail to improve performance for the Qwen MoE model, and what architectural features contributed to this failure?
5. How can techniques like disabling `mmap` and using Turbo Quantization help overcome the limitations of low VRAM and small context windows?