Mixture of Experts in 2026: How Modern LLMs Route Tokens Efficiently

The largest language models in 2026 share a common architectural pattern: mixture of experts (MoE). GPT-5.4, Qwen 3.5, DeepSeek-V4, and Mixtral all use MoE to achieve the performance of massive models at the inference cost of much smaller ones. Understanding how MoE routing works is essential for anyone building with, deploying, or evaluating these models.

The core idea behind the mixture of experts architecture is simple: instead of activating every parameter for every token, the model selects a small subset of specialized “expert” sub-networks for each input. A 72 billion parameter model might only use 14 billion parameters per token, cutting compute costs by roughly 80% while maintaining access to the full knowledge stored across all parameters.

How MoE Routing Works

  • Expert layers replace standard feed-forward layers. In a typical transformer, each layer has one feed-forward network (FFN). In an MoE model, each layer has multiple FFNs (experts), and a routing function decides which ones to activate.
  • The router is a learned function. It takes the hidden state of each token as input and outputs a probability distribution over the available experts. The top-k experts (commonly 2, though some recent models route to as many as 8) are selected for each token.
  • Expert outputs are combined. The selected experts process the token independently, and their outputs are weighted by the router’s probabilities and summed.
  • Different tokens use different experts. A token about code might route to experts specialized in programming patterns, while a token about medical terminology routes to different experts. This specialization emerges during training.
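The four steps above can be condensed into a short sketch. This is a minimal, illustrative NumPy implementation of top-k routing, not any particular model's code; the shapes, expert count, and random linear "experts" are assumptions chosen to keep the example self-contained.

```python
# Minimal top-k MoE routing sketch in NumPy (illustrative, not a real
# model's implementation; shapes and expert count are assumptions).
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def moe_layer(hidden, router_w, experts, top_k=2):
    """hidden: (tokens, d_model); router_w: (d_model, n_experts);
    experts: list of callables mapping (d_model,) -> (d_model,)."""
    probs = softmax(hidden @ router_w)            # router distribution per token
    out = np.zeros_like(hidden)
    for t in range(hidden.shape[0]):
        top = np.argsort(probs[t])[-top_k:]       # indices of the top-k experts
        weights = probs[t, top] / probs[t, top].sum()  # renormalize over top-k
        for e, w in zip(top, weights):
            out[t] += w * experts[e](hidden[t])   # weighted sum of expert outputs
    return out

# Toy usage: 4 tokens, d_model=8, 4 experts implemented as random linear maps
rng = np.random.default_rng(0)
d, n_exp = 8, 4
experts = [(lambda W: (lambda x: x @ W))(rng.normal(size=(d, d)))
           for _ in range(n_exp)]
router_w = rng.normal(size=(d, n_exp))
tokens = rng.normal(size=(4, d))
y = moe_layer(tokens, router_w, experts)
print(y.shape)  # output has the same shape as the input hidden states
```

Production implementations vectorize the dispatch and fuse the expert computations, but the per-token loop makes the routing logic easy to follow.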

Why MoE Makes Large Models Practical

The computational cost of running a transformer scales roughly linearly with the number of active parameters per token. A dense 72B model touches all 72B parameters for every token. An MoE model with 72B total parameters but only 14B active per token requires roughly the same compute as a 14B dense model.
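The arithmetic is easy to check with the common rule of thumb that a forward pass costs roughly 2 FLOPs per active parameter per token (an approximation that ignores attention and other overheads):

```python
# Rough compute-per-token estimate using the ~2 * N FLOPs rule of thumb
# for a forward pass with N active parameters (an approximation).
def flops_per_token(active_params):
    return 2 * active_params

dense_72b = flops_per_token(72e9)   # dense: all 72B params active
moe_14b = flops_per_token(14e9)     # MoE: only 14B active per token
print(f"dense 72B: {dense_72b:.1e} FLOPs/token")
print(f"MoE (14B active): {moe_14b:.1e} FLOPs/token")
print(f"compute reduction: {1 - moe_14b / dense_72b:.0%}")
```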

This has three practical benefits:

Faster inference. Because fewer parameters activate per token, MoE models generate tokens faster than equivalently sized dense models. Qwen 3.5 (72B MoE, 14B active) generates tokens at roughly the same speed as a dense 14B model on the same hardware.

Lower memory bandwidth requirements. At typical batch sizes, autoregressive LLM inference is bottlenecked by memory bandwidth rather than compute. MoE models read fewer parameters from memory per token, which directly translates to higher throughput on bandwidth-limited GPUs.

Better performance per compute unit. A 72B MoE model trained on the same data as a 14B dense model typically outperforms the dense model by 5-10 points on standard benchmarks, because the MoE model can store and retrieve more knowledge from its full 72B parameter set.

“MoE is not a free lunch. You still need the memory to store all 72B parameters. But the compute per token drops dramatically, which is what matters for inference cost and latency.” — Research scientist at a model development lab.

The Load Balancing Problem

The biggest challenge in MoE training is load balancing. If the router sends most tokens to a few “popular” experts and ignores the rest, those unused experts waste memory without contributing to model quality. Early MoE models suffered from expert collapse, where a handful of experts handled nearly all traffic.

Modern MoE models solve this with auxiliary loss functions that penalize uneven expert usage during training. The loss encourages the router to distribute tokens roughly equally across all experts. Qwen 3.5 uses a combination of load balancing loss and expert capacity limits that cap the maximum number of tokens any single expert can process per batch.
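One widely used form of this auxiliary loss (popularized by the Switch Transformer) multiplies, for each expert, the fraction of tokens dispatched to it by the mean router probability it receives; the sketch below assumes that form, and specific models' losses may differ:

```python
# Sketch of a Switch-Transformer-style load balancing auxiliary loss
# (assumed form; individual models' exact losses may differ).
import numpy as np

def load_balancing_loss(router_probs, expert_assignment, n_experts):
    """router_probs: (tokens, n_experts) softmax outputs;
    expert_assignment: (tokens,) top-1 expert index per token."""
    # f_i: fraction of tokens dispatched to expert i
    f = np.bincount(expert_assignment, minlength=n_experts) / len(expert_assignment)
    # P_i: mean router probability assigned to expert i
    P = router_probs.mean(axis=0)
    # Scaled so the loss equals 1.0 when usage is perfectly uniform
    return n_experts * float(np.dot(f, P))

# Perfectly balanced case: uniform probs and round-robin assignment
n_tokens, n_exp = 8, 4
probs = np.full((n_tokens, n_exp), 1 / n_exp)
assign = np.arange(n_tokens) % n_exp
balanced = load_balancing_loss(probs, assign, n_exp)
print(balanced)  # minimum value, reached at perfect balance
```

Because both factors grow when an expert is over-used, the product penalizes concentration, and its gradient pushes the router toward uniform dispatch.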

The trade-off is that forced load balancing can reduce routing quality. If the best expert for a given token is already at capacity, the router sends it to a less specialized expert. Current models handle this well enough that the quality impact is minimal, but it remains an active area of research.

MoE Deployment Challenges

Running MoE models in production introduces challenges that dense models do not have.

Memory Requirements

All expert parameters must be in GPU memory, even though only a fraction are used per token. A 72B MoE model needs the same memory as a 72B dense model (about 144GB in BF16) despite using only 14B parameters per forward pass. Quantization helps: the same model in 4-bit GPTQ fits in about 40GB.
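The memory figures follow directly from parameter count and precision; this back-of-envelope calculator covers weights only and ignores KV cache, activations, and quantization overhead (which is why the 4-bit figure in practice lands nearer 40GB than the raw 36GB):

```python
# Back-of-envelope GPU memory estimate for storing model weights
# (weights only; ignores KV cache, activations, and format overhead).
def weight_memory_gb(params, bits_per_param):
    return params * bits_per_param / 8 / 1e9

bf16 = weight_memory_gb(72e9, 16)   # BF16: 2 bytes per parameter
int4 = weight_memory_gb(72e9, 4)    # 4-bit: half a byte per parameter
print(f"72B in BF16: {bf16:.0f} GB")
print(f"72B in 4-bit: {int4:.0f} GB (before quantization overhead)")
```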

Throughput Variability

MoE routing is inherently less predictable than dense model computation. Different batches of tokens may route to different experts, creating uneven GPU utilization. Inference engines like vLLM and TensorRT-LLM have added MoE-specific optimizations to reduce this variability, but throughput is still 10-15% less consistent than dense models.

Expert Parallelism

When running MoE on multiple GPUs, experts can be distributed across devices. This adds communication overhead because tokens must be routed to the GPU hosting the selected expert. NVLink and high-speed interconnects reduce this overhead, but it is still a factor in multi-GPU deployments.
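Conceptually, expert parallelism groups each batch's tokens by the device that hosts their selected expert before exchanging them; the sketch below shows only that grouping step (real systems use fused all-to-all collectives, and the function and parameter names here are illustrative):

```python
# Conceptual sketch of expert-parallel dispatch: group tokens by the
# device hosting their selected expert. Real systems use fused all-to-all
# collectives; the names here are illustrative assumptions.
from collections import defaultdict

def dispatch(token_ids, expert_choice, experts_per_device):
    """Return {device_id: [token ids]} for the tokens each device receives."""
    by_device = defaultdict(list)
    for tok, exp in zip(token_ids, expert_choice):
        by_device[exp // experts_per_device].append(tok)
    return dict(by_device)

# 8 tokens routed among 8 experts spread evenly over 2 devices
sends = dispatch(range(8), [0, 5, 3, 7, 1, 6, 2, 4], experts_per_device=4)
print(sends)  # which tokens cross the interconnect to each device
```

Every token that lands on a remote device incurs a transfer over the interconnect, which is why skewed routing (many tokens to one device's experts) hurts multi-GPU throughput as well as load balance.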

The 2026 MoE Landscape

The trend is clear: MoE is becoming the default architecture for frontier models. Here is how the current leaders compare:

  1. GPT-5.4: Rumored to use MoE with undisclosed expert count. Performance suggests a very large total parameter count with efficient routing.
  2. Qwen 3.5: 72B total, 14B active, 64 experts with top-8 routing. Open weights, Apache 2.0 license.
  3. DeepSeek-V4: 236B total, 21B active. One of the largest open MoE models with strong multilingual performance.
  4. Mixtral 8x22B: 141B total, 39B active. Mistral’s open MoE offering, strong for European languages.

Should You Choose MoE or Dense Models?

For inference-heavy applications where you need the best quality per compute dollar, MoE models are the clear choice. The performance-to-cost ratio is significantly better than dense models at the same scale.

For fine-tuning, dense models are simpler to work with. MoE fine-tuning requires careful handling of expert routing to avoid destabilizing the load balance. LoRA fine-tuning on MoE models is well-supported by current tooling, but full fine-tuning of MoE models remains complex.

For most production applications in 2026, MoE is the winning architecture. The inference cost advantage is too large to ignore.