System case study / 2025
Sparse MoE Scaling
Top-k routing, z-loss load balancing, and expert collapse analysis
reproduced
PyTorch / FSDP / Triton / Weights & Biases
MoE / Distributed Training / Routing / Scaling Studies
Throughput
2.3x
Sparse versus dense matched-FLOP baseline.
Experts
8
Top-2 sparse routing.
Active params
125M
Active parameter budget per token.
Motivation
Reproduce the practical failure modes of sparse MoE training, especially expert-utilization collapse, under a controlled matched-FLOP benchmark.
Design Constraints
- Use sparse top-k routing with k=2 and 8 experts.
- Match active compute against dense baselines.
- Log routing entropy and expert load over time.
- Keep runs reproducible through W&B artifacts.
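The routing constraint above (k=2 over 8 experts) can be sketched for a single token as follows. This is a minimal, framework-free illustration rather than the project's PyTorch code: the function names, the example logits, and the choice to renormalize the two selected probabilities so they sum to 1 are all assumptions.

```python
import math

NUM_EXPERTS = 8  # from the design constraints
TOP_K = 2        # from the design constraints


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k=TOP_K):
    """Return [(expert_index, gate_weight), ...] for one token.

    Gate weights are the top-k softmax probabilities renormalized to sum
    to 1. Renormalization is an assumption; some implementations keep the
    raw probabilities instead.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in top)
    return [(i, probs[i] / mass) for i in top]


# Hypothetical router output for one token, one logit per expert.
logits = [0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.5, 0.3]
assignment = route_top_k(logits)
```

Here `assignment` holds experts 1 and 4, the two largest logits, with the larger gate weight on expert 1. Only those two experts run for this token, which is what keeps active compute below total compute.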
System Architecture
- Sparse MoE layer with ST-MoE-style router z-loss.
- 125M active parameters in a 1B-compute-equivalent model.
- FSDP training with expert routing traces.
- Dense matched-FLOP baseline for throughput comparison.
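The z-loss term in the architecture above is not spelled out in this write-up. A minimal sketch, assuming the commonly used router z-loss form (the mean over tokens of the squared log-sum-exp of the router logits), looks like this. The function names and inputs are hypothetical, and the coefficient that scales this term in the total loss is the value being swept, so it is left out.

```python
import math


def logsumexp(logits):
    """Numerically stable log(sum(exp(x))) for one token's router logits."""
    peak = max(logits)
    return peak + math.log(sum(math.exp(x - peak) for x in logits))


def router_z_loss(batch_logits):
    """Mean squared log-sum-exp over a batch of per-token router logits.

    Assumed form; the penalty grows as logits grow in magnitude, which
    discourages the router from producing very large values.
    """
    return sum(logsumexp(row) ** 2 for row in batch_logits) / len(batch_logits)


# Hypothetical batches of one token each, 8 experts per token.
small_logits = [[0.0] * 8]
large_logits = [[10.0] * 8]

loss_small = router_z_loss(small_logits)
loss_large = router_z_loss(large_logits)
```

Both batches produce the same uniform softmax, yet `loss_large` is much higher than `loss_small`. The term penalizes logit scale rather than the routing distribution itself, so it is usually added alongside the main loss with a small coefficient.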
Performance Bottlenecks
- Expert imbalance under insufficient z-loss.
- All-to-all communication sensitivity in distributed settings.
- Router instability early in training.
- Underutilization when top-k probabilities collapse.
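Expert imbalance and underutilization can be made concrete with a per-expert load fraction. The sketch below is an illustration under assumptions: the `expert_load` and `imbalance` helpers and the two toy batches are hypothetical, and load is defined here as each expert's share of routed token slots.

```python
from collections import Counter


def expert_load(assignments, num_experts=8):
    """Fraction of routed token slots that each expert receives.

    `assignments` holds one tuple of chosen expert indices per token
    and is assumed to be non-empty.
    """
    counts = Counter(e for token in assignments for e in token)
    slots = sum(counts.values())
    return [counts.get(e, 0) / slots for e in range(num_experts)]


def imbalance(load):
    """Largest load divided by the uniform share; 1.0 is perfectly balanced."""
    return max(load) * len(load)


# Hypothetical top-2 assignments for four tokens.
balanced = [(0, 1), (2, 3), (4, 5), (6, 7)]
collapsed = [(0, 1), (0, 1), (0, 1), (0, 2)]

balanced_ratio = imbalance(expert_load(balanced))
collapsed_ratio = imbalance(expert_load(collapsed))
idle_experts = sum(1 for share in expert_load(collapsed) if share == 0)
```

In the collapsed batch, expert 0 takes half of all slots and five experts receive no tokens. That pattern is what a load histogram shows when top-k probabilities collapse.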
Optimization Decisions
- Sweep the z-loss coefficient to locate the threshold below which expert utilization collapses.
- Plot routing entropy alongside throughput and validation loss.
- Separate active parameter count from total parameter count in all reporting.
- Open-source run configs and logs.
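The distinction between active and total parameter count reduces to simple arithmetic. The sketch below assumes a model split into shared (always-executed) parameters and identically sized experts. The sizes passed in are placeholder values chosen for readability, not the configuration used in this study.

```python
def parameter_counts(shared, per_expert, num_experts=8, top_k=2):
    """Return (active, total) parameter counts for a sparse MoE model.

    shared     -- parameters every token passes through (attention, embeddings, ...)
    per_expert -- parameters in one expert, summed over all MoE layers
    """
    total = shared + num_experts * per_expert
    active = shared + top_k * per_expert
    return active, total


# Placeholder sizes for illustration only.
active, total = parameter_counts(shared=50_000_000, per_expert=10_000_000)
```

With these placeholder sizes, `active` is 70M while `total` is 130M. Reporting only one of the two would misstate either the per-token compute or the memory footprint.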
Benchmark Methodology
- Compared sparse model throughput against dense matched-FLOP baseline.
- Logged expert load histograms across training.
- Measured routing entropy decay and collapse thresholds.
- Repeated runs under multiple z-loss settings.
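One way to compute the routing entropy logged above is sketched here. It assumes "routing entropy" means the per-token Shannon entropy of the router's softmax distribution, averaged over a batch and normalized by log(number of experts) so the value lies in [0, 1]. The write-up does not state which definition was used, and the helper names are hypothetical.

```python
import math


def normalized_entropy(probs):
    """Shannon entropy of one router distribution, scaled to [0, 1]."""
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(len(probs))


def mean_routing_entropy(batch_probs):
    """Average normalized entropy over a batch of per-token distributions."""
    return sum(normalized_entropy(p) for p in batch_probs) / len(batch_probs)


# Hypothetical router distributions over 8 experts.
uniform = [1 / 8] * 8           # router has no preference
peaked = [0.93] + [0.01] * 7    # router is nearly one-hot

healthy = mean_routing_entropy([uniform, uniform])
decayed = mean_routing_entropy([peaked, peaked])
```

A value near 1 means the router spreads probability across experts, and a value falling toward 0 means it has become nearly one-hot. Per-token entropy measures router confidence rather than balance, so the entropy of the batch-averaged distribution is a useful companion metric for detecting collapse onto a few experts.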
Results
- Reached 2.3x throughput versus dense baseline at matched FLOP budget.
- Identified z-loss thresholds that prevent expert utilization collapse.
- Produced reusable routing entropy curves for diagnostics.
Lessons Learned
- MoE throughput wins are inseparable from router health.
- Routing entropy should be a first-class training metric.
- Load-balancing losses can stabilize experts while still harming specialization if over-applied.