Mani Pal

Engineer-researcher

LLM systems, CUDA kernels, inference optimization, compression, interpretability, and distributed AI infrastructure.

System case study / 2025

Sparse MoE Scaling

Top-k routing, z-loss load balancing, and expert collapse analysis

reproduced

PyTorch / FSDP / Triton / Weights & Biases

MoE / Distributed Training / Routing / Scaling Studies

Throughput

2.3x

Sparse versus dense matched-FLOP baseline.

Experts

8

Top-2 sparse routing.

Active params

125M

Active parameter budget per token.

Motivation

Reproduce the practical failure modes of sparse MoE training, especially expert-utilization collapse, under a controlled matched-FLOP benchmark.

Design Constraints

  • Use sparse top-k routing with k=2 and 8 experts.
  • Match active compute against dense baselines.
  • Log routing entropy and expert load over time.
  • Keep runs reproducible through W&B artifacts.
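
The top-2 routing constraint above can be sketched for a single token as follows. This is a minimal plain-Python illustration, not the case study's PyTorch code: the helper names `softmax` and `top_k_route`, the example logits, and the choice to renormalize the two selected gate weights so they sum to 1 are all assumptions made for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one token's router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(logits, k=2):
    """Return [(expert_index, gate_weight), ...] for one token.

    Assumption: the k selected gate weights are renormalized to sum to 1.
    Some MoE implementations skip this step.
    """
    probs = softmax(logits)
    # Indices of the k largest router probabilities, highest first.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

# Hypothetical router logits for one token over 8 experts.
logits = [0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.5, 0.3]
routes = top_k_route(logits, k=2)
```

In a batched PyTorch setting, the same selection would typically be done with `torch.softmax` followed by `torch.topk` over the expert dimension.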

System Architecture

  • Sparse MoE layer with an ST-MoE-style router z-loss.
  • 125M active parameters in a 1B-compute-equivalent model.
  • FSDP training with expert routing traces.
  • Dense matched-FLOP baseline for throughput comparison.
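
For reference, the router z-loss as it is usually defined is the mean squared log-sum-exp of the router logits over the tokens in a batch. The sketch below assumes that standard definition; the function name `router_z_loss` and the example inputs are hypothetical, and the exact variant and coefficient used in this project are not stated here.

```python
import math

def router_z_loss(batch_logits):
    """Mean over tokens of (log-sum-exp of router logits) squared.

    batch_logits: list of per-token lists of router logits.
    """
    total = 0.0
    for logits in batch_logits:
        m = max(logits)
        # Stable log-sum-exp: subtract the max before exponentiating.
        lse = m + math.log(sum(math.exp(x - m) for x in logits))
        total += lse ** 2
    return total / len(batch_logits)

# Hypothetical inputs: one token with all-zero logits over 8 experts,
# and the same token with every logit shifted up by 10.
small = router_z_loss([[0.0] * 8])
large = router_z_loss([[10.0] * 8])
```

Under this definition the loss grows with logit magnitude, so it is normally scaled by a small coefficient before being added to the training loss.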

Performance Bottlenecks

  • Expert imbalance under insufficient z-loss.
  • All-to-all communication sensitivity in distributed settings.
  • Router instability early in training.
  • Underutilization when top-k probabilities collapse.
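
One way to quantify the imbalance and underutilization described above is to compute the fraction of routed token slots each expert receives and compare the busiest expert against a uniform split. The helpers `expert_load` and `imbalance` and the four-token assignment list are illustrative assumptions, not taken from the project's logs.

```python
from collections import Counter

def expert_load(assignments, num_experts):
    """Fraction of routed token slots handled by each expert.

    assignments: one tuple of selected expert indices per token.
    """
    counts = Counter(e for token in assignments for e in token)
    total = sum(counts.values())
    return [counts.get(e, 0) / total for e in range(num_experts)]

def imbalance(load):
    """Busiest expert's load relative to the uniform share (1.0 = balanced)."""
    return max(load) * len(load)

# Hypothetical top-2 assignments for 4 tokens over 8 experts.
assignments = [(0, 1), (0, 1), (0, 2), (0, 1)]
load = expert_load(assignments, num_experts=8)
unused = [e for e, frac in enumerate(load) if frac == 0.0]
```

An imbalance ratio that climbs over training while the `unused` list grows is the kind of signal that would indicate the router is collapsing onto a few experts.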

Optimization Decisions

  • Sweep the z-loss coefficient to locate the thresholds at which expert utilization collapses.
  • Plot routing entropy alongside throughput and validation loss.
  • Separate active parameter count from total parameter count in all reporting.
  • Open-source run configs and logs.
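
The distinction between active and total parameter counts can be made concrete with a small calculation for one MoE feed-forward block. The dimensions below (`d_model=768`, `d_ff=3072`) are hypothetical and are not the case study's configuration; the sketch also assumes a two-matrix feed-forward expert with biases ignored.

```python
def moe_ffn_params(d_model, d_ff, num_experts, top_k):
    """Return (total, active) parameter counts for one MoE FFN block."""
    per_expert = 2 * d_model * d_ff   # up and down projections, no biases
    router = d_model * num_experts    # linear router, always evaluated
    total = num_experts * per_expert + router
    active = top_k * per_expert + router  # parameters touched per token
    return total, active

# Hypothetical block: 8 experts, top-2 routing.
total, active = moe_ffn_params(d_model=768, d_ff=3072, num_experts=8, top_k=2)
ratio = total / active
```

Reporting both numbers side by side keeps a sparse-versus-dense comparison honest: per-token compute follows `active`, while memory footprint follows `total`.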

Benchmark Methodology

  • Compared sparse model throughput against dense matched-FLOP baseline.
  • Logged expert load histograms across training.
  • Measured routing entropy decay and collapse thresholds.
  • Repeated runs under multiple z-loss settings.
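
Routing entropy can be computed from any distribution over experts, for example the batch-averaged router probabilities or the expert load fractions. Which of these the project logged is not stated, so the sketch below is only one plausible formulation; `routing_entropy` and `normalized_entropy` are hypothetical helper names.

```python
import math

def routing_entropy(probs):
    """Shannon entropy (in nats) of a distribution over experts."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def normalized_entropy(probs):
    """Entropy divided by its maximum, log(num_experts), giving a value in [0, 1]."""
    return routing_entropy(probs) / math.log(len(probs))

# Hypothetical distributions over 8 experts.
uniform = [1.0 / 8] * 8              # every expert equally used
collapsed = [1.0] + [0.0] * 7        # all traffic on one expert
h_uniform = normalized_entropy(uniform)
h_collapsed = normalized_entropy(collapsed)
```

Normalizing by `log(num_experts)` makes curves comparable across runs with different expert counts: values near 1 indicate even usage, and values drifting toward 0 indicate collapse.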

Results

  • Reached 2.3x throughput versus the dense baseline at a matched-FLOP budget.
  • Identified z-loss thresholds that prevent expert utilization collapse.
  • Produced reusable routing entropy curves for diagnostics.

Lessons Learned

  • MoE throughput wins are inseparable from router health.
  • Routing entropy should be a first-class training metric.
  • Load-balancing losses can stabilize experts while still harming specialization if over-applied.