Engineer-researcher

Mani Pal

LLM systems, CUDA kernels, inference optimization, compression, interpretability, and distributed AI infrastructure.

Available for contract work / Email: palmani2410@gmail.com / Delhi, India / GitHub / LinkedIn

System case study / 2025

Sparse MoE Scaling

Top-k routing, z-loss load balancing, and expert collapse analysis

Reproduced

PyTorch / FSDP / Triton / Weights & Biases

MoE / Distributed Training / Routing / Scaling Studies

Throughput

2.3x

Sparse versus dense matched-FLOP baseline.

Experts

8

Top-2 sparse routing.

Active params

125M

Active parameter budget per token.

Motivation

Reproduce the practical failure modes of sparse MoE training, especially expert-utilization collapse, under a controlled matched-FLOP benchmark.

Design Constraints

Use sparse top-k routing with k=2 and 8 experts.
Match active compute against dense baselines.
Log routing entropy and expert load over time.
Keep runs reproducible through W&B artifacts.
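
The top-2 constraint above can be sketched for a single token as follows. This is a minimal, framework-free illustration, not the project's router: the logits are made-up values, `top_k_route` is a hypothetical helper, and renormalizing the two selected gates to sum to 1 is one common convention that the source does not confirm. In PyTorch the selection step would typically use `torch.topk`.

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(logits, k=2):
    """Return [(expert_index, gate_weight), ...] for one token (hypothetical helper)."""
    probs = softmax(logits)
    # Indices of the k largest router probabilities, highest first.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    # Assumption: renormalize so the selected gates sum to 1.
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

# Made-up router logits for one token over 8 experts.
logits = [0.1, 2.0, -1.0, 0.5, 1.5, -0.3, 0.0, 0.2]
routes = top_k_route(logits, k=2)
print(routes)
```

Only the two selected experts run for this token, which is why active compute stays fixed as the expert count grows.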

System Architecture

Sparse MoE layer with a router z-loss (introduced in ST-MoE, the follow-up to Switch Transformer).
125M active parameters in a 1B-compute-equivalent model.
FSDP training with expert routing traces.
Dense matched-FLOP baseline for throughput comparison.
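
A sketch of the z-loss term, assuming it takes the usual router z-loss form: the mean over tokens of the squared log-sum-exp of the router logits, scaled by a coefficient and added to the language-modeling loss. The exact loss used in these runs is not shown in the source, and `router_z_loss`, `total_loss`, and `z_coeff` are illustrative names. A PyTorch version would typically use `torch.logsumexp`.

```python
import math

def logsumexp(xs):
    # Stable log(sum(exp(x))) for one token's router logits.
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def router_z_loss(batch_logits):
    """Mean squared log-sum-exp over tokens; grows as router logits grow."""
    return sum(logsumexp(row) ** 2 for row in batch_logits) / len(batch_logits)

def total_loss(lm_loss, batch_logits, z_coeff):
    # z_coeff is the coefficient being swept; its value here is an assumption.
    return lm_loss + z_coeff * router_z_loss(batch_logits)

# Made-up logits: one calm token and one with a very confident router.
calm = [0.0] * 8
confident = [10.0] + [0.0] * 7
print(router_z_loss([calm]), router_z_loss([confident]))
print(total_loss(lm_loss=3.0, batch_logits=[calm, confident], z_coeff=1e-3))
```

The penalty depends only on logit magnitude, so it discourages the router from producing extreme logits rather than directly equalizing expert counts.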

Performance Bottlenecks

Expert imbalance under insufficient z-loss.
All-to-all communication sensitivity in distributed settings.
Router instability early in training.
Underutilization when top-k probabilities collapse.
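
Imbalance and underutilization can be quantified from routing traces along these lines. This is a sketch under assumptions: each trace entry is taken to be the tuple of expert indices chosen for one token, and the 1% floor for calling an expert "dead" is an arbitrary illustrative threshold, not a value from the runs.

```python
from collections import Counter

def expert_load(assignments, num_experts):
    """Fraction of routed token slots that landed on each expert."""
    counts = Counter(e for token in assignments for e in token)
    total = sum(counts.values())
    return [counts.get(e, 0) / total for e in range(num_experts)]

def imbalance(load):
    """Max load divided by the uniform share; 1.0 means perfectly balanced."""
    return max(load) * len(load)

def dead_experts(load, floor=0.01):
    # Assumption: an expert below `floor` of all slots counts as unused.
    return [e for e, frac in enumerate(load) if frac < floor]

# Made-up top-2 assignments for four tokens over 8 experts.
assignments = [(0, 1), (0, 1), (0, 2), (0, 1)]
load = expert_load(assignments, num_experts=8)
print(load, imbalance(load), dead_experts(load))
```

In this toy trace expert 0 takes half of all slots and five experts receive nothing, which is the pattern a collapsed router produces.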

Optimization Decisions

Sweep the z-loss coefficient to locate the thresholds where expert utilization collapses.
Plot routing entropy alongside throughput and validation loss.
Separate active parameter count from total parameter count in all reporting.
Open-source run configs and logs.
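
The active-versus-total distinction is simple arithmetic, sketched here with hypothetical sizes. The numbers below are chosen only to make the calculation easy to follow and are not the configuration behind the 125M figure; `moe_param_counts` is an illustrative helper that ignores the small router weights.

```python
def moe_param_counts(shared, per_expert, moe_layers, num_experts=8, top_k=2):
    """Return (total, active) parameter counts, ignoring router weights."""
    # Every expert is stored, so all of them count toward the total.
    total = shared + moe_layers * num_experts * per_expert
    # Only top_k experts per MoE layer run for any given token.
    active = shared + moe_layers * top_k * per_expert
    return total, active

# Hypothetical sizes: 50M shared, 4M per expert, 12 MoE layers.
total, active = moe_param_counts(
    shared=50_000_000, per_expert=4_000_000, moe_layers=12
)
print(f"total={total / 1e6:.0f}M active={active / 1e6:.0f}M ratio={total / active:.2f}")
```

Reporting both numbers keeps memory footprint (total) from being confused with per-token compute (active).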

Benchmark Methodology

Compared sparse-model throughput against a dense matched-FLOP baseline.
Logged expert load histograms across training.
Measured routing entropy decay and collapse thresholds.
Repeated runs under multiple z-loss settings.
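
One way to compute a routing entropy metric is sketched below. The source does not define the metric, so this assumes the normalized entropy of the batch-averaged router distribution: 1.0 when experts are used uniformly and 0.0 when all probability mass sits on a single expert. `routing_entropy` is a hypothetical name.

```python
import math

def entropy(p):
    # Shannon entropy in nats; zero-probability entries contribute nothing.
    return -sum(x * math.log(x) for x in p if x > 0)

def routing_entropy(batch_probs):
    """Normalized entropy of the batch-averaged router distribution, in [0, 1]."""
    num_experts = len(batch_probs[0])
    num_tokens = len(batch_probs)
    mean = [
        sum(row[j] for row in batch_probs) / num_tokens
        for j in range(num_experts)
    ]
    return entropy(mean) / math.log(num_experts)

# Made-up router probabilities over 8 experts.
healthy = [[1 / 8] * 8 for _ in range(4)]
collapsed = [[1.0] + [0.0] * 7 for _ in range(4)]
print(routing_entropy(healthy), routing_entropy(collapsed))
```

Logging this value per step gives a single curve whose decay toward zero signals collapse before it shows up in validation loss.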

Results

Reached 2.3x throughput versus the dense baseline at a matched FLOP budget.
Identified z-loss thresholds that prevent expert utilization collapse.
Produced reusable routing entropy curves for diagnostics.

Lessons Learned

MoE throughput wins are inseparable from router health.
Routing entropy should be a first-class training metric.
Load-balancing losses can stabilize experts while still harming specialization if over-applied.