Mani Pal

Engineer-researcher

LLM systems, CUDA kernels, inference optimization, compression, interpretability, and distributed AI infrastructure.

Experiments

Notebook: failed work, reproductions, logs, and open questions

This section is closer to an internal research notebook than a polished showcase. Negative results and reproducibility traces are preserved because they reveal engineering judgment.

Category

Failed Experiments

1 entry
failed · 2026-03

Non-Abelian Grokking Capacity Ceiling

Can the same one-layer transformer that groks abelian operations grok non-abelian groups under longer training?

Setup

  • Used the grokking benchmark architecture at d_model=64.
  • Extended non-abelian training horizons beyond the successful abelian window.
  • Tracked train accuracy, test accuracy, CKA, and Peter-Weyl signatures.
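As a minimal sketch of the kind of task involved, the snippet below builds the multiplication table of a small non-abelian group and splits it into train and held-out triples. The choice of the symmetric group S_3, the 50% split, and every function name are illustrative assumptions; the notebook entry does not state which group family or split was used.

```python
from itertools import permutations
import random


def compose(p, q):
    """Return the permutation that applies q first, then p."""
    return tuple(p[q[i]] for i in range(len(q)))


def cayley_table(n):
    """Multiplication table of the symmetric group S_n over integer element ids."""
    elems = list(permutations(range(n)))
    index = {e: i for i, e in enumerate(elems)}
    table = [[index[compose(a, b)] for b in elems] for a in elems]
    return elems, table


def is_abelian(table):
    """True when a*b == b*a for every pair of elements."""
    size = len(table)
    return all(table[i][j] == table[j][i] for i in range(size) for j in range(size))


def train_test_split(table, train_fraction, seed=0):
    """Shuffle all (a, b, a*b) triples and cut them into train and held-out sets."""
    size = len(table)
    triples = [(i, j, table[i][j]) for i in range(size) for j in range(size)]
    rng = random.Random(seed)
    rng.shuffle(triples)
    cut = int(train_fraction * len(triples))
    return triples[:cut], triples[cut:]


elems, table = cayley_table(3)  # S_3 is the smallest non-abelian group
train, test = train_test_split(table, train_fraction=0.5)
```

Checking `is_abelian(table)` before training is a cheap guard against accidentally benchmarking a commutative operation.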

Observations

  • Training accuracy saturated at 100%.
  • Test generalization remained stalled.
  • Representation traces showed partial irreducible representation formation without full algorithmic generalization.

Conclusion

Longer training alone did not cross the boundary. The failure is more likely constrained by capacity or representation geometry than by insufficient training time.

Next step

Scale width and depth independently while keeping the group family fixed.

Grokking · Interpretability · Capacity

Category

Reproduction Studies

1 entry
reproduced · 2025-08

KV Cache ID Normalization Tests

Can a unit-level reproduction catch prefill/decode request-ID mismatches before deployment hangs?

Setup

  • Created matched and mismatched request-ID fixtures.
  • Exercised the prefill-to-decode KV cache transfer boundary.
  • Asserted decode-side lookup correctness under normalization.

Observations

  • The mismatch reproduced the observed hang path.
  • Normalization made transfer semantics explicit.
  • Tests protected the boundary where the bug entered.

Conclusion

Distributed inference bugs need contract tests around identifiers and transfer semantics.
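A contract test of this kind could look roughly like the sketch below. `ToyKVStore`, `normalize_request_id`, and the `prefill-` / `decode-` prefix scheme are hypothetical stand-ins invented for illustration; they are not vLLM APIs and not the actual fixtures from this study.

```python
class ToyKVStore:
    """Hypothetical in-memory stand-in for a prefill-to-decode KV transfer buffer."""

    def __init__(self, normalize=None):
        self._blocks = {}
        # Identity by default, so raw ids are used as keys unless a normalizer is given.
        self._normalize = normalize or (lambda request_id: request_id)

    def put(self, request_id, kv_blocks):
        self._blocks[self._normalize(request_id)] = kv_blocks

    def get(self, request_id):
        # Returns None on a miss; in a real server a miss could leave decode waiting.
        return self._blocks.get(self._normalize(request_id))


def normalize_request_id(request_id):
    """Assumed normalization: trim whitespace and drop a role prefix."""
    rid = request_id.strip()
    for prefix in ("prefill-", "decode-"):
        if rid.startswith(prefix):
            rid = rid[len(prefix):]
    return rid


def test_mismatched_ids_miss_without_normalization():
    store = ToyKVStore()
    store.put("prefill-req-42", ["kv-block-0"])
    assert store.get("decode-req-42") is None


def test_mismatched_ids_hit_with_normalization():
    store = ToyKVStore(normalize=normalize_request_id)
    store.put("prefill-req-42", ["kv-block-0"])
    assert store.get("decode-req-42") == ["kv-block-0"]


test_mismatched_ids_miss_without_normalization()
test_mismatched_ids_hit_with_normalization()
```

The first test pins down the failure mode and the second pins down the fix, so a later change to either side of the transfer has to break a test before it can break a deployment.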

Next step

Extend tests to multi-node stress fixtures.

vLLM · KV Cache · Distributed Serving

Category

Benchmark Logs

1 entry
archived · 2025-11

Fixed-Gamma Speculative Decoding

Is a fixed draft lookahead window enough for CPU speculative decoding?

Setup

  • Paired a Qwen2.5-0.5B draft model with a Qwen2.5-3B verifier.
  • Benchmarked gamma values across code, JSON, and free-form text.
  • Recorded acceptance rate and verifier rollback frequency.
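To relate the recorded acceptance rate to the choice of gamma, a rough model is useful. The sketch below uses the standard closed form for the expected number of tokens emitted per verifier pass, under the simplifying assumption that each draft token is accepted independently with the same probability `alpha`. Real prompts violate that assumption, so treat it as a back-of-the-envelope guide rather than a fit to the logged data.

```python
def expected_tokens_per_verify(alpha, gamma):
    """Expected tokens emitted per verifier pass with gamma draft tokens.

    Assumes every draft token is accepted independently with probability alpha.
    The value is the geometric sum 1 + alpha + ... + alpha**gamma.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    if alpha == 1.0:
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


# Compare the payoff of a long draft window at low and high acceptance rates.
for alpha in (0.3, 0.9):
    print(alpha, [round(expected_tokens_per_verify(alpha, g), 2) for g in (2, 4, 8)])
```

At low acceptance rates the sum saturates after a few draft tokens, which is consistent with long windows paying off only on low-entropy outputs.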

Observations

  • Large gamma values worked well on low-entropy outputs.
  • High-entropy prompts caused rollback spikes.
  • Mean speedups were less stable than median speedups.

Conclusion

A fixed gamma makes the speedup brittle across prompt families: it helps on low-entropy outputs and triggers rollbacks on high-entropy ones.

Next step

Use adaptive gamma based on recent acceptance rate.
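One possible shape for such a controller is sketched below. The window size, the thresholds, and the grow-by-one / halve policy are all assumptions chosen for illustration, not values taken from these benchmarks.

```python
from collections import deque


class AdaptiveGamma:
    """Adjust the draft window from a moving average of recent acceptance rates."""

    def __init__(self, gamma=4, min_gamma=1, max_gamma=16, window=8, low=0.5, high=0.8):
        self.gamma = gamma
        self.min_gamma = min_gamma
        self.max_gamma = max_gamma
        self.low = low    # at or below this rate, shrink the window
        self.high = high  # at or above this rate, grow the window
        self._history = deque(maxlen=window)

    def update(self, accepted, proposed):
        """Record one verification step and return the gamma to use next."""
        if proposed <= 0:
            raise ValueError("proposed must be positive")
        self._history.append(accepted / proposed)
        rate = sum(self._history) / len(self._history)
        if rate >= self.high:
            # Grow slowly while the draft model keeps being accepted.
            self.gamma = min(self.max_gamma, self.gamma + 1)
        elif rate <= self.low:
            # Back off quickly to limit wasted draft work after rollbacks.
            self.gamma = max(self.min_gamma, self.gamma // 2)
        return self.gamma


controller = AdaptiveGamma(gamma=4)
gamma_after_hit = controller.update(accepted=4, proposed=4)
gamma_after_miss = controller.update(accepted=0, proposed=gamma_after_hit)
```

Growing additively and shrinking multiplicatively is a deliberate asymmetry: a rollback costs a full verifier pass, so the controller reacts faster to misses than to hits.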

Speculative Decoding · Inference · Benchmarking

Category

Scaling Studies

1 entry
reproduced · 2025-09

Sparse MoE Router Entropy

Can routing entropy predict expert collapse before validation loss exposes it?

Setup

  • Trained a top-2 MoE with eight experts.
  • Swept z-loss coefficients.
  • Logged per-expert token counts and entropy curves.
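The logged quantities can be computed along these lines. This is a sketch under two assumptions: entropy is taken over the empirical distribution of per-expert token counts, and the z-loss has its common form, the mean squared log-sum-exp of the router logits. The entry does not spell out either definition.

```python
import math


def routing_entropy(token_counts):
    """Shannon entropy (in nats) of the empirical expert-load distribution."""
    total = sum(token_counts)
    if total <= 0:
        raise ValueError("token_counts must contain at least one routed token")
    entropy = 0.0
    for count in token_counts:
        if count > 0:
            p = count / total
            entropy -= p * math.log(p)
    return entropy


def normalized_routing_entropy(token_counts):
    """Entropy divided by log(num_experts): 1.0 is uniform load, 0.0 is full collapse."""
    return routing_entropy(token_counts) / math.log(len(token_counts))


def router_z_loss(logits_per_token):
    """Mean over tokens of the squared log-sum-exp of that token's router logits."""
    total = 0.0
    for logits in logits_per_token:
        peak = max(logits)  # subtract the max for numerical stability
        lse = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += lse * lse
    return total / len(logits_per_token)


balanced = normalized_routing_entropy([100] * 8)
collapsed = normalized_routing_entropy([790, 10, 0, 0, 0, 0, 0, 0])
```

With top-2 routing each token contributes two counts, which the normalization by the total absorbs.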

Observations

  • Entropy collapse preceded throughput and loss degradation.
  • Moderate z-loss prevented early collapse.
  • Over-regularized routing reduced specialization.

Conclusion

Routing entropy is a leading diagnostic for MoE health.

Next step

Add entropy-triggered z-loss scheduling.
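A first cut at that schedule might be a simple hysteresis rule: raise the coefficient when normalized routing entropy falls below a floor, and relax it when entropy is high enough that extra regularization would only blunt specialization. The thresholds, the factor of two, and the coefficient bounds below are hypothetical placeholders, not tuned values.

```python
def next_z_loss_coeff(coeff, norm_entropy, low=0.6, high=0.9,
                      factor=2.0, min_coeff=1e-5, max_coeff=1e-1):
    """Return the z-loss coefficient for the next interval.

    norm_entropy is routing entropy divided by log(num_experts), so it lies in [0, 1].
    """
    if norm_entropy < low:
        # Routing is concentrating on few experts: regularize harder.
        return min(max_coeff, coeff * factor)
    if norm_entropy > high:
        # Routing is close to uniform: ease off to allow specialization.
        return max(min_coeff, coeff / factor)
    return coeff  # inside the dead band, leave the coefficient alone


coeff = 1e-3
for entropy in (0.95, 0.80, 0.55, 0.50):
    coeff = next_z_loss_coeff(coeff, entropy)
```

The dead band between `low` and `high` keeps the coefficient from oscillating when entropy hovers near a single threshold.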

MoE · Routing · z-loss

Category

Compression Studies

1 entry
reproduced · 2026-02

Uniform MPO Compression Collapse

Does a single global bond dimension preserve quality across all transformer layers?

Setup

  • Applied uniform chi schedules across all attention blocks.
  • Swept chi from 10 to 90.
  • Evaluated MMLU deltas before and after one epoch of healing.
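To see what the chi sweep means in parameter terms, the sketch below counts the parameters of a matrix product operator whose bond dimensions are capped at a uniform chi. The 4096 x 4096 matrix and its 8 x 8 x 8 x 8 factorization are illustrative assumptions; the entry does not state the actual layer shapes or site layout.

```python
from math import prod


def mpo_bond_dims(in_dims, out_dims, chi):
    """Bond dimensions for a uniform cap chi, clipped to the largest exact rank."""
    site = [i * o for i, o in zip(in_dims, out_dims)]
    bonds = [1]  # trivial bond on the left boundary
    for k in range(1, len(site)):
        left = prod(site[:k])
        right = prod(site[k:])
        bonds.append(min(chi, left, right))
    bonds.append(1)  # trivial bond on the right boundary
    return bonds


def mpo_param_count(in_dims, out_dims, chi):
    """Total entries across cores of shape (bond_left, in_k, out_k, bond_right)."""
    bonds = mpo_bond_dims(in_dims, out_dims, chi)
    return sum(
        bonds[k] * in_dims[k] * out_dims[k] * bonds[k + 1]
        for k in range(len(in_dims))
    )


dims = (8, 8, 8, 8)  # hypothetical factorization of a 4096 x 4096 weight matrix
dense_params = prod(dims) * prod(dims)
for chi in (10, 50, 90):
    params = mpo_param_count(dims, dims, chi)
    print(chi, params, round(dense_params / params, 1))
```

Because the interior cores scale with the square of chi, a block that tolerates chi=10 costs far fewer parameters than one that needs chi=50 or more.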

Observations

  • Early blocks collapsed below chi=50.
  • Terminal blocks tolerated chi=10 with small accuracy movement.
  • Uniform schedules wasted capacity on tolerant layers while damaging fragile layers.

Conclusion

Layer sensitivity is too uneven for uniform MPO schedules to be the final compression policy.

Next step

Use per-block policy search seeded from sensitivity profiles.
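One simple way to seed such a search is to map each block's measured sensitivity onto the swept chi range, so fragile blocks start high and tolerant blocks start low. The linear mapping and the sensitivity scores below are hypothetical; a real search would then refine this starting schedule against measured accuracy.

```python
def seed_chi_schedule(sensitivities, chi_min=10, chi_max=90):
    """Linearly map per-block sensitivity scores onto [chi_min, chi_max]."""
    lowest, highest = min(sensitivities), max(sensitivities)
    if highest == lowest:
        # No signal to separate the blocks: start everyone at the midpoint.
        return [(chi_min + chi_max) // 2] * len(sensitivities)
    span = chi_max - chi_min
    return [
        round(chi_min + (s - lowest) / (highest - lowest) * span)
        for s in sensitivities
    ]


# Hypothetical sensitivity profile: early blocks fragile, terminal blocks tolerant.
profile = [1.0, 0.5, 0.0]
schedule = seed_chi_schedule(profile)
```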

Tensor Networks · Compression · MMLU

Category

Interpretability Notes

0 entries

No public notes yet.

Category

Open Questions

0 entries

No public notes yet.

Category

Research Ideas

1 entry
draft · 2026-04

Activation Patching for Algebraic Circuits

Can causal patching separate memorized lookup behavior from algorithmic circuit behavior?

Setup

  • Patch candidate channels between grokked abelian runs and memorized non-abelian runs.
  • Intervene on embedding, attention output, and MLP residual stream locations.
  • Measure recovery of test generalization behavior under patched activations.
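The core mechanics of a patching harness can be sketched on a toy model. The example below is deliberately simplified: it patches between a clean and a corrupted input of a single two-stage network, whereas the plan above swaps activations between two separately trained runs. The model, the `hidden` site name, and the recovery metric are illustrative assumptions.

```python
def run_model(x, weights, patch=None):
    """Run a toy linear-ReLU-linear model and return (output, activations).

    patch maps a site name to a replacement activation vector for that site.
    """
    patch = patch or {}
    hidden = [sum(w * xi for w, xi in zip(row, x)) for row in weights["w1"]]
    hidden = [max(0.0, h) for h in hidden]
    hidden = patch.get("hidden", hidden)  # intervention point
    out = sum(w * h for w, h in zip(weights["w2"], hidden))
    return out, {"hidden": hidden}


def patching_recovery(clean_x, corrupt_x, weights, site, channels):
    """Fraction of the clean-vs-corrupt output gap restored by patching channels."""
    clean_out, clean_acts = run_model(clean_x, weights)
    corrupt_out, corrupt_acts = run_model(corrupt_x, weights)
    if clean_out == corrupt_out:
        raise ValueError("clean and corrupted outputs must differ")
    patched = list(corrupt_acts[site])
    for c in channels:
        patched[c] = clean_acts[site][c]  # copy the clean activation into the corrupt run
    patched_out, _ = run_model(corrupt_x, weights, patch={site: patched})
    return (patched_out - corrupt_out) / (clean_out - corrupt_out)


weights = {"w1": [[1.0, 0.0], [0.0, 1.0]], "w2": [1.0, 1.0]}
clean_x, corrupt_x = [2.0, 1.0], [0.0, 1.0]
recovery_ch0 = patching_recovery(clean_x, corrupt_x, weights, "hidden", [0])
recovery_ch1 = patching_recovery(clean_x, corrupt_x, weights, "hidden", [1])
```

A recovery near 1 marks a channel that carries the behavior of interest, and a recovery near 0 marks one that does not. The same normalization applies when the donor activations come from a different run.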

Observations

  • Not yet executed.
  • The current spectral signatures suggest likely intervention points.
  • Needs careful pair construction to avoid operation mismatch artifacts.

Conclusion

This is the next causal validation layer for the grokking study.

Next step

Build a patching harness over the existing algebraic benchmark.

Activation Patching · Grokking · Mechanistic Interpretability