Experiments

Notebook: failed work, reproductions, logs, and open questions

This section is closer to an internal research notebook than a polished showcase. Negative results and reproducibility traces are preserved because they reveal engineering judgment.

Category

Failed Experiments

1 entry

failed, 2026-03

Non-Abelian Grokking Capacity Ceiling

Can the same one-layer transformer that groks abelian operations grok non-abelian groups under longer training?

Setup

Used the grokking benchmark architecture at d_model=64.
Extended non-abelian training horizons beyond the successful abelian window.
Tracked train accuracy, test accuracy, CKA, and Peter-Weyl signatures.
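
For reference, one common way to compute the CKA metric tracked above is the linear variant. The sketch below is a minimal NumPy version under stated assumptions: the notebook does not say which CKA variant or which layers were compared, and the function name `linear_cka` and the random activations are illustrative only (the width of 64 simply mirrors d_model=64).

```python
import numpy as np

def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between two activation matrices of shape (n_examples, n_features)."""
    # Center each feature so the similarity ignores mean offsets.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    # ||Y^T X||_F^2 normalized by ||X^T X||_F * ||Y^T Y||_F.
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))

# Illustrative data only: random stand-ins for hidden activations.
rng = np.random.default_rng(0)
acts = rng.normal(size=(128, 64))
q, _ = np.linalg.qr(rng.normal(size=(64, 64)))  # random orthogonal matrix
rotated = acts @ q                              # same geometry, different basis
unrelated = rng.normal(size=(128, 64))

print("rotated:", linear_cka(acts, rotated))
print("unrelated:", linear_cka(acts, unrelated))
```

Linear CKA is invariant to orthogonal transforms and isotropic scaling of the features, so it compares representation geometry across checkpoints or runs without requiring the neurons to line up.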

Observations

Training accuracy saturated at 100%.
Test generalization remained stalled.
Representation traces showed partial irreducible representation formation without full algorithmic generalization.

Conclusion

Longer training alone did not produce generalization; the failure is more likely constrained by capacity or representation geometry than by training duration.

Next step

Scale width and depth independently while keeping the group family fixed.

Grokking, Interpretability, Capacity

Category

Reproduction Studies

1 entry

reproduced, 2025-08

KV Cache ID Normalization Tests

Can a unit-level reproduction catch prefill/decode request-ID mismatches before deployment hangs?

Setup

Created matched and mismatched request-ID fixtures.
Exercised the prefill-to-decode KV cache transfer boundary.
Asserted decode-side lookup correctness under normalization.
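
The fixture idea above can be sketched as a small contract test. Everything here is a hypothetical stand-in: the notebook does not give the actual ID formats or normalization rule, so the `cmpl-` prefix, the `-<index>` suffix, `normalize_request_id`, and `KVTransferStore` are assumptions for illustration and are not vLLM APIs.

```python
import re

def normalize_request_id(raw_id: str) -> str:
    """Hypothetical rule: drop a 'cmpl-' prefix and a trailing '-<index>' suffix."""
    core = raw_id.strip().lower()
    if core.startswith("cmpl-"):
        core = core[len("cmpl-"):]
    return re.sub(r"-\d+$", "", core)

class KVTransferStore:
    """Toy stand-in for the prefill-to-decode handoff; not a real vLLM class."""

    def __init__(self, normalize=None):
        # Without a normalizer, IDs are compared exactly as each side wrote them.
        self._normalize = normalize or (lambda rid: rid)
        self._blocks = {}

    def publish(self, request_id, kv_blocks):
        """Prefill side: register KV blocks under the request ID."""
        self._blocks[self._normalize(request_id)] = kv_blocks

    def lookup(self, request_id):
        """Decode side: return the blocks, or None if the ID is unknown."""
        return self._blocks.get(self._normalize(request_id))

def test_mismatched_ids_miss_without_normalization():
    store = KVTransferStore()
    store.publish("cmpl-abc123-0", ["block0"])
    # A miss here is the unit-level analogue of the decode side waiting forever.
    assert store.lookup("abc123") is None

def test_mismatched_ids_hit_with_normalization():
    store = KVTransferStore(normalize=normalize_request_id)
    store.publish("cmpl-abc123-0", ["block0"])
    assert store.lookup("abc123") == ["block0"]

test_mismatched_ids_miss_without_normalization()
test_mismatched_ids_hit_with_normalization()
```

Applying the same normalizer on both the publish and lookup paths is the point of the sketch: the contract lives in one function, and the tests pin the boundary where the two sides must agree.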

Observations

The mismatch reproduced the observed hang path.
Normalization made transfer semantics explicit.
Tests protected the boundary where the bug entered.

Conclusion

Distributed inference bugs need contract tests around identifiers and transfer semantics.

Next step

Extend tests to multi-node stress fixtures.

vLLM, KV Cache, Distributed Serving

Category

Benchmark Logs

1 entry

archived, 2025-11

Fixed-Gamma Speculative Decoding

Is a fixed draft lookahead window enough for CPU speculative decoding?

Setup

Paired Qwen2.5-0.5B draft with Qwen2.5-3B verifier.
Benchmarked gamma values across code, JSON, and free-form text.
Recorded acceptance rate and verifier rollback frequency.

Observations

Large gamma values worked on low-entropy outputs.
High-entropy prompts caused rollback spikes.
Mean speedups were less stable than median speedups.

Conclusion

Fixed gamma creates brittle prompt-family dependence.

Next step

Use adaptive gamma based on recent acceptance rate.
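
One possible shape for that controller is sketched below. It is an assumption-laden illustration, not the benchmarked implementation: the class name `AdaptiveGamma`, the window size, the thresholds, and the grow-by-one / halve policy are all hypothetical choices.

```python
from collections import deque

class AdaptiveGamma:
    """Adjust the draft lookahead from a rolling acceptance rate (illustrative thresholds)."""

    def __init__(self, gamma=4, min_gamma=1, max_gamma=16, window=8, low=0.4, high=0.8):
        self.gamma = gamma
        self.min_gamma = min_gamma
        self.max_gamma = max_gamma
        self.low = low
        self.high = high
        self._rates = deque(maxlen=window)  # acceptance rates of recent steps

    def update(self, accepted: int, proposed: int) -> int:
        """Record one verification step and return the gamma to use next."""
        if proposed <= 0 or not 0 <= accepted <= proposed:
            raise ValueError("need proposed > 0 and 0 <= accepted <= proposed")
        self._rates.append(accepted / proposed)
        rate = sum(self._rates) / len(self._rates)
        if rate >= self.high:
            # Drafts are mostly accepted: lengthen the window slowly.
            self.gamma = min(self.max_gamma, self.gamma + 1)
        elif rate <= self.low:
            # Rollbacks dominate: shrink quickly to limit wasted draft tokens.
            self.gamma = max(self.min_gamma, self.gamma // 2)
        return self.gamma

controller = AdaptiveGamma()
for accepted, proposed in [(4, 4), (5, 5), (1, 6), (0, 6)]:
    print(controller.update(accepted, proposed))
```

Growing additively and shrinking multiplicatively is one way to keep rollback spikes on high-entropy prompts short while still letting low-entropy outputs reach a long window.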

Speculative Decoding, Inference, Benchmarking

Category

Scaling Studies

1 entry

reproduced, 2025-09

Sparse MoE Router Entropy

Can routing entropy predict expert collapse before validation loss exposes it?

Setup

Trained a top-2 MoE with eight experts.
Swept z-loss coefficients.
Logged per-expert token counts and entropy curves.
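
As a minimal sketch of the entropy signal, the per-expert token counts can be turned into a Shannon entropy of the empirical routing distribution. The function names and the example counts are illustrative assumptions; the notebook does not say whether entropy was computed from counts or from router probabilities.

```python
import math

def routing_entropy(token_counts):
    """Shannon entropy (in nats) of the empirical expert-assignment distribution."""
    total = sum(token_counts)
    if total <= 0:
        raise ValueError("token_counts must contain at least one routed token")
    probs = [count / total for count in token_counts if count > 0]
    return -sum(p * math.log(p) for p in probs)

def normalized_routing_entropy(token_counts):
    """Entropy divided by log(num_experts): 1.0 is perfectly balanced, 0.0 is full collapse."""
    return routing_entropy(token_counts) / math.log(len(token_counts))

# Made-up counts for eight experts, for illustration only.
balanced = [500] * 8
skewed = [3000, 700, 100, 100, 50, 30, 10, 10]
collapsed = [4000, 0, 0, 0, 0, 0, 0, 0]

for name, counts in [("balanced", balanced), ("skewed", skewed), ("collapsed", collapsed)]:
    print(name, round(normalized_routing_entropy(counts), 3))
```

Normalizing by log(num_experts) keeps the curve comparable across expert counts, which helps if the same diagnostic is reused in later sweeps.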

Observations

Entropy collapse preceded throughput and loss degradation.
Moderate z-loss prevented early collapse.
Over-regularized routing reduced specialization.

Conclusion

Routing entropy is a leading diagnostic for MoE health.

Next step

Add entropy-triggered z-loss scheduling.
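
A rough sketch of what such a trigger could look like, assuming a normalized routing entropy in [0, 1] is already logged. The function name, both coefficients, and both thresholds are hypothetical; none of them come from the sweep.

```python
def scheduled_z_loss_coef(norm_entropy, currently_boosted=False,
                          base_coef=1e-3, boosted_coef=1e-2,
                          trigger=0.6, release=0.8):
    """Return (coefficient, boosted) using two thresholds so the schedule does not flap.

    All numeric defaults are placeholders chosen for illustration.
    """
    if currently_boosted:
        # Stay boosted until entropy has clearly recovered past the release level.
        boosted = norm_entropy < release
    else:
        # Boost only once entropy has dropped below the trigger level.
        boosted = norm_entropy < trigger
    return (boosted_coef if boosted else base_coef), boosted

# Replay a made-up entropy trace through the schedule.
boosted = False
for entropy in [0.95, 0.70, 0.55, 0.70, 0.85]:
    coef, boosted = scheduled_z_loss_coef(entropy, currently_boosted=boosted)
    print(entropy, coef, boosted)
```

The gap between the trigger and release thresholds is the design choice here: it keeps the stronger z-loss on only while routing is unhealthy, which is meant to avoid the over-regularized regime noted in the observations.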

MoE, Routing, z-loss

Category

Compression Studies

1 entry

reproduced, 2026-02

Uniform MPO Compression Collapse

Does a single global bond dimension preserve quality across all transformer layers?

Setup

Applied uniform chi schedules across all attention blocks.
Swept chi from 10 to 90.
Evaluated MMLU deltas before and after one epoch of healing.
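
To make the role of chi concrete, here is a minimal two-core MPO factorization of a single weight matrix via truncated SVD. This is a sketch under assumptions: the notebook does not state how many cores were used or how the weights were reshaped, and the 256x256 random matrix is a stand-in, not a trained attention weight.

```python
import numpy as np

def mpo_two_core(weight, out_dims, in_dims, chi):
    """Split a (o1*o2, i1*i2) matrix into two MPO cores joined by a bond of size <= chi."""
    o1, o2 = out_dims
    i1, i2 = in_dims
    # Group (o1, i1) against (o2, i2) so the SVD cuts the bond between the two cores.
    mat = weight.reshape(o1, o2, i1, i2).transpose(0, 2, 1, 3).reshape(o1 * i1, o2 * i2)
    u, s, vt = np.linalg.svd(mat, full_matrices=False)
    chi = min(chi, len(s))
    core_a = (u[:, :chi] * s[:chi]).reshape(o1, i1, chi)  # singular values absorbed left
    core_b = vt[:chi].reshape(chi, o2, i2)
    return core_a, core_b

def mpo_reconstruct(core_a, core_b):
    """Contract the bond index and restore the original (out, in) matrix layout."""
    o1, i1, _ = core_a.shape
    _, o2, i2 = core_b.shape
    full = np.einsum("aib,bcj->acij", core_a, core_b)
    return full.reshape(o1 * o2, i1 * i2)

def relative_error(weight, out_dims, in_dims, chi):
    """Frobenius-norm reconstruction error relative to the original weight."""
    approx = mpo_reconstruct(*mpo_two_core(weight, out_dims, in_dims, chi))
    return float(np.linalg.norm(weight - approx) / np.linalg.norm(weight))

# Illustrative stand-in weight; real layers will have very different spectra.
rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256))
for chi in (10, 50, 90):
    print(chi, relative_error(w, (16, 16), (16, 16), chi))
```

A uniform schedule passes the same `chi` to every block; a per-block schedule would pass a different `chi` per layer, which is where the sensitivity differences reported below would enter.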

Observations

Early blocks collapsed below chi=50.
Terminal blocks tolerated chi=10 with small accuracy movement.
Uniform schedules wasted capacity on tolerant layers while damaging fragile layers.

Conclusion

Layer sensitivity is too uneven for uniform MPO schedules to be the final compression policy.

Next step

Use per-block policy search seeded from sensitivity profiles.

Tensor Networks, Compression, MMLU

Category

Interpretability Notes

0 entries

No public notes yet.

Category

Open Questions

0 entries

No public notes yet.

Category

Research Ideas

1 entry

draft, 2026-04

Activation Patching for Algebraic Circuits

Can causal patching separate memorized lookup behavior from algorithmic circuit behavior?

Setup

Patch candidate channels between grokked abelian runs and memorized non-abelian runs.
Intervene on embedding, attention output, and MLP residual stream locations.
Measure recovery of test generalization behavior under patched activations.
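
Since nothing has been run yet, the following is only a toy sketch of the patching mechanics on a scalar "residual stream". The component functions, the `forward` and `recovery` helpers, and the inputs are all made up for illustration. It also patches between two inputs of one model, which is simpler than the cross-run patching proposed above, where the donor cache would come from a different trained model.

```python
def forward(x, components, patch=None):
    """Run components in order on a scalar residual stream.

    Each component sees (input, residual) and adds its output to the residual.
    `patch` maps a component name to a donor output that overrides the computed one.
    """
    patch = patch or {}
    cache = {}
    resid = 0.0
    for name, fn in components:
        out = patch[name] if name in patch else fn(x, resid)
        cache[name] = out
        resid = resid + out
    return resid, cache

def recovery(clean_out, corrupted_out, patched_out):
    """Fraction of the clean-vs-corrupted gap closed by the patch."""
    return (patched_out - corrupted_out) / (clean_out - corrupted_out)

# Toy stand-ins for embedding, attention output, and MLP output.
components = [
    ("embed", lambda x, resid: float(x)),
    ("attn", lambda x, resid: 2.0 * resid),
    ("mlp", lambda x, resid: resid + 1.0),
]

clean_out, clean_cache = forward(3, components)
corrupted_out, _ = forward(1, components)

for site in ("embed", "attn", "mlp"):
    patched_out, _ = forward(1, components, patch={site: clean_cache[site]})
    print(site, recovery(clean_out, corrupted_out, patched_out))
```

In this toy, patching the embedding restores the clean output entirely because everything downstream is recomputed from it, while patching later components recovers only part of the gap. A real harness would report the same recovery ratio per site, with test accuracy or logit difference in place of the scalar output.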

Observations

Not yet executed.
The current spectral signatures suggest likely intervention points.
Needs careful pair construction to avoid operation mismatch artifacts.

Conclusion

This is the next causal validation layer for the grokking study.

Next step

Build a patching harness over the existing algebraic benchmark.

Activation Patching, Grokking, Mechanistic Interpretability

Mani Pal
