Mani Pal

Engineer-researcher

LLM systems, CUDA kernels, inference optimization, compression, interpretability, and distributed AI infrastructure.

Zenodo / 2026

Grokking Beyond Addition: Circuit-Level Analysis of Algebraic Learning in Transformers

Capacity-dependent boundary between abelian and non-abelian grokking

Published / Mani Pal
Mechanistic Interpretability, Transformers, Grokking, Representation Geometry

Operations

8

Abelian fields, composite ring, and non-abelian groups.

Seeds

24

Three seeds for each of the eight algebraic operations, 24 training runs in total.

Mean CKA

0.90

High cross-operation embedding similarity.

Abstract

This work extends grokking analysis beyond modular addition to eight algebraic operations across abelian fields, a composite ring, and non-abelian groups. A controlled transformer setup isolates when memorized algorithms become reusable circuits and when representation complexity blocks generalization.

Problem Statement

Prior grokking studies often center on modular addition. The open question is whether small transformers discover reusable algebraic circuits across richer algebraic structure, and whether non-abelian operations fail because they lack signal or because model capacity is insufficient for the required representation.

Methodology

  • Trained one-layer transformers with d_model=64 across eight algebraic operations and three seeds per operation.
  • Compared abelian operations, a composite ring, and non-abelian groups with consistent optimizer, dataset, and training-fraction controls.
  • Measured CKA across embedding spaces, Fourier concentration before and after discrete-log re-indexing, and Peter-Weyl signatures in non-abelian models.
  • Derived formal complexity scores from character tables to test whether representation complexity predicts grokking order.
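The CKA comparison in the list above can be illustrated with a minimal sketch of linear CKA. This is not the paper's code: the linear (rather than kernel) variant, the 97-row embedding size, and the random test matrices are assumptions chosen only for illustration, and the sketch assumes both embedding matrices have the same number of rows.

```python
import numpy as np

def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between two representation matrices with matching rows."""
    # Center every feature column so constant offsets do not affect the score.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    # CKA = ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))

rng = np.random.default_rng(0)
emb_a = rng.normal(size=(97, 64))            # hypothetical: 97 tokens, d_model=64
rotation, _ = np.linalg.qr(rng.normal(size=(64, 64)))
emb_b = emb_a @ rotation                     # same geometry, rotated basis
emb_c = rng.normal(size=(97, 64))            # unrelated embedding

cka_rotated = linear_cka(emb_a, emb_b)
cka_unrelated = linear_cka(emb_a, emb_c)
print(f"rotated copy: {cka_rotated:.3f}, unrelated: {cka_unrelated:.3f}")
```

Linear CKA is invariant to orthogonal transformations, so the rotated copy scores 1 up to floating-point error. That property is what makes it suitable for comparing embeddings from independently trained models, whose bases are arbitrary.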

Experimental Design

  • Held architecture depth and width fixed so that differences in grokking reflect each operation's representational demands rather than model size.
  • Ran training-fraction, weight-decay, and dataset-size ablations.
  • Separated train memorization from test generalization and analyzed circuit formation after memorization.
  • Logged cross-operation embedding alignment across all 28 operation pairs.
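A rough sketch of how such a design could be set up follows. It builds full Cayley tables for one abelian and one non-abelian operation, splits them by training fraction, and enumerates the operation pairs. The modulus `P = 7`, the 0.5 training fraction, and the helper names are hypothetical and are not taken from the paper.

```python
import itertools
import random

def cayley_table(elements, op):
    """Return every (a, b, a*b) triple, with elements replaced by integer ids."""
    index = {e: i for i, e in enumerate(elements)}
    return [(index[a], index[b], index[op(a, b)]) for a in elements for b in elements]

def is_abelian(table):
    """Check a*b == b*a for every pair in the table."""
    lookup = {(a, b): c for a, b, c in table}
    return all(lookup[a, b] == lookup[b, a] for a, b in lookup)

def split(table, train_fraction, seed):
    """Shuffle with a fixed seed, then cut into train and test examples."""
    rows = list(table)
    random.Random(seed).shuffle(rows)
    cut = int(train_fraction * len(rows))
    return rows[:cut], rows[cut:]

P = 7  # hypothetical small modulus, chosen for readability
add_mod_p = cayley_table(list(range(P)), lambda a, b: (a + b) % P)

# S3 as permutations of (0, 1, 2); the product is composition, (a*b)(i) = a[b[i]].
s3_elements = list(itertools.permutations(range(3)))
s3 = cayley_table(s3_elements, lambda a, b: tuple(a[b[i]] for i in range(3)))

train, test = split(add_mod_p, train_fraction=0.5, seed=0)
pairs = list(itertools.combinations(range(8), 2))  # unordered pairs of 8 operations

print(is_abelian(add_mod_p), is_abelian(s3), len(train), len(test), len(pairs))
```

Eight operations give 8 choose 2 = 28 unordered pairs, which matches the pair count logged above.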

Results

  • All four abelian operations grokked to 100% test accuracy within 2,000 epochs.
  • All four non-abelian groups reached 100% training accuracy but failed to grok under the fixed capacity setting.
  • Discrete-log re-indexing improved multiplication Fourier concentration from 9.4% to 20.0%.
  • Peter-Weyl analysis recovered the dominant irreducible representation in all four non-grokked non-abelian cases.
  • CKA remained high across operation pairs, with a mean of 0.90 and the add-S3 pair reaching 0.97.
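The effect of discrete-log re-indexing can be sketched on synthetic data. For a prime p, the nonzero residues form a cyclic group, so writing a = g^k for a primitive root g turns multiplication mod p into addition of exponents mod p-1, and a Fourier basis over the exponents becomes the natural one. The concentration metric below (share of non-DC spectral power in the strongest frequency), the modulus `P = 13`, and the synthetic coordinate are assumptions for illustration; the paper's exact metric is not specified here, and the sketch does not reproduce its reported percentages.

```python
import numpy as np

def primitive_root(p: int) -> int:
    """Smallest generator of the multiplicative group mod an odd prime p (brute force)."""
    for g in range(2, p):
        if len({pow(g, k, p) for k in range(p - 1)}) == p - 1:
            return g
    raise ValueError("no primitive root found; is p an odd prime?")

def discrete_log_table(p: int) -> dict[int, int]:
    """Map each nonzero residue a to the exponent k with g**k % p == a."""
    g = primitive_root(p)
    return {pow(g, k, p): k for k in range(p - 1)}

def fourier_concentration(signal: np.ndarray, top_k: int = 1) -> float:
    """Share of non-DC spectral power held by the top_k strongest frequencies."""
    power = np.abs(np.fft.rfft(signal - signal.mean())) ** 2
    power = power[1:]  # drop the DC bin
    return float(np.sort(power)[-top_k:].sum() / power.sum())

P = 13  # hypothetical small prime
dlog = discrete_log_table(P)

# Multiplication of residues is addition of discrete logs mod P - 1.
assert all((dlog[a] + dlog[b]) % (P - 1) == dlog[a * b % P] for a in dlog for b in dlog)

# Synthetic embedding coordinate: one clean frequency in the log index.
residues = np.arange(1, P)
logs = np.array([dlog[int(a)] for a in residues])
coord_by_residue = np.cos(2 * np.pi * 2 * logs / (P - 1))  # stored in residue order

# Re-index: place each residue's value at the position given by its discrete log.
coord_by_log = np.empty(P - 1)
coord_by_log[logs] = coord_by_residue

raw = fourier_concentration(coord_by_residue)
reindexed = fourier_concentration(coord_by_log)
print(f"raw order: {raw:.2f}, after discrete-log re-indexing: {reindexed:.2f}")
```

In this toy case the coordinate is a pure cosine in the exponent, so re-indexing concentrates essentially all power in one frequency, while the raw residue order spreads it across several. Trained embeddings are noisier, so the gain would be expected to be smaller.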

Limitations

  • The study uses compact one-layer transformers, so capacity scaling remains open.
  • The algebraic set is broad enough to expose a boundary but not exhaustive.
  • Circuit evidence is geometric and spectral; direct causal patching is future work.

Future Directions

  • Scale width, depth, and dataset support for non-abelian groups.
  • Run activation patching on candidate representation channels.
  • Test whether sparse MoE experts separate irreducible representation families.
  • Release a reproducible benchmark harness for algebraic grokking circuits.

References

BibTeX

@misc{pal2026grokking,
  title={Grokking Beyond Addition: Circuit-Level Analysis of Algebraic Learning in Transformers},
  author={Pal, Mani},
  year={2026},
  publisher={Zenodo},
  doi={10.5281/zenodo.19256207}
}