Non-Abelian Grokking Capacity Ceiling
Can the same one-layer transformer that groks abelian operations grok non-abelian groups under longer training?
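To make the abelian versus non-abelian distinction concrete, here is a minimal sketch that builds Cayley tables for two groups of the same order and checks commutativity. The choice of Z_6 and S_3, and the helper names `cyclic_table`, `symmetric_table`, and `is_abelian`, are illustrative assumptions; the note does not say which groups were used.

```python
from itertools import permutations


def cyclic_table(n):
    """Cayley table of Z_n under addition mod n (abelian)."""
    return [[(a + b) % n for b in range(n)] for a in range(n)]


def symmetric_table(n):
    """Cayley table of S_n under composition (non-abelian for n >= 3)."""
    elems = list(permutations(range(n)))
    index = {p: i for i, p in enumerate(elems)}
    # (p * q)(x) = p(q(x)): apply q first, then p.
    return [
        [index[tuple(p[q[x]] for x in range(n))] for q in elems]
        for p in elems
    ]


def is_abelian(table):
    """True if table[a][b] == table[b][a] for every pair (a, b)."""
    n = len(table)
    return all(table[a][b] == table[b][a] for a in range(n) for b in range(n))


z6 = cyclic_table(6)     # 6 elements, commutative
s3 = symmetric_table(3)  # 6 elements, not commutative
print(is_abelian(z6), is_abelian(s3))  # prints: True False
```

Both tables have 36 entries, so the two tasks are the same size as datasets. They differ only in group structure.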
Setup
- Used the one-layer transformer from the grokking benchmark at d_model=64.
- Trained on non-abelian groups for longer than the horizon within which the abelian runs grokked.
- Tracked train accuracy, test accuracy, centered kernel alignment (CKA), and Peter-Weyl signatures.
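For reference, one common form of the CKA metric is linear CKA, sketched below with NumPy. This is an assumption about the variant: the note does not say which CKA kernel was used or which pairs of activations were compared. The function name `linear_cka` and the random demo data are hypothetical.

```python
import numpy as np


def linear_cka(x, y):
    """Linear CKA between two activation matrices.

    x: (n_examples, d_x) array, y: (n_examples, d_y) array.
    Rows of x and y must correspond to the same examples.
    """
    # Center each feature so the comparison ignores mean offsets.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, "fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, "fro")
    norm_y = np.linalg.norm(y.T @ y, "fro")
    return cross / (norm_x * norm_y)


# Demo on synthetic activations (not data from the experiment).
rng = np.random.default_rng(0)
acts_a = rng.normal(size=(50, 8))
rotation, _ = np.linalg.qr(rng.normal(size=(8, 8)))
acts_b = 2.0 * acts_a @ rotation  # rotated and rescaled copy of acts_a
score = linear_cka(acts_a, acts_b)
```

Linear CKA is invariant to orthogonal transformations and isotropic scaling, so `score` is 1 up to floating-point error. That invariance makes it suitable for comparing representations across checkpoints or across runs whose bases differ.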
Observations
- Training accuracy saturated at 100%.
- Test accuracy remained stalled; the delayed jump in generalization that characterizes grokking did not appear.
- Representation traces showed irreducible representations forming only partially, without full algorithmic generalization.
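One way to quantify partial irreducible-representation formation is to split the energy of an embedding table across the isotypic components of the regular representation. The sketch below does this for S_3, using the character projector P_rho = (d_rho / |G|) * sum_g chi_rho(g) L(g). It is only one possible reading of a "Peter-Weyl signature": the note does not define the measurement, S_3 is an illustrative group, and the function name `s3_isotypic_energy` is hypothetical.

```python
from itertools import permutations

import numpy as np


def s3_isotypic_energy(emb):
    """Fraction of embedding energy in each isotypic component of S_3.

    emb: (6, d) array; row i embeds the i-th element of
    itertools.permutations(range(3)).
    """
    elems = list(permutations(range(3)))
    index = {p: i for i, p in enumerate(elems)}

    def compose(p, q):
        # (p * q)(x) = p(q(x))
        return tuple(p[q[x]] for x in range(3))

    def sign(p):
        inversions = sum(p[i] > p[j] for i in range(3) for j in range(i + 1, 3))
        return -1 if inversions % 2 else 1

    def standard(p):
        # Character of the 2-dimensional standard representation.
        return sum(p[i] == i for i in range(3)) - 1

    # name -> (dimension, character); all characters of S_3 are real.
    irreps = {
        "trivial": (1, lambda p: 1),
        "sign": (1, sign),
        "standard": (2, standard),
    }
    total = np.sum(emb ** 2)
    energy = {}
    for name, (dim, chi) in irreps.items():
        proj = np.zeros((6, 6))
        for g in elems:
            # Left regular representation: L(g) maps basis vector e_h to e_{gh}.
            left = np.zeros((6, 6))
            for h in elems:
                left[index[compose(g, h)], index[h]] = 1.0
            proj += chi(g) * left
        proj *= dim / 6
        energy[name] = float(np.sum((proj @ emb) ** 2) / total)
    return energy


# Demo on a random table (not data from the experiment).
demo = s3_isotypic_energy(np.random.default_rng(0).normal(size=(6, 4)))
```

The three fractions sum to 1 because the projectors are orthogonal and sum to the identity. Under this reading, energy spread across components with no clean concentration would be consistent with partial formation.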
Conclusion
Longer training alone did not cross the boundary. The failure is more likely a capacity or representation-geometry constraint than a matter of optimizer patience.
Next step
Scale width and depth independently while keeping the group family fixed.
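A minimal sketch of such a sweep, varying one factor at a time from the baseline (d_model=64, one layer) so that a width effect can be separated from a depth effect. The specific widths, depths, the `"S3"` group label, and the names `RunConfig` and `independent_sweep` are hypothetical placeholders, not values from the experiment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunConfig:
    group: str     # group family, held fixed across the sweep
    d_model: int   # model width
    n_layers: int  # model depth


def independent_sweep(group, base_width=64, base_depth=1,
                      widths=(128, 256), depths=(2, 4)):
    """Baseline plus one arm per factor; no config changes both factors."""
    baseline = RunConfig(group, base_width, base_depth)
    width_arm = [RunConfig(group, w, base_depth) for w in widths]
    depth_arm = [RunConfig(group, base_width, d) for d in depths]
    return [baseline, *width_arm, *depth_arm]


configs = independent_sweep("S3")
for cfg in configs:
    print(cfg)
```

A full width-by-depth grid would also expose interactions between the two factors, at the cost of more runs. The one-factor-at-a-time layout is the cheaper first pass.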