Mani Pal

Engineer-researcher

LLM systems, CUDA kernels, inference optimization, compression, interpretability, and distributed AI infrastructure.

Independent Research / 2026

Adaptive Tensor-Network Compression of LLMs: An Extension of CompactifAI

Layer-sensitive MPO tensorization with policy-guided bond dimensions

Status: active · Author: Mani Pal
Model Compression, Tensor Networks, MPO, Quantization, LLM Evaluation

Memory reduction

93%

Best reproduced tensor-network compression setting.

Adaptive gain

+1.2%

Additional recovered accuracy over uniform schedules.

Benchmarks

5

MMLU, HellaSwag, BoolQ, TriviaQA, and GSM8K.

Abstract

This project reproduces and extends CompactifAI-style tensor-network compression on real open-weight LLMs. It profiles layer sensitivity, replaces uniform bond dimensions with adaptive schedules, and evaluates healing runs across standard language benchmarks.

Problem Statement

Uniform tensor-network compression treats transformer blocks as equally redundant, but LLMs show layer-specific fragility. The research question is whether adaptive bond-dimension assignment can preserve downstream quality at the same compression ratio.
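
One way to state the matched-compression question formally is sketched below. The notation is an assumption introduced here for illustration, not taken from the project: block b holds a weight matrix factorized into K MPO cores with output factors m_k, input factors n_k, and a single internal bond dimension chi_b. The count ignores the rank cap that sequential SVD imposes near the ends of the chain, so it is an upper bound.

```latex
% Assumed notation: \chi_{b,0} = \chi_{b,K} = 1 and \chi_{b,k} = \chi_b for 0 < k < K.
% Parameter count of the MPO for block b:
P_b(\chi_b) = \sum_{k=1}^{K} \chi_{b,k-1}\, m_k\, n_k\, \chi_{b,k}

% Adaptive assignment at matched compression: choose per-block bond dimensions
% that maximize downstream accuracy within the parameter budget of the
% uniform baseline, where every block uses the same \chi.
\max_{\chi_1,\dots,\chi_B} \ \mathrm{Acc}(\chi_1,\dots,\chi_B)
\quad \text{s.t.} \quad
\sum_{b=1}^{B} P_b(\chi_b) \;\le\; \sum_{b=1}^{B} P_b(\chi)
```

For a two-core split (K = 2) this reduces to P_b = chi_b (m_1 n_1 + m_2 n_2), which is why lowering chi in tolerant blocks frees budget for fragile ones.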

Methodology

  • Implemented Matrix Product Operator tensorization for self-attention and MLP matrices using sequential SVD.
  • Swept bond dimension chi from 10 to 90 independently across attention blocks and layer types.
  • Trained a REINFORCE policy to assign per-block bond dimensions using downstream MMLU accuracy as reward.
  • Combined adaptive MPO schedules with model healing and optional soft gating adapters.
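
The first step above, MPO tensorization by sequential SVD, can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the project's implementation: the function names `mpo_decompose` and `mpo_to_matrix`, the choice of factor shapes, and the single shared bond cap `chi` are all hypothetical.

```python
import numpy as np


def mpo_decompose(W, out_dims, in_dims, chi):
    """Split W into MPO cores by sequential truncated SVD.

    W has shape (prod(out_dims), prod(in_dims)). Each returned core has
    shape (left_bond, out_dims[k], in_dims[k], right_bond).
    """
    K = len(out_dims)
    assert len(in_dims) == K
    assert W.shape == (int(np.prod(out_dims)), int(np.prod(in_dims)))

    # Reshape to (m1..mK, n1..nK), then interleave to (m1, n1, m2, n2, ...).
    T = W.reshape(*out_dims, *in_dims)
    T = T.transpose([ax for k in range(K) for ax in (k, K + k)])

    cores, rest, r_prev = [], T, 1
    for k in range(K - 1):
        d = out_dims[k] * in_dims[k]
        mat = rest.reshape(r_prev * d, -1)
        U, S, Vt = np.linalg.svd(mat, full_matrices=False)
        r = min(chi, len(S))  # truncate to the bond dimension
        cores.append(U[:, :r].reshape(r_prev, out_dims[k], in_dims[k], r))
        rest = S[:r, None] * Vt[:r]  # carry the remainder to the next site
        r_prev = r
    cores.append(rest.reshape(r_prev, out_dims[-1], in_dims[-1], 1))
    return cores


def mpo_to_matrix(cores):
    """Contract MPO cores back into a dense matrix."""
    K = len(cores)
    T = cores[0]
    for core in cores[1:]:
        T = np.tensordot(T, core, axes=([-1], [0]))
    T = T.reshape(T.shape[1:-1])  # drop the trivial boundary bonds
    # Undo the interleaving: (m1, n1, m2, n2, ...) -> (m1..mK, n1..nK).
    T = T.transpose(list(range(0, 2 * K, 2)) + list(range(1, 2 * K, 2)))
    rows = int(np.prod(T.shape[:K]))
    cols = int(np.prod(T.shape[K:]))
    return T.reshape(rows, cols)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.standard_normal((16, 16))  # stand-in for a weight matrix
    cores = mpo_decompose(W, (4, 4), (4, 4), chi=4)
    W_hat = mpo_to_matrix(cores)
    rel_err = np.linalg.norm(W - W_hat) / np.linalg.norm(W)
    n_params = sum(c.size for c in cores)
    print(f"params: {W.size} -> {n_params}, relative error: {rel_err:.3f}")
```

With `chi` at or above the full rank of every intermediate matrix, the reconstruction is exact up to floating-point error; lowering `chi` trades reconstruction error for fewer parameters.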

Experimental Design

  • Reproduced baseline compression behavior on LLaMA-3.2-1B-style and Qwen2.5-1.5B-style targets.
  • Profiled 32 attention blocks and seven layer families before constructing a non-uniform compression schedule.
  • Ran one epoch of Alpaca-style healing after compression.
  • Evaluated with lm-evaluation-harness across MMLU, HellaSwag, BoolQ, TriviaQA, and GSM8K.

Results

  • Matched the original 93% memory-reduction target at 1B-scale reproduction settings.
  • Found that initial blocks collapse below chi=50, while terminal blocks tolerate chi=10 with an MMLU drop under 1%.
  • Adaptive policy recovered 1.2% additional accuracy at matched compression versus uniform chi baselines.
  • At 70% parameter reduction, downstream accuracy dropped by 2% to 3% after healing.
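
To illustrate how a policy-gradient search can arrive at a non-uniform schedule like the one compared against uniform chi baselines above, here is a toy REINFORCE loop. Everything in it is an assumption made for illustration: the reward is a synthetic proxy (the project used downstream MMLU accuracy), and the `sensitivity` values, the `budget`, and the function names are hypothetical.

```python
import numpy as np

CHI_CHOICES = np.array([10, 30, 50, 70, 90])  # candidate bond dimensions


def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def toy_reward(chis, sensitivity, budget):
    """Synthetic stand-in for downstream accuracy (NOT a real benchmark).

    Sensitive blocks lose more quality at low chi; exceeding the total
    bond-dimension budget is penalized.
    """
    quality_loss = np.sum(sensitivity / chis)
    over_budget = max(0.0, float(chis.sum()) - budget)
    return -quality_loss - 0.01 * over_budget


def train_policy(sensitivity, budget, steps=2000, lr=0.1, seed=0):
    """REINFORCE over independent per-block categorical policies."""
    rng = np.random.default_rng(seed)
    n_blocks = len(sensitivity)
    logits = np.zeros((n_blocks, len(CHI_CHOICES)))
    baseline = 0.0  # running-average baseline to reduce variance
    for _ in range(steps):
        probs = softmax(logits)
        actions = np.array(
            [rng.choice(len(CHI_CHOICES), p=p) for p in probs]
        )
        reward = toy_reward(CHI_CHOICES[actions], sensitivity, budget)
        advantage = reward - baseline
        baseline += 0.05 * (reward - baseline)
        # Gradient of log pi(a) w.r.t. logits: one_hot(a) - probs.
        grad_logp = -probs
        grad_logp[np.arange(n_blocks), actions] += 1.0
        logits += lr * advantage * grad_logp
    return CHI_CHOICES[softmax(logits).argmax(axis=1)]


if __name__ == "__main__":
    # Made-up sensitivities: early blocks fragile, late blocks tolerant.
    sensitivity = np.array([8.0, 4.0, 2.0, 1.0, 0.5, 0.25])
    schedule = train_policy(sensitivity, budget=240)
    print("per-block chi:", schedule)
```

In a real run each reward evaluation would require compressing the model and scoring it on a benchmark, which is where the search cost noted under Limitations comes from.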

Limitations

  • The reproduction is computationally verifiable only at smaller model scale; it is not a full 7B training campaign.
  • Policy search cost grows with block count and benchmark feedback latency.
  • Compression interacts with quantization and adapter healing in ways that need more isolation.

Future Directions

  • Replace REINFORCE with differentiable schedule search or bandit-style block allocation.
  • Profile attention heads and MLP projections separately instead of relying on block-level schedules.
  • Test adaptive tensorization under long-context inference and KV-cache pressure.
  • Publish per-layer sensitivity traces as reusable compression priors.

References

BibTeX

@misc{pal2026compactifai,
  title={Adaptive Tensor-Network Compression of LLMs: An Extension of CompactifAI},
  author={Pal, Mani},
  year={2026},
  note={Independent research manuscript}
}