Independent Research / 2026

Adaptive Tensor-Network Compression of LLMs: An Extension of CompactifAI

Layer-sensitive MPO tensorization with policy-guided bond dimensions

Mani Pal

Model Compression, Tensor Networks, MPO, Quantization, LLM Evaluation

Memory reduction: 93% (best reproduced tensor-network compression setting)

Adaptive gain: +1.2% (additional recovered accuracy over uniform schedules)

Benchmarks: MMLU, HellaSwag, BoolQ, TriviaQA, and GSM8K

Abstract

This project reproduces and extends CompactifAI-style tensor-network compression on real open-weight LLMs. It profiles how sensitive each layer is to compression, replaces uniform bond dimensions with adaptive per-block schedules, and evaluates the compressed models after healing (brief post-compression fine-tuning) on standard language benchmarks.

Problem Statement

Uniform tensor-network compression treats all transformer blocks as equally redundant, but LLM layers differ in how much compression they tolerate. The research question is whether assigning bond dimensions adaptively, block by block, can preserve downstream quality better at the same overall compression ratio.
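An MPO stores a matrix as a chain of small cores, and its parameter count depends on the bond dimensions between those cores. This makes "same compression ratio" concrete. The minimal sketch below counts MPO parameters for a given maximum bond dimension chi. It assumes the matrix's row and column sizes are factored into equal-length tuples (the hypothetical `out_dims` and `in_dims`) and caps each bond at the largest rank possible at that cut. The function names and the 4096 x 4096 example shape are illustrative; they are not taken from the project.

```python
from math import prod


def mpo_bond_dims(out_dims, in_dims, chi):
    """Bond dimensions of an MPO whose bonds are truncated to at most chi.

    Hypothetical helper: bond k sits between core k-1 and core k and is capped by
    the largest rank possible at that cut, so small cores are never padded.
    """
    site = [o * i for o, i in zip(out_dims, in_dims)]
    bonds = [1]  # open boundary on the left
    for k in range(1, len(site)):
        bonds.append(min(chi, prod(site[:k]), prod(site[k:])))
    bonds.append(1)  # open boundary on the right
    return bonds


def mpo_param_count(out_dims, in_dims, chi):
    """Total parameters in the cores; core k has shape (b_k, out_k, in_k, b_{k+1})."""
    b = mpo_bond_dims(out_dims, in_dims, chi)
    return sum(b[k] * out_dims[k] * in_dims[k] * b[k + 1] for k in range(len(out_dims)))


if __name__ == "__main__":
    out_dims = in_dims = (16, 16, 16)  # illustrative 4096 x 4096 matrix, 4096 = 16*16*16
    dense = prod(out_dims) * prod(in_dims)
    for chi in (10, 30, 50, 90):
        mpo = mpo_param_count(out_dims, in_dims, chi)
        print(f"chi={chi:3d}  params={mpo:8d}  reduction={1 - mpo / dense:.2%}")
```

The middle core grows with chi squared, so a non-uniform schedule does not spend the same budget as a uniform schedule at the mean chi. For example, two blocks at chi=50 and chi=10 hold more parameters than two blocks at chi=30. A fair comparison at "the same compression ratio" therefore has to match summed parameter counts, not average bond dimensions.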

Methodology

Implemented Matrix Product Operator (MPO) tensorization of the self-attention and MLP weight matrices using sequential SVD.
Swept the bond dimension chi from 10 to 90 independently for each attention block and layer type.
Trained a REINFORCE policy to assign per-block bond dimensions, using downstream MMLU accuracy as the reward.
Combined adaptive MPO schedules with model healing and optional soft gating adapters.
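The first methodology step can be sketched in NumPy as MPO tensorization by sequential (left-to-right) truncated SVD. This is a minimal sketch, not the project's implementation. It assumes the weight matrix's row and column sizes factor into tuples of equal length (the hypothetical `out_dims` and `in_dims`) and keeps at most `chi` singular values on every bond. The function names are illustrative.

```python
import numpy as np


def matrix_to_mpo(W, out_dims, in_dims, chi):
    """Split W (prod(out_dims) x prod(in_dims)) into MPO cores by sequential truncated SVD.

    Core k has shape (left_bond, out_dims[k], in_dims[k], right_bond); each bond
    keeps at most `chi` singular values.
    """
    n = len(out_dims)
    if len(in_dims) != n or W.shape != (int(np.prod(out_dims)), int(np.prod(in_dims))):
        raise ValueError("W shape does not match out_dims x in_dims")
    # (o1..on, i1..in) -> interleaved (o1, i1, ..., on, in): each core owns one (o_k, i_k) pair.
    T = W.reshape(*out_dims, *in_dims).transpose([ax for k in range(n) for ax in (k, n + k)])
    cores, left = [], 1
    rest = T.reshape(out_dims[0] * in_dims[0], -1)
    for k in range(n - 1):
        U, S, Vh = np.linalg.svd(rest, full_matrices=False)
        r = min(chi, S.size)  # truncate this bond to at most chi
        cores.append(U[:, :r].reshape(left, out_dims[k], in_dims[k], r))
        # Absorb the kept singular values to the right and expose the next (o, i) pair.
        rest = (S[:r, None] * Vh[:r]).reshape(r * out_dims[k + 1] * in_dims[k + 1], -1)
        left = r
    cores.append(rest.reshape(left, out_dims[-1], in_dims[-1], 1))
    return cores


def mpo_to_matrix(cores):
    """Contract MPO cores back into a dense matrix (used to measure reconstruction error)."""
    n = len(cores)
    T = cores[0]
    for core in cores[1:]:
        T = np.tensordot(T, core, axes=([-1], [0]))  # contract the shared bond index
    T = T.reshape(T.shape[1:-1])  # drop the two size-1 boundary bonds
    # Interleaved (o1, i1, ..., on, in) -> (o1..on, i1..in), then flatten to a matrix.
    T = T.transpose([2 * k for k in range(n)] + [2 * k + 1 for k in range(n)])
    rows = int(np.prod([c.shape[1] for c in cores]))
    return T.reshape(rows, -1)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.standard_normal((64, 64))  # toy stand-in for a weight matrix
    for chi in (4, 8, 16):
        cores = matrix_to_mpo(W, (4, 4, 4), (4, 4, 4), chi)
        err = np.linalg.norm(mpo_to_matrix(cores) - W) / np.linalg.norm(W)
        print(f"chi={chi:2d}  params={sum(c.size for c in cores):5d}  rel_error={err:.3f}")
```

With `chi` at or above the largest possible rank at every cut (16 here), the reconstruction is exact. Smaller values trade error for parameters. A random Gaussian matrix has no low-rank structure, so these errors say nothing about how real LLM weights behave. That question is what the chi sweep measures.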
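The REINFORCE step can be sketched with a factorized categorical policy: one softmax over candidate bond dimensions per block, updated with the score-function gradient and a batch-mean baseline. Running MMLU for every sampled schedule is expensive, so this sketch substitutes a hypothetical `toy_reward`. It penalizes sensitive blocks compressed below chi=50 and charges every block for its bond dimension. The chi grid, the sensitivities, and the hyperparameters are all assumptions, not the project's settings.

```python
import numpy as np

# Hypothetical candidate grid spanning the 10-90 range from the sweep.
CHI_CHOICES = np.array([10, 30, 50, 70, 90])


def toy_reward(chis, sensitivity, cost_weight=0.5):
    """HYPOTHETICAL stand-in for MMLU accuracy (higher is better).

    Sensitive blocks lose reward when chi < 50; every block pays a cost
    proportional to its bond dimension (a proxy for parameters kept).
    """
    damage = sensitivity * np.maximum(0, 50 - chis) / 20.0
    cost = cost_weight * chis / CHI_CHOICES.max()
    return -(damage + cost).sum(axis=-1)


def softmax(logits):
    z = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)


def train_policy(sensitivity, steps=500, batch=64, lr=0.5, seed=0):
    """REINFORCE over per-block bond dimensions with a factorized softmax policy."""
    rng = np.random.default_rng(seed)
    n_blocks, n_actions = len(sensitivity), len(CHI_CHOICES)
    logits = np.zeros((n_blocks, n_actions))
    eye = np.eye(n_actions)
    for _ in range(steps):
        probs = softmax(logits)  # (n_blocks, n_actions)
        # Sample `batch` schedules: one chi index per block per sample.
        actions = np.stack(
            [rng.choice(n_actions, size=batch, p=p) for p in probs], axis=1
        )  # (batch, n_blocks)
        rewards = toy_reward(CHI_CHOICES[actions], sensitivity)  # (batch,)
        advantages = rewards - rewards.mean()  # batch-mean baseline
        # Gradient of log pi(schedule) w.r.t. each block's logits: onehot(action) - probs.
        score = eye[actions] - probs[None]  # (batch, n_blocks, n_actions)
        logits += lr * (advantages[:, None, None] * score).mean(axis=0)
    return logits


if __name__ == "__main__":
    sensitivity = np.array([1.0, 0.5, 0.2, 0.0])  # made-up: early blocks most fragile
    logits = train_policy(sensitivity)
    print("greedy schedule:", CHI_CHOICES[logits.argmax(axis=1)])
```

The factorized policy keeps the action space linear in the number of blocks. The cost is that it cannot represent interactions between blocks. In the real setting each reward is a benchmark run, and that latency is what drives the policy-search cost noted under Limitations.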

Experimental Design

Reproduced baseline compression behavior on LLaMA-3.2-1B and Qwen2.5-1.5B style targets.
Profiled 32 attention blocks and seven layer families before constructing a non-uniform compression schedule.
Healed each compressed model with one epoch of Alpaca-style instruction fine-tuning.
Evaluated with lm-evaluation-harness across MMLU, HellaSwag, BoolQ, TriviaQA, and GSM8K.
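The per-block profiling that precedes the non-uniform schedule can be sketched as an independent sweep. Compress one block at a time, measure the accuracy drop, and keep the smallest chi whose drop stays within a tolerance. `evaluate_drop` is a hypothetical callback standing in for "compress block b to bond dimension chi and return the benchmark accuracy drop". The 1% default tolerance echoes the threshold quoted in the results but is otherwise an assumption.

```python
from typing import Callable, Iterable, List, Optional


def profile_min_safe_chi(
    evaluate_drop: Callable[[int, int], float],
    n_blocks: int,
    chis: Iterable[int] = range(10, 100, 10),  # chi = 10, 20, ..., 90
    tolerance: float = 0.01,
) -> List[Optional[int]]:
    """Smallest bond dimension per block whose accuracy drop is within `tolerance`.

    Each block is profiled independently (all other blocks left uncompressed).
    Assumes the drop shrinks as chi grows, so the first passing chi is the smallest.
    Returns None for a block if no tested chi passes (keep it dense or treat separately).
    """
    grid = sorted(chis)
    schedule: List[Optional[int]] = []
    for block in range(n_blocks):
        safe = None
        for chi in grid:  # try the most aggressive compression first
            if evaluate_drop(block, chi) <= tolerance:
                safe = chi
                break
        schedule.append(safe)
    return schedule


if __name__ == "__main__":
    # Made-up fragility thresholds: block b breaks when chi < fragile[b].
    fragile = [50, 30, 10]

    def toy_drop(block: int, chi: int) -> float:
        return 0.05 if chi < fragile[block] else 0.005

    print(profile_min_safe_chi(toy_drop, n_blocks=len(fragile)))
```

Profiling blocks one at a time ignores interactions between compressed blocks. The resulting per-block minimums are therefore a starting schedule, to be checked (and healed) with all blocks compressed together.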

Results

Matched the original 93% memory-reduction target at 1B-scale reproduction settings.
Found that the earliest blocks collapse below chi=50, while the final blocks tolerate chi=10 with an MMLU drop of under 1%.
Adaptive policy recovered 1.2% additional accuracy at matched compression versus uniform chi baselines.
A 70% parameter reduction produced a 2% to 3% drop in downstream accuracy after healing.

Limitations

The reproduction is computationally verifiable only at smaller model scale; it is not a full 7B training campaign.
Policy search cost grows with block count and benchmark feedback latency.
Compression interacts with quantization and adapter healing in ways that still need to be isolated experimentally.

Future Directions

Replace REINFORCE with differentiable schedule search or bandit-style block allocation.
Profile attention heads and MLP projections separately instead of block-level schedules.
Test adaptive tensorization under long-context inference and KV-cache pressure.
Publish per-layer sensitivity traces as reusable compression priors.

References

CompactifAI
lm-evaluation-harness

BibTeX

@misc{pal2026compactifai,
  title={Adaptive Tensor-Network Compression of LLMs: An Extension of CompactifAI},
  author={Pal, Mani},
  year={2026},
  note={Independent research manuscript}
}