Mani Pal

Engineer-researcher

LLM systems, CUDA kernels, inference optimization, compression, interpretability, and distributed AI infrastructure.

System case study / 2024-2025

Project Chimera

700M hybrid Mamba-2 and Transformer LLM trained from scratch

Status: active
PyTorch / Mamba-2 / DeepSpeed / Triton / GGUF / llama.cpp / Weights & Biases
LLM Training / Mamba / Transformers / Alignment / Tokenizer

Parameters

700M

Hybrid Mamba-2 and Transformer model.

Artifact

4.2GB

INT4 GGUF quantized checkpoint.

Latency

<3s

First token on CPU-only inference.

Motivation

Own the full path from tokenizer and architecture decisions through pretraining, reasoning fine-tuning, alignment, quantization, and local inference.

Design Constraints

  • Train a 700M model from scratch rather than fine-tuning an existing checkpoint.
  • Use a hybrid SSM-attention stack to trade long-context efficiency against attention expressivity.
  • Support Hindi and English corpora with a custom BPE tokenizer.
  • Produce deployable INT4 GGUF artifacts for CPU-only inference.

System Architecture

  • Hybrid Mamba-2 and Transformer architecture with SSM layers for linear-complexity context handling.
  • Interleaved attention at key depths to preserve token mixing behavior.
  • Custom BPE tokenizer trained on Hindi and English data.
  • GRPO chain-of-thought reasoning fine-tuning followed by DPO safety alignment and QLoRA adaptation.
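As an illustration of interleaving attention at fixed depths, the sketch below builds a per-layer block schedule. The layer count (24), the spacing (every 6th layer), and the function name are hypothetical values chosen for the example, not the configuration used in Project Chimera.

```python
from typing import List


def hybrid_layer_schedule(n_layers: int, attention_every: int) -> List[str]:
    """Return one block type per layer: SSM by default, attention at regular depths.

    Both arguments are illustrative assumptions, not the project's real settings.
    """
    if n_layers <= 0 or attention_every <= 0:
        raise ValueError("n_layers and attention_every must be positive")
    schedule = []
    for depth in range(1, n_layers + 1):
        # Every `attention_every`-th layer is an attention block; the rest are SSM blocks.
        block = "attention" if depth % attention_every == 0 else "mamba2"
        schedule.append(block)
    return schedule


schedule = hybrid_layer_schedule(n_layers=24, attention_every=6)
print(schedule.count("mamba2"), "SSM blocks,", schedule.count("attention"), "attention blocks")
```

A schedule like this keeps most of the stack linear in sequence length while a small number of attention blocks provide global token mixing.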

Performance Bottlenecks

  • KV-cache memory pressure in the attention layers at extended context lengths.
  • Tokenizer balance across Hindi and English corpora.
  • Compute budget constraints around Chinchilla-style data scheduling.
  • Quantization quality tradeoffs at CPU inference targets.
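Two of these bottlenecks lend themselves to back-of-envelope arithmetic. The sketch below estimates KV-cache size for the attention layers only (SSM layers keep a fixed-size state instead of a growing cache) and a Chinchilla-style token budget using the commonly cited heuristic of roughly 20 training tokens per parameter. The attention-layer count, head count, head dimension, and context length are hypothetical inputs, not measured values from the project.

```python
def kv_cache_bytes(
    n_attn_layers: int,
    n_kv_heads: int,
    head_dim: int,
    seq_len: int,
    batch_size: int = 1,
    bytes_per_value: int = 2,  # 2 bytes assumes fp16/bf16 cache entries
) -> int:
    """Size of the key/value cache for the attention layers of a model."""
    # Leading factor of 2: one tensor for keys and one for values per layer.
    return 2 * n_attn_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_value


def chinchilla_token_budget(n_params: int, tokens_per_param: float = 20.0) -> int:
    """Approximate compute-optimal token count (about 20 tokens per parameter)."""
    return int(n_params * tokens_per_param)


# Hypothetical shape: 4 attention layers, 8 KV heads of dimension 64, 32k context.
cache = kv_cache_bytes(n_attn_layers=4, n_kv_heads=8, head_dim=64, seq_len=32_768)
print(f"KV cache: {cache / 2**20:.0f} MiB")

budget = chinchilla_token_budget(700_000_000)
print(f"Token budget: {budget / 1e9:.0f}B tokens")
```

Because the cache grows linearly with both sequence length and the number of attention layers, replacing most layers with SSM blocks reduces this term proportionally.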

Optimization Decisions

  • Documented YaRN and LongRoPE investigations for context extension.
  • Used GGUF INT4 deployment path for local inference footprint.
  • Tracked training and evaluation in W&B for reproducibility.
  • Benchmarked with MMLU and HumanEval to separate memorization from useful capability.

Benchmark Methodology

  • Evaluated downstream capability on MMLU and HumanEval.
  • Measured first-token latency with CPU-only inference.
  • Compared quantized artifact size and serving behavior.
  • Published weights and benchmark results to Hugging Face Hub.
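First-token latency can be measured with a small harness around any streaming generation call. The sketch below is a minimal example under assumptions: `stream_fn` is a hypothetical callable that yields tokens as they are produced, and `fake_stream` is a stand-in that sleeps briefly in place of a real CPU-only model.

```python
import time
from typing import Callable, Iterable, Tuple


def time_to_first_token(
    stream_fn: Callable[[str], Iterable[str]], prompt: str
) -> Tuple[float, str]:
    """Return (seconds until the first token arrives, the first token)."""
    start = time.perf_counter()
    stream = iter(stream_fn(prompt))
    first = next(stream)  # blocks until prompt processing yields a token
    elapsed = time.perf_counter() - start
    return elapsed, first


def fake_stream(prompt: str) -> Iterable[str]:
    """Stand-in for a real model: simulates prompt processing, then streams tokens."""
    time.sleep(0.05)
    yield "hello"
    yield " world"


latency, token = time_to_first_token(fake_stream, "Hello, how are you?")
print(f"first token {token!r} after {latency:.3f}s")
```

In practice the measurement would be repeated over several prompts and prompt lengths, with a warm-up run excluded, since prompt length and cold caches both dominate first-token time on CPU.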

Results

  • Completed pretraining of a 700M-parameter model from scratch.
  • Produced a 4.2GB INT4 GGUF model artifact.
  • Reached sub-3-second first-token latency on CPU-only inference.
  • Published model weights, logs, and architecture decisions.

Lessons Learned

  • Tokenizer design becomes a system constraint, not a preprocessing detail.
  • Hybrid SSM-attention architecture pushes complexity into evaluation and long-context validation.
  • Alignment work is only interpretable when pretraining and benchmark traces are preserved.
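One way to make the tokenizer constraint measurable is fertility: the average number of tokens produced per whitespace-delimited word, computed separately for each language. The sketch below is illustrative only. `toy_tokenize` is a hypothetical stand-in that chops words into fixed-size chunks; a real check would pass the trained BPE tokenizer's encode function instead.

```python
from typing import Callable, List, Sequence


def fertility(tokenize: Callable[[str], Sequence[str]], texts: List[str]) -> float:
    """Average tokens per whitespace-delimited word over a list of texts."""
    total_tokens = sum(len(tokenize(t)) for t in texts)
    total_words = sum(len(t.split()) for t in texts)
    if total_words == 0:
        raise ValueError("texts contain no words")
    return total_tokens / total_words


def toy_tokenize(text: str) -> List[str]:
    """Hypothetical stand-in tokenizer: splits each word into chunks of up to 4 characters."""
    pieces: List[str] = []
    for word in text.split():
        pieces.extend(word[i:i + 4] for i in range(0, len(word), 4))
    return pieces


english = ["the model reads text"]
hindi = ["मॉडल पाठ पढ़ता है"]

en_fertility = fertility(toy_tokenize, english)
hi_fertility = fertility(toy_tokenize, hindi)
print(f"English fertility: {en_fertility:.2f}")
print(f"Hindi fertility:   {hi_fertility:.2f}")
```

A large gap between the two numbers means one language consumes more of the context window and compute per word, which is how a tokenizer choice turns into a system-level constraint.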