System case study / 2024-2025

Project Chimera

700M hybrid Mamba-2 and Transformer LLM trained from scratch

active

PyTorch / Mamba-2 / DeepSpeed / Triton / GGUF / llama.cpp / Weights & Biases

LLM Training / Mamba / Transformers / Alignment / Tokenizer

Parameters

700M

Hybrid Mamba-2 and Transformer model.

Artifact

4.2GB

INT4 GGUF quantized checkpoint.

Latency

<3s

First token on CPU-only inference.

Motivation

Own the full path from tokenizer and architecture decisions through pretraining, reasoning fine-tuning, alignment, quantization, and local inference.

Train a 700M model from scratch rather than fine-tuning an existing checkpoint.
Use a hybrid SSM-attention stack to trade long-context efficiency against attention expressivity.
Support Hindi and English corpora with a custom BPE tokenizer.
Produce deployable INT4 GGUF artifacts for CPU-only inference.

Hybrid Mamba-2 and Transformer architecture, with SSM layers whose cost scales linearly with sequence length.
Interleaved attention at key depths to preserve token mixing behavior.
Custom BPE tokenizer trained on Hindi and English data.
GRPO chain-of-thought reasoning fine-tuning followed by DPO safety alignment and QLoRA adaptation.
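
The interleaving of attention among SSM layers can be pictured as a per-layer schedule. The sketch below is illustrative only: the function name, the 12-layer depth, and the "every fourth layer is attention" rule are assumptions made for the example, not the configuration Chimera actually uses.

```python
from typing import List


def hybrid_layer_schedule(n_layers: int, attention_every: int) -> List[str]:
    """Return the block type for each layer of a hybrid SSM-attention stack.

    Hypothetical rule: every `attention_every`-th layer is an attention
    block and all other layers are Mamba-2 (SSM) blocks.
    """
    if n_layers <= 0 or attention_every <= 0:
        raise ValueError("n_layers and attention_every must be positive")
    return [
        "attention" if (i + 1) % attention_every == 0 else "mamba2"
        for i in range(n_layers)
    ]


# Assumed depth and spacing, chosen only to keep the example small.
schedule = hybrid_layer_schedule(n_layers=12, attention_every=4)
for index, kind in enumerate(schedule):
    print(index, kind)
```

Keeping the schedule as plain data makes it easy to log alongside each run and to vary the attention spacing without touching the block implementations.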

Documented YaRN and LongRoPE investigations for context extension.
Used the GGUF INT4 deployment path to reduce the local inference footprint.
Tracked training and evaluation in W&B for reproducibility.
Benchmarked with MMLU and HumanEval to separate memorization from useful capability.
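
To make the INT4 deployment step more concrete, here is a minimal sketch of block-wise symmetric 4-bit quantization. It is a simplified illustration of the general idea, not the GGUF on-disk format or the llama.cpp quantizer; the block size of 32 and all function names are assumptions made for the example.

```python
import math
from typing import List, Tuple

BLOCK_SIZE = 32  # assumed block size for this sketch

Block = Tuple[float, List[int]]  # (scale, 4-bit integer codes)


def quantize_block(values: List[float]) -> Block:
    """Map one block of floats to integers in [-8, 7] plus a single scale."""
    max_abs = max((abs(v) for v in values), default=0.0)
    if max_abs == 0.0:
        return 0.0, [0] * len(values)
    scale = max_abs / 7.0  # the largest magnitude maps to +/-7
    codes = [max(-8, min(7, round(v / scale))) for v in values]
    return scale, codes


def quantize(weights: List[float]) -> List[Block]:
    """Split the weights into fixed-size blocks and quantize each one."""
    return [
        quantize_block(weights[i:i + BLOCK_SIZE])
        for i in range(0, len(weights), BLOCK_SIZE)
    ]


def dequantize(blocks: List[Block]) -> List[float]:
    """Reconstruct approximate floats from the scales and integer codes."""
    return [scale * code for scale, codes in blocks for code in codes]


# Toy weights, used only to exercise the round trip.
weights = [math.sin(i) for i in range(64)]
blocks = quantize(weights)
restored = dequantize(blocks)
worst_error = max(abs(a - b) for a, b in zip(weights, restored))
print(f"blocks={len(blocks)} worst_error={worst_error:.4f}")
```

Each block stores one scale, so the reconstruction error for any weight is bounded by half of that block's scale. Smaller blocks tighten the bound at the cost of storing more scales.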

Tokenizer design becomes a system constraint, not a preprocessing detail.
Hybrid SSM-attention architecture pushes complexity into evaluation and long-context validation.
Alignment work is only interpretable when pretraining and benchmark traces are preserved.