System case study / 2024-2025
Project Chimera
700M hybrid Mamba-2 and Transformer LLM trained from scratch
active
PyTorch / Mamba-2 / DeepSpeed / Triton / GGUF / llama.cpp / Weights & Biases
LLM Training / Mamba / Transformers / Alignment / Tokenizer
Parameters
700M
Hybrid Mamba-2 and Transformer model.
Artifact
4.2GB
INT4 GGUF quantized checkpoint.
Latency
<3s
First token on CPU-only inference.
Motivation
Own the full path from tokenizer and architecture decisions through pretraining, reasoning fine-tuning, alignment, quantization, and local inference.
Design Constraints
- Train a 700M model from scratch rather than fine-tuning an existing checkpoint.
- Use a hybrid SSM-attention stack to trade long-context efficiency against attention expressivity.
- Support Hindi and English corpora with a custom BPE tokenizer.
- Produce deployable INT4 GGUF artifacts for CPU-only inference.
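One way to make the bilingual tokenizer constraint measurable is fertility, the average number of tokens per word in each language. The sketch below is illustrative only: `fertility` and `byte_tokenize` are hypothetical names, and the byte-level tokenizer is a stand-in, not the project's BPE tokenizer. It shows why a naive byte-level scheme penalizes Hindi: Devanagari code points take three bytes each in UTF-8.

```python
from typing import Callable, List


def fertility(text: str, tokenize: Callable[[str], List]) -> float:
    """Average number of tokens per whitespace-separated word."""
    words = text.split()
    if not words:
        raise ValueError("text contains no words")
    return len(tokenize(text)) / len(words)


def byte_tokenize(text: str) -> List[int]:
    # Hypothetical stand-in tokenizer: one token per UTF-8 byte.
    # A trained BPE tokenizer would merge frequent byte sequences.
    return list(text.encode("utf-8"))


english = "hello world"
hindi = "नमस्ते दुनिया"

en_fertility = fertility(english, byte_tokenize)
hi_fertility = fertility(hindi, byte_tokenize)

print(f"English tokens/word: {en_fertility:.1f}")
print(f"Hindi tokens/word:   {hi_fertility:.1f}")
```

Passing the real tokenizer's encode function as `tokenize` would give a like-for-like comparison on held-out Hindi and English text.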
System Architecture
- Hybrid Mamba-2 and Transformer architecture, with SSM layers whose cost grows linearly with sequence length.
- Attention layers interleaved at selected depths to preserve global token mixing.
- Custom BPE tokenizer trained on Hindi and English data.
- GRPO chain-of-thought reasoning fine-tuning followed by DPO safety alignment and QLoRA adaptation.
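The interleaving described above can be sketched as a per-layer schedule. The source does not say which depths carry attention, so the even spacing, the 24-layer depth, and the `hybrid_layer_schedule` helper are all assumptions made for illustration.

```python
from typing import List


def hybrid_layer_schedule(n_layers: int, attention_every: int) -> List[str]:
    """Return one block type per layer, with attention at every
    `attention_every`-th position and Mamba-2 blocks elsewhere."""
    if n_layers <= 0 or attention_every <= 0:
        raise ValueError("n_layers and attention_every must be positive")
    return [
        "attention" if (i + 1) % attention_every == 0 else "mamba2"
        for i in range(n_layers)
    ]


# Hypothetical configuration: 24 layers, attention at every 6th layer.
schedule = hybrid_layer_schedule(n_layers=24, attention_every=6)
for depth, kind in enumerate(schedule):
    print(depth, kind)
```

A schedule like this keeps most layers linear in sequence length while a few attention layers retain all-pairs token mixing.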
Performance Bottlenecks
- KV-cache memory pressure from the attention layers at extended context lengths.
- Tokenizer balance across Hindi and English corpora.
- Compute budget constraints around Chinchilla-style data scheduling.
- Quantization quality tradeoffs at CPU inference targets.
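Two of the bottlenecks above can be put in rough numbers. The sketch below estimates KV-cache size for the attention layers only (Mamba-2 layers carry a fixed-size state instead) and applies two common rules of thumb: about 20 training tokens per parameter for a Chinchilla-style budget, and about 6 x parameters x tokens for training FLOPs. The layer count, head count, head dimension, and context length are hypothetical values, not the project's configuration.

```python
def kv_cache_bytes(
    n_attn_layers: int,
    n_kv_heads: int,
    head_dim: int,
    seq_len: int,
    batch_size: int = 1,
    bytes_per_elem: int = 2,  # fp16 / bf16
) -> int:
    """KV-cache size: keys and values (factor 2) for every attention layer."""
    return (
        2 * n_attn_layers * n_kv_heads * head_dim
        * seq_len * batch_size * bytes_per_elem
    )


def chinchilla_tokens(n_params: int, tokens_per_param: float = 20.0) -> int:
    """Rule-of-thumb compute-optimal token budget."""
    return int(n_params * tokens_per_param)


def approx_train_flops(n_params: int, n_tokens: int) -> float:
    """Common approximation: about 6 FLOPs per parameter per token."""
    return 6.0 * n_params * n_tokens


# Hypothetical attention configuration at a 32K-token context.
cache = kv_cache_bytes(n_attn_layers=4, n_kv_heads=16, head_dim=64, seq_len=32_768)
tokens = chinchilla_tokens(700_000_000)
flops = approx_train_flops(700_000_000, tokens)

print(f"KV cache: {cache / 2**20:.0f} MiB")
print(f"Token budget: {tokens / 1e9:.0f}B tokens")
print(f"Training compute: {flops:.2e} FLOPs")
```

Because only the attention layers contribute to the cache, reducing their count or their KV heads lowers memory at long context directly.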
Optimization Decisions
- Documented YaRN and LongRoPE investigations for context extension.
- Used a GGUF INT4 deployment path to shrink the local inference footprint.
- Tracked training and evaluation in W&B for reproducibility.
- Benchmarked with MMLU and HumanEval to separate memorization from useful capability.
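To illustrate the quality tradeoff behind 4-bit deployment, here is a minimal sketch of symmetric block-wise 4-bit quantization and its round-trip error. It is a simplified scheme written for illustration: the block size, the scale rule, and the function names are assumptions, and the actual GGUF quantization formats in llama.cpp differ in their details.

```python
import random
from typing import List, Tuple

BLOCK_SIZE = 32  # assumption: one scale per block of 32 weights


def quantize_block(block: List[float]) -> Tuple[float, List[int]]:
    """Map a block of floats to signed 4-bit integers in [-8, 7] plus one scale."""
    amax = max(abs(x) for x in block)
    scale = amax / 7.0 if amax > 0 else 1.0
    q = [max(-8, min(7, round(x / scale))) for x in block]
    return scale, q


def dequantize_block(scale: float, q: List[int]) -> List[float]:
    """Recover approximate floats from the 4-bit codes."""
    return [scale * v for v in q]


def roundtrip_rmse(weights: List[float]) -> float:
    """Root-mean-square error after quantizing and dequantizing every block."""
    sq_err = 0.0
    for start in range(0, len(weights), BLOCK_SIZE):
        block = weights[start:start + BLOCK_SIZE]
        scale, q = quantize_block(block)
        restored = dequantize_block(scale, q)
        sq_err += sum((a - b) ** 2 for a, b in zip(block, restored))
    return (sq_err / len(weights)) ** 0.5


# Synthetic weights, not real model weights.
rng = random.Random(0)
weights = [rng.gauss(0.0, 0.02) for _ in range(4096)]
rmse = roundtrip_rmse(weights)
print(f"round-trip RMSE: {rmse:.5f}")
```

Comparing this error, or better a downstream metric such as perplexity, across bit widths is one way to quantify what INT4 costs relative to the unquantized checkpoint.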
Benchmark Methodology
- Evaluated downstream capability on MMLU and HumanEval.
- Measured first-token latency with CPU-only inference.
- Compared quantized artifact size and serving behavior.
- Published weights and benchmark results to Hugging Face Hub.
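First-token latency can be measured by timing from prompt submission to the first streamed token. The harness below is a sketch under assumptions: `time_to_first_token` is a hypothetical helper, and `fake_generate` is a stand-in generator that simulates prompt processing with a sleep. It does not call llama.cpp or the actual model.

```python
import time
from typing import Callable, Iterable, Tuple


def time_to_first_token(
    generate: Callable[[str], Iterable[str]], prompt: str
) -> Tuple[str, float]:
    """Return the first streamed token and the seconds elapsed until it arrived."""
    start = time.perf_counter()
    stream = iter(generate(prompt))
    try:
        first = next(stream)
    except StopIteration:
        raise RuntimeError("generator produced no tokens") from None
    elapsed = time.perf_counter() - start
    return first, elapsed


def fake_generate(prompt: str) -> Iterable[str]:
    # Hypothetical stand-in for a streaming inference call.
    time.sleep(0.05)  # simulated prompt processing
    yield "Hello"
    yield " world"


first, seconds = time_to_first_token(fake_generate, "Say hello")
print(f"first token {first!r} after {seconds:.3f}s")
```

In practice the measurement would be repeated over several prompts and prompt lengths, with the median reported, since first-token latency grows with prompt size.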
Results
- Completed pretraining of a 700M-parameter model from scratch.
- Produced a 4.2GB INT4 GGUF model artifact.
- Reached sub-3-second first-token latency on CPU-only inference.
- Published model weights, logs, and architecture decisions.
Lessons Learned
- Tokenizer design becomes a system constraint, not a preprocessing detail.
- A hybrid SSM-attention architecture pushes complexity into evaluation and long-context validation.
- Alignment work is only interpretable when pretraining and benchmark traces are preserved.