The Complete Guide to Post-Training of Large Language Models
From Pretraining to Alignment: Everything You Need to Know
Who is this for? You've learned how pretraining works: you understand GPT-2, transformer architectures, next-token prediction, and the cross-entropy loss. Now you want to understand what happens after pretraining: how raw language models become helpful assistants like ChatGPT, Claude, and Gemini.
Roadmap
| # | Chapter | What You'll Learn |
|---|---|---|
| 1 | The Big Picture | Why pretrained models aren't useful yet; the 3-stage pipeline |
| 2 | SFT | Supervised Fine-Tuning: loss function, data formats, key papers |
| 3 | RLHF | Reward models, PPO, KL divergence, reward hacking |
| 4 | DPO | Direct Preference Optimization β RLHF without RL |
| 5 | Preference Zoo | KTO, ORPO, SimPO, CPO, IPO, Online DPO |
| 6 | GRPO & Reasoning | DeepSeek-R1, reward functions, the reasoning revolution |
| 7 | PEFT | LoRA, QLoRA β fine-tune on consumer GPUs |
| 8 | Toolbox | TRL, Transformers, vLLM, Accelerate, DeepSpeed |
| 9 | Datasets | What to train on β curated lists with Hub links |
| 10 | Evaluation | Benchmarks, LLM-as-Judge, human eval |
| 11 | Full Recipe | End-to-end pipeline with code |
| 12 | Reading List | 18 must-read papers in 4 tiers |
The Three Stages of Post-Training
┌───────────────┐      ┌───────────────────┐      ┌──────────────────────────┐
│ STAGE 1: SFT  │ ──>  │ STAGE 2: Reward   │ ──>  │ STAGE 3: RL              │
│               │      │ Model Training    │      │ (PPO / DPO / GRPO)       │
│ Teach format  │      │ Learn preferences │      │ Optimize for preferences │
│ & behavior    │      │ from comparisons  │      │ while staying close to   │
│               │      │                   │      │ the SFT model            │
└───────────────┘      └───────────────────┘      └──────────────────────────┘
Input: Pretrained LM                               Output: Aligned Assistant
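The SFT stage in the diagram reuses pretraining's next-token cross-entropy loss, just computed on curated instruction-response data. A minimal pure-Python sketch of that loss at a single token position, using a toy 4-token vocabulary and hypothetical logit values:

```python
import math

def cross_entropy(logits, target_id):
    """Next-token cross-entropy: -log softmax(logits)[target_id].
    Uses the max-subtraction trick for numerical stability."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target_id]

# Toy vocabulary of 4 tokens; logits for one position (hypothetical values).
logits = [2.0, 0.5, -1.0, 0.1]
loss = cross_entropy(logits, target_id=0)  # target is the correct next token
print(round(loss, 4))  # 0.3524
```

In a real SFT run this is averaged over every token of the response, typically with the prompt tokens masked out of the loss so the model is only trained to produce the assistant's reply.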
The Evolution Timeline
| Year | Method | Key Idea |
|---|---|---|
| 2017 | RLHF (original) | Human preferences → reward model → RL |
| 2020 | RLHF for LLMs | Applied to text summarization |
| 2022 | InstructGPT | Full SFT → RM → PPO pipeline |
| 2022 | Constitutional AI | AI feedback replaces human feedback |
| 2023 | DPO | No reward model needed; direct optimization |
| 2024 | KTO / ORPO | Binary feedback / combined SFT+preference |
| 2024 | GRPO | Group-based RL for reasoning (DeepSeek) |
| 2025 | DeepSeek-R1 | RL teaches chain-of-thought from scratch |
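To preview the 2023 DPO row above: DPO collapses the reward model and RL stages into a single closed-form loss on preference pairs, -log σ(β · ((log π(chosen) - log π_ref(chosen)) - (log π(rejected) - log π_ref(rejected)))). A toy pure-Python sketch, where the inputs are hypothetical summed log-probabilities of whole responses:

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss: -log sigmoid(beta * (policy margin - reference margin)).
    Inputs are summed log-probs of full responses under policy and reference."""
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Hypothetical log-probs for one chosen vs. rejected response pair.
loss = dpo_loss(pi_chosen=-12.0, pi_rejected=-15.0,
                ref_chosen=-13.0, ref_rejected=-14.0)
print(round(loss, 4))  # 0.5981
```

Note how the reference model only enters through the margin: if the policy equals the reference, the margin is zero and the loss sits at log 2, so gradients push the policy to widen the chosen-over-rejected gap relative to the reference.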