📖 The Complete Guide to Post-Training of Large Language Models

From Pretraining to Alignment: Everything You Need to Know


Who is this for? You've learned how pretraining works: you understand GPT-2, transformer architectures, next-token prediction, and the cross-entropy loss. Now you want to understand what happens after pretraining: how raw language models become helpful assistants like ChatGPT, Claude, and Gemini.


πŸ—ΊοΈ Roadmap

| # | Chapter | What You'll Learn |
|---|---------|-------------------|
| 1 | The Big Picture | Why pretrained models aren't useful yet; the 3-stage pipeline |
| 2 | SFT | Supervised Fine-Tuning: loss function, data formats, key papers |
| 3 | RLHF | Reward models, PPO, KL divergence, reward hacking |
| 4 | DPO | Direct Preference Optimization: RLHF without RL |
| 5 | Preference Zoo | KTO, ORPO, SimPO, CPO, IPO, Online DPO |
| 6 | GRPO & Reasoning | DeepSeek-R1, reward functions, the reasoning revolution |
| 7 | PEFT | LoRA, QLoRA: fine-tune on consumer GPUs |
| 8 | Toolbox | TRL, Transformers, vLLM, Accelerate, DeepSpeed |
| 9 | Datasets | What to train on: curated lists with Hub links |
| 10 | Evaluation | Benchmarks, LLM-as-Judge, human eval |
| 11 | Full Recipe | End-to-end pipeline with code |
| 12 | Reading List | 18 must-read papers in 4 tiers |

The Three Stages of Post-Training

┌──────────────┐     ┌──────────────────┐     ┌─────────────────────────┐
│ STAGE 1: SFT │ ──> │ STAGE 2: Reward  │ ──> │ STAGE 3: RL             │
│              │     │ Model Training   │     │ (PPO / DPO / GRPO)      │
│ Teach format │     │ Learn preferences│     │ Optimize for preferences│
│ & behavior   │     │ from comparisons │     │ while staying close to  │
│              │     │                  │     │ the SFT model           │
└──────────────┘     └──────────────────┘     └─────────────────────────┘

Input: Pretrained LM                         Output: Aligned Assistant
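The "staying close to the SFT model" constraint in stage 3 is usually enforced with a KL-divergence penalty: the policy is rewarded for responses the reward model likes, minus a term that grows as its output distribution drifts from the frozen SFT reference. A minimal numeric sketch of that objective (the distributions, the penalty coefficient `beta`, and the reward score here are all toy values for illustration, not from any real model):

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy next-token distributions over a 3-token vocabulary:
# the policy being trained vs. the frozen SFT reference model.
policy = [0.7, 0.2, 0.1]
sft_ref = [0.4, 0.4, 0.2]

beta = 0.1    # KL penalty coefficient (illustrative value)
reward = 1.0  # reward-model score for a sampled response (illustrative value)

# The KL-regularized objective maximized in stage 3:
# reward minus a penalty for drifting away from the SFT model.
penalized = reward - beta * kl_divergence(policy, sft_ref)
print(round(penalized, 4))  # → 0.9816
```

The larger `beta` is, the more the optimization is pulled back toward the SFT model's behavior; with `beta = 0` the policy is free to chase reward alone, which is where reward hacking (chapter 3) sets in.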

The Evolution Timeline

| Year | Method | Key Idea |
|------|--------|----------|
| 2017 | RLHF (original) | Human preferences → reward model → RL |
| 2020 | RLHF for LLMs | Applied to text summarization |
| 2022 | InstructGPT | Full SFT → RM → PPO pipeline |
| 2022 | Constitutional AI | AI feedback replaces human feedback |
| 2023 | DPO | No reward model needed: direct optimization |
| 2024 | KTO / ORPO | Binary feedback / combined SFT + preference loss |
| 2024 | GRPO | Group-based RL for reasoning (DeepSeek) |
| 2025 | DeepSeek-R1 | RL teaches chain-of-thought from scratch |