📖 The Complete Guide to Post-Training of Large Language Models

From Pretraining to Alignment: Everything You Need to Know


Who is this for? You've learned how pretraining works: you understand GPT-2, transformer architectures, next-token prediction, and the cross-entropy loss. Now you want to understand what happens after pretraining: how raw language models become helpful assistants like ChatGPT, Claude, and Gemini.


πŸ—ΊοΈ Roadmap

| # | Chapter | What You'll Learn |
|---|---------|-------------------|
| 1 | The Big Picture | Why pretrained models aren't useful yet; the 3-stage pipeline |
| 2 | SFT | Supervised Fine-Tuning: loss function, data formats, key papers |
| 3 | RLHF | Reward models, PPO, KL divergence, reward hacking |
| 4 | DPO | Direct Preference Optimization: RLHF without RL |
| 5 | Preference Zoo | KTO, ORPO, SimPO, CPO, IPO, Online DPO |
| 6 | GRPO & Reasoning | DeepSeek-R1, reward functions, the reasoning revolution |
| 7 | PEFT | LoRA, QLoRA: fine-tune on consumer GPUs |
| 8 | Toolbox | TRL, Transformers, vLLM, Accelerate, DeepSpeed |
| 9 | Datasets | What to train on: curated lists with Hub links |
| 10 | Evaluation | Benchmarks, LLM-as-Judge, human eval |
| 11 | Full Recipe | End-to-end pipeline with code |
| 12 | Reading List | 18 must-read papers in 4 tiers |

The Three Stages of Post-Training

┌──────────────┐     ┌──────────────────┐     ┌─────────────────────────┐
│ STAGE 1: SFT │ ──> │ STAGE 2: Reward  │ ──> │ STAGE 3: RL             │
│              │     │ Model Training   │     │ (PPO / DPO / GRPO)      │
│ Teach format │     │ Learn preferences│     │ Optimize for preferences│
│ & behavior   │     │ from comparisons │     │ while staying close to  │
│              │     │                  │     │ the SFT model           │
└──────────────┘     └──────────────────┘     └─────────────────────────┘

Input: Pretrained LM                         Output: Aligned Assistant
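The "staying close to the SFT model" constraint in stage 3 is usually enforced with a KL-divergence penalty: the policy is rewarded for responses the reward model likes, minus a term that grows as its output distribution drifts from the frozen SFT reference. A minimal numeric sketch of that objective (the distributions, the penalty coefficient `beta`, and the reward score here are all toy values for illustration, not from any real model):

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy next-token distributions over a 3-token vocabulary:
# the policy being trained vs. the frozen SFT reference model.
policy = [0.7, 0.2, 0.1]
sft_ref = [0.4, 0.4, 0.2]

beta = 0.1    # KL penalty coefficient (illustrative value)
reward = 1.0  # reward-model score for a sampled response (illustrative value)

# The KL-regularized objective maximized in stage 3:
# reward minus a penalty for drifting away from the SFT model.
penalized = reward - beta * kl_divergence(policy, sft_ref)
print(round(penalized, 4))  # → 0.9816
```

The larger `beta` is, the more the optimization is pulled back toward the SFT model's behavior; with `beta = 0` the policy is free to chase reward alone, which is where reward hacking (chapter 3) sets in.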

The Evolution Timeline

| Year | Method | Key Idea |
|------|--------|----------|
| 2017 | RLHF (original) | Human preferences → reward model → RL |
| 2020 | RLHF for LLMs | Applied to text summarization |
| 2022 | InstructGPT | Full SFT → RM → PPO pipeline |
| 2022 | Constitutional AI | AI feedback replaces human feedback |
| 2023 | DPO | No reward model needed: direct optimization |
| 2024 | KTO / ORPO | Binary feedback / combined SFT + preference loss |
| 2024 | GRPO | Group-based RL for reasoning (DeepSeek) |
| 2025 | DeepSeek-R1 | RL teaches chain-of-thought from scratch |