Chapter 2: An RL Crash Course, from PG to PPO to GRPO

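Six core RL algorithms, each building on the last: REINFORCE → Actor-Critic → PPO → DPO → GRPO → DAPO. Each one gets a one-sentence intuition, the key formula, and the engineering essentials. The chapter closes with two questions: why did GRPO displace PPO in 2024-2025 as the de facto standard for LLM post-training, and is it really "better"? This chapter is the mathematical foundation for the later chapters; once it is clear, RL papers become far less intimidating.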

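📑 Contents

1. RL basics: Policy / Reward / Trajectory
2. Policy Gradient: REINFORCE
3. Actor-Critic: add a baseline to cut variance
4. PPO: the "old king" of LLM RL
5. DPO: the no-RL, no-RM bypass
6. GRPO: DeepSeek's secret weapon
7. DAPO: industrial-strength fixes for GRPO
8. PPO vs GRPO vs DPO comparison table
Self-check checklist
References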

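1. RL basics: Policy / Reward / Trajectory

1.1 Three core concepts

Concept	Meaning	LLM counterpart
Policy $\pi_\theta(a|s)$	Probability distribution over actions a given state s	The LLM's distribution over the next token, given the prompt and the tokens generated so far
Trajectory $\tau$	State-action sequence $(s_0, a_0, s_1, a_1, \ldots, s_T, a_T)$	The full generated sequence (prompt + completion)
Reward $R(\tau)$	Reward for the whole trajectory	Math answer right/wrong, unit-test pass rate, verifier score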

Objective: maximize the expected reward

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]$$
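To make the objective concrete, here is a minimal sketch that estimates $J(\theta)$ by sampling for a toy one-step policy. The three-outcome distribution, the reward values, and the function names are illustrative assumptions and do not come from any real training setup.

```python
import random


def exact_expected_reward(probs, rewards):
    """J = sum over outcomes of pi(outcome) * R(outcome)."""
    return sum(p * r for p, r in zip(probs, rewards))


def sampled_expected_reward(probs, rewards, n_samples, seed=0):
    """Monte Carlo estimate of J: average the reward of sampled outcomes."""
    rng = random.Random(seed)
    outcomes = list(range(len(probs)))
    total = 0.0
    for _ in range(n_samples):
        idx = rng.choices(outcomes, weights=probs)[0]
        total += rewards[idx]
    return total / n_samples


# Toy policy (assumed): three possible completions, only the first is correct.
toy_probs = [0.5, 0.3, 0.2]
toy_rewards = [1.0, 0.0, 0.0]
j_exact = exact_expected_reward(toy_probs, toy_rewards)
j_estimate = sampled_expected_reward(toy_probs, toy_rewards, n_samples=20000)
```

In LLM training the exact sum is intractable because the space of completions is astronomically large, so only the sampled version is available in practice.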

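1.2 What is different for LLMs

Dimension	Classic RL	LLM RL
State	Game frames, sensor readings	Prompt + tokens generated so far
Action	Discrete (up/down/left/right) or continuous (torque)	Next token (a vocabulary of roughly 50K entries or more)
Trajectory length	Hundreds to thousands of steps	Tens to thousands of tokens
Reward timing	Possible at every step	Almost only at the end (whole sequence right/wrong)
Exploration cost	Nearly free (simulator)	Every trajectory requires LLM inference, which is expensive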

🌟 Sparse end-of-sequence rewards and expensive rollouts are the two core constraints of LLM RL; they shape the design of every algorithm in this chapter.
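The mapping from text generation to RL steps can be sketched as follows. This is a toy illustration under stated assumptions: tokens are plain strings, and `Step`, `to_trajectory`, and `toy_verify` are hypothetical names invented here, not part of any library.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Step:
    state: Tuple[str, ...]  # prompt tokens + tokens generated so far
    action: str             # the next token
    reward: float           # 0.0 everywhere except the final step


def to_trajectory(prompt: List[str], completion: List[str],
                  verify: Callable[[List[str]], float]) -> List[Step]:
    """Unroll one (prompt, completion) pair into token-level RL steps."""
    steps = []
    for t, token in enumerate(completion):
        state = tuple(prompt + completion[:t])
        is_last = t == len(completion) - 1
        # Sparse reward: the verifier only scores the finished sequence.
        reward = verify(completion) if is_last else 0.0
        steps.append(Step(state, token, reward))
    return steps


def toy_verify(completion: List[str]) -> float:
    """Hypothetical verifier: reward 1.0 if the final token is the right answer."""
    return 1.0 if completion and completion[-1] == "4" else 0.0


traj = to_trajectory(["2", "+", "2", "="], ["The", "answer", "is", "4"], toy_verify)
```

Every step except the last carries zero reward, which is the sparse end-of-sequence structure described above.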

2. Policy Gradient: REINFORCE

2.1 One-sentence intuition

Look at which trajectories worked well and raise their probability; push down the ones that worked badly.

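2.2 Formula

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t | s_t) \cdot R(\tau)\right]$$

Every token's gradient is weighted by the trajectory's total reward: a high reward pushes the probability of that path up strongly, while a low reward pushes it up only slightly (and a negative reward pushes it down).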

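2.3 The fatal flaw: high variance

$R(\tau)$ is a random quantity from a single rollout, and its variance is very high, so training is unstable.

In expectation:   E[R] = 5 (the model is right about half the time, with reward 10 per correct answer)
Single rollout:   either 0 (happened to be wrong) or 10 (happened to be right)
High variance → noisy policy-gradient estimate → the model learns erratically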
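The numbers in the example above can be checked with a few lines of arithmetic. This sketch assumes a binary reward (10 when correct, 0 otherwise) and a 50% success rate, which is one reading of the example; the helper names are made up for illustration.

```python
import math


def reward_mean_and_std(p_correct, reward_if_correct):
    """Mean and std of a reward that equals reward_if_correct with
    probability p_correct and 0 otherwise."""
    mean = p_correct * reward_if_correct
    variance = p_correct * (1.0 - p_correct) * reward_if_correct ** 2
    return mean, math.sqrt(variance)


def std_of_batch_average(single_std, batch_size):
    """Averaging independent rollouts shrinks the std by sqrt(batch_size)."""
    return single_std / math.sqrt(batch_size)


mean_r, std_r = reward_mean_and_std(0.5, 10.0)
std_64 = std_of_batch_average(std_r, 64)
```

Under these assumptions the single-rollout standard deviation is as large as the mean itself, and averaging 64 independent rollouts still leaves noticeable noise, which is why raw REINFORCE needs large batches.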

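2.4 Implementation (almost never used as-is)

# Pseudocode: real RL training does not use raw REINFORCE.
# rollout, verify, and log_pi are placeholders for sampling a completion,
# scoring it, and computing log pi_theta(a | s).
for batch in data:
    trajectories = [rollout(p) for p in batch.prompts]
    rewards = [verify(t) for t in trajectories]

    # Negative sign: the optimizer minimizes, but we want to maximize
    # the reward-weighted log-likelihood. Average over the batch.
    loss = -sum(r * sum(log_pi(a, s) for s, a in t)
                for t, r in zip(trajectories, rewards)) / len(trajectories)
    optimizer.zero_grad()   # clear gradients left over from the previous batch
    loss.backward()
    optimizer.step()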
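For something that actually runs, the same update can be sketched on a toy problem without any deep-learning framework. The following assumes a three-armed bandit (one-step trajectories) with a softmax policy over logits; the setup, learning rate, and function names are illustrative assumptions.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def reinforce_bandit(rewards, steps=2000, lr=0.1, seed=0):
    """Raw REINFORCE on a bandit: one action per trajectory, no baseline."""
    rng = random.Random(seed)
    logits = [0.0] * len(rewards)
    arms = list(range(len(rewards)))
    for _ in range(steps):
        probs = softmax(logits)
        a = rng.choices(arms, weights=probs)[0]  # rollout
        r = rewards[a]                           # verify
        # For a softmax policy, d log pi(a) / d logit_k = 1[k == a] - probs[k].
        for k in arms:
            grad_log_pi = (1.0 if k == a else 0.0) - probs[k]
            logits[k] += lr * r * grad_log_pi    # gradient ascent on J
    return softmax(logits)


final_probs = reinforce_bandit([0.0, 0.0, 1.0])
```

With these deterministic 0/1 rewards the policy concentrates on the rewarded arm; with noisy rewards the same loop becomes visibly unstable, which motivates the baseline in the next section.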

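3. Actor-Critic: add a baseline to cut variance

3.1 Idea

Look not at the absolute reward but at the advantage relative to the "average level".

Introduce a baseline (the state-value function $V(s)$) and replace $R$ with $A = R - V$:

$$\nabla_\theta J(\theta) = \mathbb{E}\left[\sum_t \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot A(s_t, a_t)\right]$$

$A > 0$: this step did better than average, so push its probability up
$A < 0$: worse than average, so push it down
Because the baseline does not depend on the action, the gradient stays unbiased in expectation while its variance drops substantially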
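The effect of a baseline can be verified exactly on a tiny example. This sketch assumes a two-action policy parameterized by a single logit, so that the score function is `1 - p` for action 1 and `-p` for action 0; the reward values and the function name are hypothetical.

```python
def grad_stats(p, rewards, baseline):
    """Exact mean and variance of the single-sample policy-gradient
    estimate for a Bernoulli(p) policy with one logit parameter."""
    scores = [-p, 1.0 - p]   # d/d(logit) of log pi(a) for a = 0 and a = 1
    probs = [1.0 - p, p]
    samples = [s * (r - baseline) for s, r in zip(scores, rewards)]
    mean = sum(q * g for q, g in zip(probs, samples))
    var = sum(q * (g - mean) ** 2 for q, g in zip(probs, samples))
    return mean, var


# Action 0 earns reward 0, action 1 earns reward 10, each chosen half the time.
mean_raw, var_raw = grad_stats(0.5, [0.0, 10.0], baseline=0.0)
mean_base, var_base = grad_stats(0.5, [0.0, 10.0], baseline=5.0)
```

Both settings give the same expected gradient, but subtracting the mean reward removes the variance entirely in this toy case. Real problems will not reach zero variance, but the direction of the effect is the same.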

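3.2 Engineering

A critic network $V_\phi(s)$ has to be trained to estimate the baseline, and the actor and critic are trained together:

Actor:  π_θ(a|s)   ← policy
Critic: V_φ(s)     ← state value

The catch: the critic is typically as large as the actor, so training cost roughly doubles. In LLM RL this is a major expense (and it is exactly what GRPO later removes).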

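4. PPO: the "old king" of LLM RL

4.1 One-sentence intuition

Do not move too far in any single update, so clip the step; then add a KL penalty to pull a drifting policy back.

PPO (Proximal Policy Optimization, Schulman et al. 2017) was the de facto standard of the RLHF era and dominated LLM RL through about 2023.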

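4.2 Formula

Clipped surrogate objective:

$$\mathcal{L}^{\text{CLIP}}(\theta) = \mathbb{E}\left[\min\left(r_t(\theta) \hat{A}_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) \hat{A}_t\right)\right]$$

where $r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)}$ is the likelihood ratio between the new and old policies.

Intuition:

Ratio inside $[1-\epsilon, 1+\epsilon]$ → learn normally
Ratio outside that range in the direction the advantage favors (above $1+\epsilon$ when $\hat{A}_t > 0$, below $1-\epsilon$ when $\hat{A}_t < 0$) → the clipped term wins the min, the gradient vanishes, and a single update cannot take too large a step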
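The clipping behavior is easiest to see on single numbers. Below is a minimal sketch of the per-token term inside the expectation; the function name and the default $\epsilon = 0.2$ are assumptions chosen for illustration.

```python
def ppo_clip_term(ratio, advantage, eps=0.2):
    """Per-token clipped surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


# Positive advantage: gains are capped once the ratio exceeds 1 + eps.
capped_gain = ppo_clip_term(1.5, 1.0)
# Positive advantage, ratio already low: not clipped (min keeps the pessimistic value).
uncapped_low = ppo_clip_term(0.5, 1.0)
# Negative advantage: pushing the ratio below 1 - eps earns no extra credit.
capped_penalty = ppo_clip_term(0.5, -1.0)
# Negative advantage, ratio too high: the full penalty is kept.
full_penalty = ppo_clip_term(1.5, -1.0)
```

The `min` makes the objective a pessimistic bound: clipping only removes the incentive to move further in the favorable direction and never hides a mistake.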

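A KL penalty is usually added to keep the policy from drifting too far from the reference model. Since $\mathcal{L}^{\text{CLIP}}$ is an objective to maximize, the penalty enters with a minus sign:

$$\mathcal{L} = \mathcal{L}^{\text{CLIP}} - \beta \cdot D_{KL}(\pi_\theta || \pi_{\text{ref}})$$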

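4.3 The full RLHF pipeline (the InstructGPT recipe)

1. SFT: Base Model → SFT Model
2. RM training: fit a Reward Model R_φ on human preference data (a > b)
3. PPO (policy π_θ is initialized from the SFT Model, which also serves as π_ref):
   for each step:
     trajectories = rollout(π_θ on prompts)   ← sample from the current policy
     rewards = R_φ(trajectories) - β·KL(π_θ || π_ref)
     advantages = GAE(rewards, V_ψ)   ← a critic has to be trained too
     update π_θ via clipped PG
     update V_ψ via MSE(V_ψ, returns)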
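The GAE step in the loop above can be sketched in a few lines. This is a plain-Python version under stated assumptions: one finished episode, per-step rewards and critic values given as lists, and a value of 0 after the final step; the defaults for `gamma` and `lam` are illustrative.

```python
def gae(rewards, values, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation for one finished episode.

    rewards[t] and values[t] cover t = 0..T-1; the value after the
    last step is taken to be 0 because the episode has ended.
    """
    T = len(rewards)
    advantages = [0.0] * T
    running = 0.0
    for t in reversed(range(T)):
        next_value = values[t + 1] if t + 1 < T else 0.0
        delta = rewards[t] + gamma * next_value - values[t]  # TD error
        running = delta + gamma * lam * running
        advantages[t] = running
    # Regression targets for the critic.
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns


# Sparse reward: only the last token is rewarded.
adv_mc, ret_mc = gae([0.0, 0.0, 1.0], [0.5, 0.5, 0.5], gamma=1.0, lam=1.0)
adv_td, _ = gae([0.0, 0.0, 1.0], [0.0, 0.0, 0.0], gamma=1.0, lam=0.0)
```

With `lam=1.0` the final reward is propagated back to every token (Monte Carlo return minus value); with `lam=0.0` only the last step sees it, which shows how much sparse-reward credit assignment depends on this knob and on critic quality.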

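4.4 PPO's pain points

Pain point	Details
Expensive critic	The V network is about as large as the actor, roughly doubling memory and training time
Expensive reward model	Training an RM takes a large amount of human preference data
KL coefficient is hard to tune	$\beta$ too small → the policy drifts; too large → it barely learns
Long trajectories are hard	LLM sequences are long, so advantage estimates are noisy
Sparse reward needs workarounds	Credit assignment relies on GAE, reward shaping, and similar tricks

🍎 These pain points are exactly what GRPO sets out to address.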

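5. DPO: the no-RL, no-RM bypass

5.1 Idea

The key insight of Rafailov et al. (2023): under the Bradley-Terry preference model and the KL-regularized RLHF objective, the two steps "train an RM, then run PPO" collapse into a single loss that trains the policy directly on preference pairs, with no RL loop at all.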

5.2 Formula

$$\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(x, y_w, y_l)}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\right)\right]$$

Feed it (prompt, win, lose) triples and train with ordinary supervised-style optimization; $\pi_{\text{ref}}$ stays frozen and is only used to compute log-probabilities.
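The loss for a single triple follows directly from the formula. This sketch assumes the four sequence-level log-probabilities have already been computed (summed over tokens); the function name, argument names, and `beta=0.1` are illustrative choices.

```python
import math


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one (prompt, win, lose) triple.

    logp_* are log pi_theta(y | x); ref_logp_* are log pi_ref(y | x).
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)) written as softplus(-margin) for numerical stability.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


loss_neutral = dpo_loss(-12.0, -12.0, -12.0, -12.0)  # policy equals reference
loss_good = dpo_loss(-10.0, -20.0, -15.0, -15.0)     # policy favors the winner
loss_bad = dpo_loss(-20.0, -10.0, -15.0, -15.0)      # policy favors the loser
```

When the policy equals the reference the loss is $\log 2$; it falls as the policy shifts probability toward the preferred answer relative to the reference, and rises in the opposite case.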

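5.3 Pros and cons

✅ No RM and no rollouts; training is about as fast as SFT
✅ Highly reproducible (no noise from random rollouts)
❌ Works on preference pairs only; a right/wrong verifier signal cannot be used directly unless it is first converted into pairs
❌ Overfits easily (likelihood displacement also shows up in DPO)
❌ Poor fit for multi-turn / tool use (there is no notion of a trajectory)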

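5.4 Where it fits

Style adjustment (tone, wording, format)
Safety alignment (avoiding harmful outputs)
Settings that already have plenty of human-labeled preference pairs

DPO led the field for a stretch in 2024, but once RLVR and agentic RL took off in 2025 it gradually settled back into the role of an alignment tool; mainline reasoning and agent training still runs on GRPO.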

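6. GRPO: DeepSeek's secret weapon

6.1 One-sentence intuition

Do not train a critic to estimate the baseline; compute relative advantages directly among multiple samples of the same prompt.

DeepSeekMath (2024-02) introduced Group Relative Policy Optimization, DeepSeek R1 brought it to center stage for reasoning, and by 2025 it had become the de facto standard.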

6.2 Algorithm

eps_std = 1e-6   # guards against σ = 0 when every sample in a group gets the same reward

for each prompt p:
    trajectories = [rollout(p) for _ in range(G)]   # G is typically 8-64
    rewards = [verify(t) for t in trajectories]     # scored by a verifier (not an RM!)

    # Group-relative advantage (replaces the critic)
    μ = mean(rewards)
    σ = std(rewards)
    advantages = [(r - μ) / (σ + eps_std) for r in rewards]

    # PPO-like clipped policy gradient; every token in a trajectory shares its advantage
    loss = 0
    for trajectory, advantage in zip(trajectories, advantages):
        for token in trajectory:
            ratio = π_θ(token) / π_old(token)
            loss += -min(ratio * advantage, clip(ratio, 1-ε, 1+ε) * advantage)
        loss += β * KL(π_θ || π_ref)   # the KL term is still there to limit drift

    backward + optimizer.step()
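The group-relative advantage at the heart of the algorithm is only a few lines of runnable code. This sketch uses the population standard deviation and a small `eps` in the denominator; both choices, and the function name, are assumptions made here, and real implementations may differ on them.

```python
import statistics


def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group: (r - mean) / (std + eps)."""
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards)  # population std (an assumption)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Mixed outcomes: correct samples get positive advantage, wrong ones negative.
adv_mixed = group_advantages([1.0, 0.0, 0.0, 1.0])
# Every sample correct: all advantages are zero, so the group gives no gradient.
adv_flat = group_advantages([1.0, 1.0, 1.0, 1.0])
```

The second call shows why the `eps` guard matters and why a group with identical rewards contributes nothing to learning.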

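6.3 Why GRPO could replace PPO

Dimension	PPO	GRPO	Savings
Critic network	Must be trained (same size as the actor)	Not needed (baseline computed within the group)	Roughly 50% of trained-model memory
Reward model	Needs an RM	Uses a verifier directly (on verifiable tasks)	No RM to train
Advantage estimation on long trajectories	GAE is noisy	Within-group statistics are stable	More stable training
Implementation complexity	High (four models: actor + critic + RM + ref)	Low (actor + ref only)	Much less engineering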

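6.4 GRPO's hard constraints

A verifier must exist (correct math answer, passing unit tests, successful tool call)
Each prompt needs G samples (rollout cost × G)
Reward variance within a group must not be too small (if all G rewards are equal, every advantage is 0 and there is no signal; the next chapter covers advantage collapse in detail)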

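6.5 The equivalence insight

GRPO's group baseline plays the role of a special critic: it estimates "the expected reward for this prompt under the current policy", which is the value of the initial state, and applies it to every token of the response. So GRPO is not really "no critic"; it is **"the group mean as critic"**.

This baseline is noisy (a Monte Carlo estimate from only G samples) and gives no per-token credit assignment, but it is almost free to compute. That is GRPO's core trade-off.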

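7. DAPO: industrial-strength fixes for GRPO

Proposed in 2025 by the ByteDance Seed team, DAPO is a systematic set of fixes for problems that show up when GRPO is run at production scale.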

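7.1 What it changes

Problem	DAPO fix
Entropy collapse (details in the next chapter)	Clip-Higher: replace PPO's symmetric clip $[1-\epsilon, 1+\epsilon]$ with an asymmetric $[1-\epsilon_{\text{low}}, 1+\epsilon_{\text{high}}]$ where $\epsilon_{\text{high}} > \epsilon_{\text{low}}$, leaving the ratio more room to grow so low-probability tokens can be "explored"
Length bias in the loss	Token-level Loss: average the loss over all tokens in the batch instead of per sequence first, so tokens in long responses are not down-weighted
Groups with no learning signal	Dynamic Sampling: skip groups whose rewards are all 0 or all 1 (zero advantage) and keep sampling until the batch is filled with informative groups
Noisy reward on over-long outputs	Overlong Reward Shaping: over-long outputs receive a graduated negative reward instead of an abrupt one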
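Three of the fixes in the table can be sketched as small pure functions. These are illustrative sketches, not the DAPO reference implementation: the function names are invented here, and the default values `eps_low=0.2` and `eps_high=0.28` should be treated as assumed example settings.

```python
def clip_higher_term(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """Clipped surrogate with decoupled lower and upper clip ranges."""
    clipped_ratio = max(1.0 - eps_low, min(1.0 + eps_high, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


def has_signal(group_rewards):
    """Dynamic Sampling filter: keep a group only if its rewards differ."""
    return len(set(group_rewards)) > 1


def sample_level_mean(per_token_losses):
    """Average within each sequence first, then across sequences."""
    per_seq = [sum(seq) / len(seq) for seq in per_token_losses]
    return sum(per_seq) / len(per_seq)


def token_level_mean(per_token_losses):
    """Token-level Loss: average over every token in the batch."""
    total = sum(sum(seq) for seq in per_token_losses)
    count = sum(len(seq) for seq in per_token_losses)
    return total / count


# One short sequence (2 tokens, loss 1.0 each) and one long sequence (8 tokens, loss 0.0 each).
losses = [[1.0, 1.0], [0.0] * 8]
seq_avg = sample_level_mean(losses)
tok_avg = token_level_mean(losses)
```

In the example batch the sample-level average gives each of the two short-sequence tokens four times the weight of a long-sequence token, while the token-level average weights all ten tokens equally.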

7.2 Reported results

The DAPO paper reports that Qwen2.5-32B base trained with DAPO scores 50 points on AIME 2024, versus 47 for DeepSeek-R1-Zero-Qwen-32B, while using 50% of the training steps.

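7.3 When to choose DAPO over plain GRPO

Training on long trajectories (>4K tokens)
Optimizing several rewards at once (format + answer + cost)
Already suffering from entropy collapse
Trying to reproduce SOTA reasoning models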

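8. PPO vs GRPO vs DPO comparison table

Dimension	PPO	GRPO	DPO
Online rollouts	✅	✅	❌ (fully offline)
Needs an RM	✅	❌ (uses a verifier)	❌
Needs a critic	✅	❌	❌
Data format	(prompt, RM score)	(prompt, verifier)	(prompt, win, lose)
Samples per prompt	1	G (8-64)	1 pair
Memory	High (actor + critic + RM + ref)	Medium (actor + ref, plus rollout generation)	Low (actor + ref, no generation)
Training speed	Slow	Medium	Fast (SFT-like)
Best-suited tasks	General alignment	Math / code / tools (verifiable)	Preference alignment / style / safety
Main era and use	2017-2023 RLHF	2024-2025 reasoning / agent	2023-2024 alignment
Representative models	InstructGPT, ChatGPT	DeepSeek R1, Qwen QwQ	Zephyr, Tulu

🌟 The 2025 practical stack is a pipeline: SFT (cold start) → GRPO (reasoning + agent) → DPO (alignment polish).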

References

PPO (Schulman et al., 2017): arXiv 1707.06347
DPO (Rafailov et al., 2023): arXiv 2305.18290
DeepSeekMath / GRPO (2024): arXiv 2402.03300
DeepSeek R1 (2025): arXiv 2501.12948
DAPO (ByteDance Seed, 2025): paper

Introductory blogs

Beyond PPO: New Wave of Policy Optimization (yadnyesh): blog post
An In-depth Walkthrough of GRPO (NVIDIA NeMo-RL): docs

Framework docs (hands-on)

TRL DPO/GRPO trainers: github.com/huggingface/trl
OpenRLHF GRPO: github.com/openrlhf/openrlhf
verl GRPO recipes: verl.readthedocs.io