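Chapter 4: Verifier-based RL and Reward Design

GRPO removes the critic, but the reward signal is still the heart of RL, and a verifier is brittle and hackable in its own ways, sometimes more so than an RM. This chapter covers RLVR in depth: why verifiers displaced RMs in 2024-2025, the reward design philosophy for different tasks (math / code / tool use / format), the trade-offs between PRM and ORM, the new forms reward hacking takes in the RLVR era, and a set of reward templates you can copy directly.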

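📑 Contents

1. From RLHF to RLVR: a shift in the reward paradigm
2. The heart of RLVR: verifier design
3. Rewards for math tasks
4. Rewards for code tasks
5. Rewards for tool calling
6. Multi-objective reward engineering
7. PRM vs ORM: process vs outcome
8. Defending against reward hacking
Self-check checklist
References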

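1. From RLHF to RLVR: a shift in the reward paradigm

1.1 Three generations of reward signals

Era	Reward source	Drawbacks
RLHF (2022-2023)	A trained RM (reward model) fitted to human preferences	Expensive, slow, noisy, easy to hack
RLAIF (2023-2024)	An LLM acting as judge assigns the reward	Depends on the judge's stability
RLVR (2024-)	Verifiable rules: the math answer is correct, the unit tests pass, the tool call succeeds	Narrow applicability (a ground truth is required)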

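1.2 Why RLVR took off

RLHF era:      RM ≈ 70% accuracy              ← noisy signal
              + RM training is expensive      ← data cost
              + easy to reward-hack           ← the RM itself has bugs
              ↓
              "Sycophancy / length / format" hacks run rampant

RLVR era:      verifier ≈ 99% accuracy        ← clean signal
              + almost free                   ← rules are just code
              + hacks still exist but rarer   ← if the verifier is well written
              ↓
              Reasoning models such as R1 / o1 emerge

The accuracy figures above are rough orders of magnitude, not measurements.

🌟 Key insight: the cleanliness of the reward signal sets the ceiling on how far RL training can scale. An RM's ceiling is fixed by its own accuracy, while a well-written verifier breaks through it. This is a root cause of the 2024-2025 RL boom.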

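1.3 Where each approach applies

Task	RLHF	RLVR
Conversational style	✅	❌ (no notion of "right or wrong")
Math problems	❌ (RM is expensive)	✅ (exact match on the answer)
Code	❌ (an RM struggles to learn code evaluation)	✅ (unit tests)
Tool calling	❌	✅ (verify the call result)
Long-form writing	✅	❌
Safety	✅	Partial (rules can block some cases)

🍎 Modern practice: a two-track setup of RLVR (reasoning) + RLHF/DPO (alignment). RLVR does the bulk of the work, and RLHF/DPO handles the final polish.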

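2. The heart of RLVR: verifier design

2.1 Three core principles

① Strict & Unambiguous

A verifier must be deterministic: the same input always produces the same output.

# ❌ Not strict: an LLM as judge
reward = llm.judge("Is the answer correct?")  # result varies with temperature and prompt

# ✅ Strict: exact match
reward = float(extract_answer(response) == gold_answer)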

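② Local & Cheap

A reward is computed for every trajectory, so it cannot be expensive:

# ❌ Expensive: spin up a Docker container for every run
reward = run_in_sandbox(code, tests)  # ~30 s per call

# ✅ Cheap: static lint + run a subset of the unit tests
reward = quick_check(code) * 0.5 + run_unit_tests(code, n=3) * 0.5

The cheap path still executes model-written code, so it needs isolation as well. The saving comes from reusing a lightweight sandbox and running fewer tests, not from skipping the sandbox.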

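③ Hack-resistant

A verifier must not be maxed out by one simple trick:

# ❌ Easy to hack: only looks for the number near the end
reward = "42" in response[-50:]  # the model learns to stuff "42" at the end

# ✅ Hack-resistant: check both format and content
# (written with `and`, because `a * b == c` parses as `(a * b) == c`)
reward = float(
    has_correct_format(response)
    and extract_boxed_answer(response) == gold_answer
)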
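To make the difference concrete, here is a minimal sketch that runs a naive and a stricter verifier against a degenerate response. The rule "exactly one \boxed{...}" is an assumption chosen for illustration, not a rule from this chapter, and the sample responses are made up.

```python
import re

def naive_reward(response: str, gold: str) -> float:
    """Hackable: only checks whether the gold string appears near the end."""
    return float(gold in response[-50:])

def strict_reward(response: str, gold: str) -> float:
    """Requires exactly one \\boxed{...} (assumed rule) and an exact content match."""
    boxed = re.findall(r"\\boxed\{([^{}]*)\}", response)
    if len(boxed) != 1:          # zero or several boxed answers: reject
        return 0.0
    return float(boxed[0].strip() == gold)

# A response that lists candidates without committing to one.
hack = "The answer could be 40, 41, 42, 43 or 44."
honest = "2 * 21 = 42, so the answer is \\boxed{42}."
shotgun = "\\boxed{41} \\boxed{42} \\boxed{43}"

print("hack   :", naive_reward(hack, "42"), strict_reward(hack, "42"))
print("honest :", naive_reward(honest, "42"), strict_reward(honest, "42"))
print("shotgun:", strict_reward(shotgun, "42"))
```

The naive check rewards the candidate list, while the strict check rewards only the committed answer and also rejects the "box every guess" variant.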

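2.2 A five-step process for designing a verifier

Step 1: List every form a "right / wrong" answer can take
   e.g. a math answer may be a number / an expression / multiple solutions / an interval

Step 2: Write a normalize function
   Map all forms to one canonical form (strip whitespace / convert LaTeX / unify units)

Step 3: Write a strict comparison function
   exact match / compare after sympy simplification / compare modulo a large number

Step 4: Write hack-resistant "validity" checks
   The answer must come after the reasoning block ([thought]); it must not simply be copied from the prompt

Step 5: Validate on 100-1000 samples
   Label by hand → measure the verifier's accuracy → iterate

🍎 Rule of thumb: the verifier itself should be 99%+ accurate. Otherwise RL training is like measuring height with a faulty ruler: the more you train, the further off you get.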
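Steps 2, 3 and 5 above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the normalization rules and the four labeled samples are made up for the example, and a real verifier would need many more rules and far more samples.

```python
import re

def normalize(ans: str) -> str:
    """Step 2: map surface variants to one canonical string."""
    ans = ans.strip().lower()
    ans = ans.replace("$", "").replace("\\left", "").replace("\\right", "")
    ans = re.sub(r"\s+", "", ans)          # drop all whitespace
    return ans.rstrip(".")                 # drop a trailing period

def verify(pred: str, gold: str) -> bool:
    """Step 3: strict comparison on normalized strings."""
    return normalize(pred) == normalize(gold)

def verifier_accuracy(samples: list[tuple[str, str, bool]]) -> float:
    """Step 5: agreement rate between the verifier and human labels.

    Each sample is (prediction, gold, human_says_correct).
    """
    if not samples:
        raise ValueError("need at least one labeled sample")
    agree = sum(verify(p, g) == label for p, g, label in samples)
    return agree / len(samples)

labeled = [
    ("$ 3 $", "3", True),
    ("x = 3", "3", True),      # the verifier misses this: a gap in normalize to fix
    ("4", "3", False),
    ("\\left(1, 2\\right)", "(1,2)", True),
]
print(f"verifier accuracy: {verifier_accuracy(labeled):.2f}")  # 0.75
```

Each disagreement with the human label points at a missing normalization rule (here, stripping a leading "x ="), which is what the "iterate" in Step 5 refers to.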

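3. Rewards for math tasks

3.1 The simple case: numeric answers

import sympy

def extract_boxed_answer(response: str) -> str | None:
    """Return the content of the last \\boxed{...}, allowing nested braces."""
    start = response.rfind("\\boxed{")
    if start == -1:
        return None
    i = start + len("\\boxed{")
    depth, chars = 1, []
    while i < len(response):
        ch = response[i]
        depth += (ch == "{") - (ch == "}")
        if depth == 0:
            return "".join(chars).strip() or None
        chars.append(ch)
        i += 1
    return None   # unbalanced braces

def math_reward(response: str, gold: str) -> float:
    answer = extract_boxed_answer(response)
    if answer is None:
        return 0.0   # answer not in the required format
    try:
        # Compare after sympy simplification (handles 1/2 vs 0.5).
        # Inputs sympify cannot parse, such as "50%" or raw LaTeX, raise
        # and fall back to the string comparison below.
        # Note: sympify evaluates its input, so run it on untrusted model
        # output only inside a sandbox.
        ans_expr = sympy.sympify(answer)
        gold_expr = sympy.sympify(gold)
        return float(sympy.simplify(ans_expr - gold_expr) == 0)
    except Exception:
        # Fall back to exact string comparison
        return float(answer.strip() == gold.strip())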

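3.2 Advanced: multiple equivalent forms

def math_reward_v2(response, gold):
    answer = extract_boxed_answer(response)
    if not answer:
        return 0.0

    # Try several normalizations; any match counts.
    # symbolic_equal(a, b) -> bool is a helper assumed to be defined elsewhere.
    candidates = [
        answer,
        answer.replace(" ", ""),
        answer.lower(),
        # Strip the LaTeX backslash from common commands
        answer.replace("\\frac", "frac").replace("\\sqrt", "sqrt"),
    ]
    return float(any(symbolic_equal(c, gold) for c in candidates))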

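3.3 Adding a format reward (but watch for hacks)

def math_reward_with_format(response, gold):
    base_reward = math_reward(response, gold)

    # Format bonus (careful: format must not carry too much weight)
    format_bonus = 0.0
    if "<think>" in response and "</think>" in response:
        format_bonus += 0.05
    if "\\boxed{" in response:
        format_bonus += 0.05

    # Total reward is capped at 1.1; format contributes at most 0.1.
    # Note that a wrong answer in the right format still earns up to 0.1.
    return min(base_reward + format_bonus, 1.1)

🌟 In DeepSeek R1's training, the rewards for reasoning tasks are rule-based: an accuracy reward (is the boxed answer correct) plus a simple format reward that requires the reasoning to sit between <think> and </think> tags. No neural reward model is used at that stage. The lesson for format rewards is to keep them small and simple, because a heavily weighted format reward pushes the model toward tag spam and lowers actual reasoning quality.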

4. Rewards for code tasks

4.1 A complete design

import subprocess
import tempfile

# extract_code, run_with_timeout, has_dangerous_imports and run_in_sandbox
# are helpers assumed to be defined elsewhere.

def code_reward(response: str, problem: dict) -> float:
    code = extract_code(response)   # extract the ```python ... ``` block
    if not code:
        return 0.0

    rewards = []   # entries are (name, score in [0, 1], weight)

    # 1. Static: is it syntactically valid?
    try:
        compile(code, "<string>", "exec")
        rewards.append(("syntax", 1.0, 0.1))   # pass → contributes 0.1
    except SyntaxError:
        rewards.append(("syntax", 0.0, 0.1))

    # 2. Unit tests: run them
    pass_rate = run_tests(code, problem["tests"])
    rewards.append(("tests", pass_rate, 0.7))   # pass rate → up to 0.7

    # 3. Time: must not be too slow
    duration = run_with_timeout(code, problem["tests"], timeout=10)
    if duration > 5:
        rewards.append(("speed", 0.5, 0.1))
    else:
        rewards.append(("speed", 1.0, 0.1))

    # 4. Safety: must not call dangerous APIs
    if has_dangerous_imports(code):
        rewards.append(("safety", 0.0, 0.1))
    else:
        rewards.append(("safety", 1.0, 0.1))

    # Weighted sum; the weights add up to 1.0
    return sum(r * w for _, r, w in rewards)

def run_tests(code, tests):
    if not tests:
        return 0.0   # nothing to verify; also avoids division by zero
    passed = 0
    for test in tests:
        result = run_in_sandbox(code + "\n" + test)
        passed += 1 if result.passed else 0
    return passed / len(tests)
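`run_in_sandbox` and `run_with_timeout` are left undefined above. The sketch below shows one possible shape for them, assuming each test is a snippet of Python appended to the candidate code: it runs every test in a fresh interpreter with a timeout and returns pass rate and wall time from a single pass. The function names are hypothetical, and a subprocess is not a security boundary, so this is suitable only for local experiments.

```python
import subprocess
import sys
import tempfile
import time
from pathlib import Path

def run_test_subprocess(code: str, test: str, timeout: float = 10.0) -> tuple[bool, float]:
    """Run `code` followed by `test` in a fresh interpreter.

    Returns (passed, seconds). NOT a sandbox: use a container or nsjail
    for untrusted model output.
    """
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(code + "\n" + test, encoding="utf-8")
        start = time.monotonic()
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                cwd=tmp, capture_output=True, timeout=timeout,
            )
            passed = proc.returncode == 0   # a failed assert exits non-zero
        except subprocess.TimeoutExpired:
            passed = False                  # a timeout counts as a failure
        return passed, time.monotonic() - start

def pass_rate_and_time(code: str, tests: list[str], timeout: float = 10.0) -> tuple[float, float]:
    """Pass rate and total wall time, measured in one pass over the tests."""
    if not tests:
        return 0.0, 0.0
    results = [run_test_subprocess(code, t, timeout) for t in tests]
    passed = sum(ok for ok, _ in results)
    return passed / len(tests), sum(sec for _, sec in results)

solution = "def add(a, b):\n    return a + b\n"
tests = ["assert add(1, 2) == 3", "assert add(-1, 1) == 0", "assert add(2, 2) == 5"]
rate, seconds = pass_rate_and_time(solution, tests)
print(f"pass rate = {rate:.2f}")
```

Measuring time in the same pass as the tests avoids executing every test twice, once for the pass rate and once for the duration.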

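4.2 A sandbox is mandatory

Code must be executed in a sandbox (docker / firejail / nsjail); otherwise the day the model emits "rm -rf /" will be a painful one. Production setups typically use cloud sandbox services such as e2b, modal, or Daytona.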
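The static safety check used in 4.1 (`has_dangerous_imports`) can be sketched with the standard `ast` module. The denylists below are hypothetical and should be tuned to your own threat model. A static filter like this is easy to bypass (for example through `importlib` or `getattr`), so treat it as a cheap pre-filter that complements the sandbox rather than replaces it.

```python
import ast

# Hypothetical denylists for illustration only.
DANGEROUS_MODULES = {"os", "subprocess", "shutil", "socket", "ctypes"}
DANGEROUS_CALLS = {"eval", "exec", "__import__"}

def has_dangerous_imports(code: str) -> bool:
    """Flag imports of denylisted modules and direct calls to risky builtins."""
    try:
        tree = ast.parse(code)
    except SyntaxError:
        return False   # not runnable anyway; the syntax check covers this case
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            if any(alias.name.split(".")[0] in DANGEROUS_MODULES for alias in node.names):
                return True
        elif isinstance(node, ast.ImportFrom):
            if (node.module or "").split(".")[0] in DANGEROUS_MODULES:
                return True
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id in DANGEROUS_CALLS:
                return True
    return False

print(has_dangerous_imports("import os\nos.system('rm -rf /')"))   # True
print(has_dangerous_imports("import math\nprint(math.pi)"))        # False
```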

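4.3 The LiveCodeBench / SWE-bench paradigm

Benchmark	Verifier form
LiveCodeBench	Function-level unit tests
HumanEval	Function-level unit tests
SWE-bench	Repository-level: apply patch → run the repo's tests
MBPP	Function-level unit tests

A repository-level verifier (SWE-bench) is on the order of 100x more expensive than a function-level one: a single patch may need 5-30 minutes of tests. That is why SWE-style RL training is especially slow and expensive.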

5. Rewards for tool calling

5.1 A complete template (search tool example)

def search_agent_reward(trajectory, problem) -> float:
    final_answer = extract_final_answer(trajectory)

    rewards = {}

    # 1. Answer correctness (the main term); a missing answer scores 0
    rewards["correctness"] = float(
        final_answer is not None and final_answer.strip() == problem["gold"]
    )

    # 2. Tool format (share of calls that are well-formed)
    n_calls = count_tool_calls(trajectory)
    n_valid_calls = count_valid_tool_calls(trajectory)
    rewards["format"] = n_valid_calls / max(n_calls, 1)

    # 3. Call success rate (the tool did not return an error)
    n_success = count_successful_calls(trajectory)
    rewards["call_success"] = n_success / max(n_calls, 1)

    # 4. Cost penalty (penalize the number of calls to encourage efficiency)
    if n_calls <= 3:
        rewards["cost"] = 1.0
    elif n_calls <= 6:
        rewards["cost"] = 0.7
    elif n_calls <= 10:
        rewards["cost"] = 0.3
    else:
        rewards["cost"] = 0.0

    # 5. Length penalty (overlong output)
    n_tokens = len(trajectory.tokens)
    if n_tokens > 8000:
        rewards["length"] = 0.5
    else:
        rewards["length"] = 1.0

    # Weighted sum; the weights add up to 1.0
    weights = {"correctness": 0.7, "format": 0.05, "call_success": 0.1, "cost": 0.1, "length": 0.05}
    return sum(rewards[k] * w for k, w in weights.items())

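5.2 The reward in the Search-R1 paper

Search-R1 (arXiv 2503.09516) keeps the reward minimal. A compact form in that spirit:

r = correctness  +  α * format_validity

It looks only at the answer plus the format and does not directly reward the number of calls, which avoids search spam. The weight α is very small (~0.1). Note that the paper itself describes a rule-based outcome reward (exact match on the final answer) without a separate format reward, so treat the format term here as an optional extension.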
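A minimal sketch of the formula above. The tag names (`<think>`, `<search>`, `<answer>`), the definition of `format_validity`, and the QA-style answer normalization are all assumptions made for this example rather than details confirmed by the source.

```python
import re
import string

def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(pred: str, gold: str) -> float:
    return float(normalize_answer(pred) == normalize_answer(gold))

ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)

def format_validity(trajectory: str) -> float:
    """1.0 if all tags are balanced and exactly one <answer> block exists."""
    balanced = all(
        trajectory.count(f"<{tag}>") == trajectory.count(f"</{tag}>")
        for tag in ("think", "search", "answer")
    )
    return float(balanced and len(ANSWER_RE.findall(trajectory)) == 1)

def simple_search_reward(trajectory: str, gold: str, alpha: float = 0.1) -> float:
    """r = correctness + alpha * format_validity."""
    answers = ANSWER_RE.findall(trajectory)
    correctness = exact_match(answers[-1], gold) if answers else 0.0
    return correctness + alpha * format_validity(trajectory)

traj = (
    "<think>I need the capital.</think>"
    "<search>capital of France</search>"
    "<answer>The Paris.</answer>"
)
print(simple_search_reward(traj, "Paris"))
```

Note that the number of `<search>` calls never enters the reward: a trajectory with one search and one with five score the same as long as the answer and the format are right.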

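5.3 Emergent capabilities in ToolRL

The ToolRL paper shows that with only correctness + format rewards, the model learns on its own to:

Switch tools when a call fails (self-correction)
Cut unnecessary calls (efficiency)
Combine multiple tools (composition)

🍎 Key takeaway: a simple reward makes it easier for capabilities to emerge than a complex one. Multiple objectives easily conflict with each other and dilute the signal.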

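6. Multi-objective reward engineering

6.1 Three ways to combine rewards

Method	Formula	Pros and cons
Linear weighted	$r = \sum_i w_i r_i$	Simple, but the weights are hard to tune and the signal is easily diluted
Multiplicative	$r = \prod_i r_i$	Any dimension at 0 → total 0; strong signal but strict
Curriculum	A new reward dimension is switched on only at stage N	Gradual; suits long training runs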
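The first two rows of the table can be compared directly on a pair of toy trajectories. The component names, weights, and scores below are made up for illustration.

```python
import math

def linear_reward(scores: dict[str, float], weights: dict[str, float]) -> float:
    """r = sum_i w_i * r_i"""
    return sum(weights[k] * scores[k] for k in weights)

def multiplicative_reward(scores: dict[str, float]) -> float:
    """r = prod_i r_i; any zero component zeroes the total."""
    return math.prod(scores.values())

weights = {"correctness": 0.7, "format": 0.1, "cost": 0.2}

wrong_but_tidy = {"correctness": 0.0, "format": 1.0, "cost": 1.0}
right_but_costly = {"correctness": 1.0, "format": 1.0, "cost": 0.3}

for name, s in [("wrong_but_tidy", wrong_but_tidy), ("right_but_costly", right_but_costly)]:
    print(name, round(linear_reward(s, weights), 2), round(multiplicative_reward(s), 2))
# wrong_but_tidy 0.3 0.0
# right_but_costly 0.86 0.3
```

The linear form still pays 0.3 for a wrong answer (dilution), while the multiplicative form pays nothing for it but also cuts a correct, expensive answer down to 0.3 (strictness).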

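6.2 Rules of thumb

Guideline	Why
≤ 3 reward dimensions	Cleanest signal; recommended
Main reward carries ≥ 60% of the weight	Keeps secondary rewards from taking over
Format reward on its own < 10%	Avoids format gaming
Add the cost penalty only late in training	Adding it early suppresses exploration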

6.3 Curriculum example

Phase 1 (step 0-1000):     reward = correctness                                # learn right from wrong first
Phase 2 (step 1000-3000):  reward = correctness + 0.1 * format                  # add format
Phase 3 (step 3000+):      reward = correctness + 0.1 * format + 0.1 * cost     # add cost last
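The schedule above can be written as a function of the training step. One assumption is made explicit here: the phase boundaries are treated as half-open intervals, so step 1000 already belongs to Phase 2 and step 3000 to Phase 3.

```python
def curriculum_reward(step: int, correctness: float, fmt: float, cost: float) -> float:
    """Switch reward terms on as training progresses (boundaries are assumed)."""
    reward = correctness              # Phase 1: correctness only
    if step >= 1000:
        reward += 0.1 * fmt           # Phase 2: add format
    if step >= 3000:
        reward += 0.1 * cost          # Phase 3: add cost
    return reward

for step in (0, 999, 1000, 3000):
    print(step, round(curriculum_reward(step, correctness=1.0, fmt=1.0, cost=1.0), 2))
```

Because the maximum reward grows from 1.0 to 1.2 across phases, reward curves from different phases are not directly comparable in monitoring dashboards.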

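7. PRM vs ORM: process vs outcome

7.1 Definitions

Type	When the reward is given
Outcome Reward Model (ORM)	Scores only the final result
Process Reward Model (PRM)	Scores every intermediate step

7.2 A math example

Question: solve the equation 2x + 4 = 10

Model answer:
  Step 1: move terms → 2x = 10 - 4 = 6     ← PRM gives 1.0
  Step 2: divide by 2 → x = 3              ← PRM gives 1.0
  Final answer: x = 3                      ← ORM gives 1.0

With a wrong intermediate step:

Step 1: move terms → 2x = 10 + 4 = 14      ← PRM gives 0.0 (wrong!)
Step 2: divide by 2 → x = 7                ← PRM gives 0.0 (the division is valid, but it builds on the wrong step 1)
Final: x = 7                               ← ORM gives 0.0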

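7.3 Trade-offs

Dimension	ORM	PRM
Training cost	Low (verify only the end)	High (verify every step)
Labeling cost	Low	Very high (every step is labeled)
Resistance to reward hacking	Weak (sees only the result)	Strong (a wrong intermediate step is caught), though a learned PRM can itself be hacked
Credit assignment in long reasoning chains	Poor	Good
Best suited to	Simple math, QA	Complex multi-step reasoning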

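7.4 State of the industry

DeepSeek R1: rule-based outcome rewards for reasoning, with no PRM; simple and effective
OpenAI o1 (speculation): may include a PRM stage
Math-Shepherd / OmegaPRM: generate PRM labels automatically from trajectories (cuts labeling cost)

🌟 Trend: automatic PRM. Let the model / MCTS generate correctness labels for intermediate steps, train a PRM on those labels, then use the PRM to score the main policy. This sidesteps the human-labeling bottleneck.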
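The automatic-labeling idea can be sketched as Monte Carlo estimation, in the spirit of Math-Shepherd: for each step prefix, sample several completions and use the fraction that reach the gold answer as that step's label. The `rollout` callable stands in for sampling from the policy, and the toy rollout below is a deterministic fake used only to make the example runnable.

```python
from typing import Callable

def mc_step_labels(
    steps: list[str],
    gold: str,
    rollout: Callable[[list[str]], str],
    n_rollouts: int = 8,
) -> list[float]:
    """Label step k with the share of completions from steps[:k+1] that reach `gold`."""
    labels = []
    for k in range(len(steps)):
        prefix = steps[: k + 1]
        hits = sum(rollout(prefix) == gold for _ in range(n_rollouts))
        labels.append(hits / n_rollouts)
    return labels

def toy_rollout(prefix: list[str]) -> str:
    """Fake policy: once the prefix contains the wrong value 14, it always ends at 7."""
    return "7" if any("14" in step for step in prefix) else "3"

good_steps = ["2x = 10 - 4 = 6", "x = 3"]
bad_steps = ["2x = 10 + 4 = 14", "x = 7"]
print(mc_step_labels(good_steps, "3", toy_rollout))   # [1.0, 1.0]
print(mc_step_labels(bad_steps, "3", toy_rollout))    # [0.0, 0.0]
```

The resulting (prefix, label) pairs are the training data for the PRM. The cost moves from human labeling to inference: each step needs `n_rollouts` completions.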

8. Defending against reward hacking

Chapter 3 already covered the phenomenon; here is a concrete defensive toolbox:

8.1 Tool 1: voting across multiple verifiers

def robust_reward(response, problem):
    verifiers = [
        exact_match_verifier,
        sympy_verifier,
        llm_judge_verifier,
    ]
    votes = [v(response, problem) for v in verifiers]
    # Give 1 only if at least 2 verifiers agree
    return float(sum(1 for vote in votes if vote > 0.5) >= 2)

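8.2 Tool 2: KL hard cap

No matter how much the reward rises, once the policy drifts too far from the reference, stop training or reset:

# Illustrative config keys, not tied to a specific framework
kl_hard_cap: 15.0
on_kl_exceed: reset_to_ref   # or reduce_lr / stop_training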
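A minimal sketch of how such a cap might be checked in a training loop. Several things here are assumptions: the KL is estimated per sampled sequence as the sum of log-probability differences (in nats), the cap is compared against the batch mean, and the extra `soft_cap` tier is hypothetical and not part of the config above.

```python
from enum import Enum

class KLAction(Enum):
    CONTINUE = "continue"
    REDUCE_LR = "reduce_lr"
    RESET_TO_REF = "reset_to_ref"

def sequence_kl(policy_logprobs: list[float], ref_logprobs: list[float]) -> float:
    """Sample estimate of KL(policy || ref) on one sampled sequence:
    sum over tokens of (log pi(token) - log pi_ref(token))."""
    if len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("log-prob lists must align token by token")
    return sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))

def check_kl(batch_kls: list[float], hard_cap: float = 15.0, soft_cap: float = 10.0) -> KLAction:
    """Compare the batch-mean KL against the caps and pick an action."""
    if not batch_kls:
        return KLAction.CONTINUE
    mean_kl = sum(batch_kls) / len(batch_kls)
    if mean_kl > hard_cap:
        return KLAction.RESET_TO_REF
    if mean_kl > soft_cap:
        return KLAction.REDUCE_LR
    return KLAction.CONTINUE

# 40 tokens, each 0.5 nats more likely under the policy than under the reference
drifted = sequence_kl([-0.1] * 40, [-0.6] * 40)
print(round(drifted, 2), check_kl([drifted]))
```

Whether 15.0 is a sensible threshold depends on whether KL is summed per sequence or averaged per token, so fix that convention before choosing a cap.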

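8.3 Tool 3: human review of sampled high-reward trajectories

Every day or every week, pull 100 of the highest-reward trajectories and read them by hand, looking specifically for hacks. This is the cheapest defense and one of the most effective.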
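One way to draw that review sample, as a sketch: take the top slice by reward and sample randomly within it, so reviewers do not see only the single most extreme cluster. The `top_fraction` parameter and the dict layout with a `"reward"` key are assumptions for this example.

```python
import heapq
import random

def sample_for_review(
    trajectories: list[dict],
    k: int = 100,
    top_fraction: float = 0.05,
    seed: int = 0,
) -> list[dict]:
    """Randomly pick k trajectories from the top `top_fraction` by reward."""
    if not trajectories:
        return []
    pool_size = max(k, int(len(trajectories) * top_fraction))
    pool = heapq.nlargest(pool_size, trajectories, key=lambda t: t["reward"])
    rng = random.Random(seed)            # fixed seed: the sample is reproducible
    return rng.sample(pool, min(k, len(pool)))

# 1000 fake trajectories with rewards 0.000 ... 0.999
trajs = [{"id": i, "reward": i / 1000} for i in range(1000)]
picked = sample_for_review(trajs, k=10)
print(sorted(t["id"] for t in picked))
```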

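8.4 Tool 4: adversarial verifier

Write a dedicated verifier that hunts for patterns that "look right but are actually hacks":

import re

# Compile every entry up front so each one supports .search()
ADVERSARIAL_PATTERNS = [
    re.compile(r"the answer is obviously \d+", re.IGNORECASE),      # bare assertion, no reasoning
    re.compile(r"based on the above, .{0,20}\d+", re.IGNORECASE),   # templated answer
    re.compile(r"<think>\s*</think>\s*\\boxed"),  # empty think block followed directly by boxed
]

def is_hacking(response):
    return any(p.search(response) for p in ADVERSARIAL_PATTERNS)

# Subtract a penalty term from the training reward
reward = base_reward - 0.5 * is_hacking(response)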

8.5 Tool 5: length / format / cost quotas

# Hard quotas: exceed them and lose reward
if n_tool_calls > 20: reward -= 0.5
if response_length > 16000: reward -= 0.3
if response.count("<think>") > 10: reward -= 0.5  # guards against think spam

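References

DeepSeek R1: arXiv 2501.12948. The reference point for RLVR
Math-Shepherd: automatic PRM labeling
OmegaPRM: Google, MCTS-driven PRM
Solving Math Word Problems with Process- and Outcome-based Feedback (Uesato et al., 2022); see also Let's Verify Step by Step (Lightman et al., 2023). Classic PRM/ORM comparisons
DAPO: paper. Overlong reward shaping, dynamic sampling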

Industry discussion

UC Berkeley RDI: How We Broke Top AI Agent Benchmarks (2026-04): blog post. A major reward hacking incident
Trustworthy Benchmarks (Berkeley 2026): blog post

Hands-on code

TRL DPO/GRPO with verifier examples: github.com/huggingface/trl
OpenRLHF reward functions: github.com/openrlhf/openrlhf
verl reward templates: verl.readthedocs.io
e2b sandbox (sandbox for code verifiers): e2b.dev
