Agent Memory: Empirical Auditing and Negative-Result Methodology, May 11, 2026

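Chapter 4: Preregistration plus the Paired-Evaluation Trio: TOST, McNemar, Cluster Bootstrap

EXPERIMENTS_PREREG.md template, how to write DEVIATIONS.md, choosing binary acc vs F1 as the primary endpoint, TOST equivalence testing, McNemar paired significance, conversation-clustered bootstrap CI, and cumulative-effect plots to guard against optional stopping

preregistration TOST McNemar bootstrap optional stopping cumulative effect

Stopping after 50 questions because the result looks positive, then writing the paper, is the easiest trap to fall into in empirical NLP research. A single reviewer question, "How do you show you did not simply stop when you reached this number?", can undo the whole paper. This chapter gives an engineering template for preregistration plus paired evaluation: before running anything, lock every degree of freedom (hypotheses, primary endpoint, sample size, decision rules); after running, report McNemar, TOST, and clustered-bootstrap results in full; and include a cumulative-effect plot in the paper to show that the run was not stopped midway. All three tools take a few dozen lines of Python with scipy and numpy, and each one done correctly raises the paper's credibility a notch.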

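📑 Contents

1. Preregistration: a commitment written before the experiment
2. Choosing the primary endpoint: binary acc or F1
3. McNemar paired test
4. TOST equivalence test
5. Conversation-clustered bootstrap
6. Cumulative-effect plot: answering the optional-stopping charge
7. DEVIATIONS.md: what to do when you deviate from the preregistration
✅ Self-check list
📚 References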

1. Preregistration: a commitment written before the experiment

1.1 Why preregister

Empirical NLP research has very high researcher degrees of freedom:

- Which backbone?                 4-8 candidates
- Primary endpoint acc or F1?     2
- Sample size 100 or 500?         nearly continuous
- τ threshold 0.30 or 0.50?       continuous
- Which ability categories?       cherry-pickable
- Prompt template v1/v2/v3?       swappable

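If you settle the six items above only after looking at the results, then even with just two options each you are picking the most significant of 2^6 = 64 parallel universes. This is what "researcher degrees of freedom" means in practice: it can inflate a nominal 5% false-positive rate to 30% or more.

🍎 Intuition: preregistration means writing down how you will play every possible hand before the card is turned over. The card is the future experimental result; the written play is the decision rule.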
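The 64-path arithmetic above can be made concrete with a short calculation. The sketch below assumes, purely for illustration, that each analysis path is an independent test of a true null at α = 0.05; real forking paths are correlated, so the numbers show the direction and rough size of the inflation, not an exact rate. The function name is hypothetical.

```python
def any_false_positive_rate(n_paths: int, alpha: float = 0.05) -> float:
    """Chance that at least one of n_paths independent null analyses gives p < alpha."""
    return 1.0 - (1.0 - alpha) ** n_paths


n_paths = 2 ** 6  # six binary choices, as in the list above
for k in (1, 6, n_paths):
    # Assumption: paths are independent, which overstates the inflation in practice.
    print(f"{k:>2} paths -> {any_false_positive_rate(k):.3f}")
```

Under this independence assumption, even six free paths push the chance of at least one spurious "significant" result above one in four.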

1.2 EXPERIMENTS_PREREG.md 模板

# Experiments Pre-Registration

## Hypotheses

**H1a**: ephemeral failure-triggered synthesis (C3) improves paired binary accuracy
       by ≥ 2.0 pp over zero-build retrieval (C1) on LongMemEval-S, with
       McNemar p < 0.05.

**H1b**: persistent caching (C4) improves paired binary accuracy by ≥ 2.0 pp
       over ephemeral (C3) on LoCoMo, with conversation-clustered McNemar
       p < 0.05.

## Primary endpoint
- **paired binary accuracy** (per-question 0/1)
- secondary: token-F1 (LoCoMo official)

## Sample size
- H1a: n=500 (LongMemEval-S full); paired against C1.
- H1b: n=764 (LoCoMo, 5 conv × all QA paired).

## Decision rules
- Pass H1a iff Δ(C3-C1) ≥ +2.0 pp AND McNemar p < 0.05 AND 95% CI excludes 0.
- Pass H1b iff Δ(C4-C3) ≥ +2.0 pp AND clustered McNemar p < 0.05 AND 95% CI excludes 0.

## SESOI (smallest effect size of interest)
- ±2 pp on binary accuracy

## Statistical tests
- McNemar (scipy.stats.binomtest with min(n01,n10), n=n01+n10)
- Paired bootstrap 95% CI (2000 resamples, conversation-clustered for H1b)
- TOST equivalence test against ±2 pp SESOI

## τ choice
- LongMemEval-S: τ = 0.40 (max_score p30 on dev split)
- LoCoMo:        τ = 0.47 (max_score p30 on dev split)

## Stopping rule
- No interim analysis; run to pre-specified n.
- If a sanity replicate (n<100) shows surprising effects, document in
  DEVIATIONS.md but do NOT change main protocol.

## Multiple comparisons
- 2 hypotheses × 1 primary endpoint = 2 tests; report all p values uncorrected,
  do not claim "corrected significance".

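⭕ Minimum bar for preregistration: all nine sections above must be written and committed to git or uploaded to OSF before the main experiment runs. Any later change to any of them must be recorded honestly in DEVIATIONS.md.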
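One way to enforce the minimum bar mechanically is a small check that runs before the main experiment and fails if a section heading is missing. This is a minimal sketch, assuming the file uses `## ` headings exactly as in the template above; the function name and the idea of wiring it into a pre-run hook are illustrative assumptions, not part of the original workflow.

```python
import re

# Section names taken from the template above.
REQUIRED_SECTIONS = [
    "Hypotheses", "Primary endpoint", "Sample size", "Decision rules",
    "SESOI", "Statistical tests", "τ choice", "Stopping rule",
    "Multiple comparisons",
]


def missing_sections(prereg_text: str) -> list[str]:
    """Return the required section names that have no matching '## ' heading."""
    headings = [
        m.group(1).strip().lower()
        for m in re.finditer(r"^##\s+(.+)$", prereg_text, flags=re.MULTILINE)
    ]
    return [
        name for name in REQUIRED_SECTIONS
        if not any(h.startswith(name.lower()) for h in headings)
    ]


# Hypothetical, deliberately incomplete draft.
draft = (
    "## Hypotheses\n...\n"
    "## Primary endpoint\n...\n"
    "## SESOI (smallest effect size of interest)\n...\n"
)
print(missing_sections(draft))
```

An empty list means every section is present; it says nothing about whether the content of each section is adequate.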

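1.3 What can and cannot be preregistered

Can be preregistered ✅	Cannot be preregistered ❌
The specific form of H1/H2	"The results will be interesting"
Primary endpoint choice	"Whichever endpoint looks better"
Sample size	"Run until significant, then stop"
τ threshold	"Tune τ to the most significant value"
Decision rules (pass/fail)	"Decide after looking"
SESOI	"Defined after the fact"

🌟 Core point: what you preregister is the rules, not the results. The results may fail (H1 does not hold); the rules may not change.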
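Because the rules are fixed in advance, the pass/fail decision can be written as code before any result exists. The sketch below encodes the three-part rule from the template (Δ at least +2.0 pp, McNemar p < 0.05, 95% CI excludes 0); the class name and the example numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecisionRule:
    """Pre-registered pass/fail rule; frozen so it cannot be edited after the fact."""
    min_delta_pp: float = 2.0  # minimum effect in percentage points
    alpha: float = 0.05        # McNemar significance threshold

    def passes(self, delta_pp: float, p_value: float,
               ci_lo: float, ci_hi: float) -> bool:
        ci_excludes_zero = ci_lo > 0 or ci_hi < 0
        return (delta_pp >= self.min_delta_pp
                and p_value < self.alpha
                and ci_excludes_zero)


rule = DecisionRule()
# Hypothetical results, for illustration only.
print(rule.passes(delta_pp=3.0, p_value=0.01, ci_lo=0.5, ci_hi=5.5))
print(rule.passes(delta_pp=3.0, p_value=0.20, ci_lo=-1.0, ci_hi=7.0))
```

Committing a function like this alongside EXPERIMENTS_PREREG.md leaves no room to reinterpret the rule once the numbers are in.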

2. Choosing the primary endpoint: binary acc or F1

2.1 How they differ

binary acc = 1 if judge says "yes" else 0
F1         = 2 × precision × recall / (precision + recall)  on tokens

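binary acc: coarse, but robust to prompt sensitivity (a judge makes a single yes/no call)
F1: fine-grained, but heavily dependent on token-level answer format (the same meaning phrased differently can score very differently)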
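The format dependence of F1 is easy to see with a toy scorer. The sketch below is a simplified token-F1 that only lowercases and splits on whitespace; it is not the official LoCoMo scorer, and its normalization is an assumption made for illustration.

```python
from collections import Counter


def token_f1(prediction: str, gold: str) -> float:
    """Simplified token-level F1: lowercase, whitespace tokenization, multiset overlap."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


# Both answers contain the gold fact; a yes/no judge would accept both.
print(token_f1("paris", "paris"))               # terse answer
print(token_f1("she moved to paris", "paris"))  # same fact, wordier phrasing
```

The wordier answer loses more than half its F1 without being any less correct, which is why a backbone that tends to answer in full sentences can look worse on F1 alone.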

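2.2 A real example of the two disagreeing on LoCoMo

System	binary acc	token-F1
C1 (zero-build)	54.8%	28.1%
C4 (persistent cache)	54.8%	27.8%
Δ(C4-C1)	+0.00 pp	-0.30 pp

If you preregistered binary acc as the primary endpoint: H1b measures Δ = 0, p = 0.55. Conclusion: fail.

If you preregistered F1 as the primary endpoint: H1b measures Δ = -0.30, p = 0.62. The conclusion is also a fail, but the reported number invites the reading that the cache made things "slightly worse" when in fact it made no difference at all.

🌟 Recommendation: prefer binary acc as the primary endpoint and F1 as secondary. Reasons:

The SESOI for binary acc is more intuitive (+2 pp is a meaningful gain)
Token-level output format drifts a lot between LLMs, so F1 comparisons across backbones break down
McNemar's paired test applies naturally to binary outcomes

2.3 How to write up the primary endpoint in the paper

Use binary acc in the abstract, the introduction, and the main experiments table; report F1 in a "secondary endpoint" subsection of the experiments.

Do not use F1 in the abstract (because it looks finer-grained) and binary acc in the main experiments table: reviewers will flag that kind of switch as endpoint switching.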

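3. McNemar paired test

3.1 Intuition

Two systems run on the same n questions, each question yielding a 0/1 result, which gives a 2×2 contingency table:

                B right  B wrong
A right  |  n11      n01
A wrong  |  n10      n00

McNemar ignores n11 and n00 (where the two systems agree) and looks only at n10 and n01 (where they disagree).

🍎 Intuition: n10 is the number of questions that A got wrong and B got right; n01 is the reverse. If the two systems perform the same, n10 and n01 should be about equal. If n10 is much larger than n01, B really is stronger than A.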
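The arithmetic behind the exact test is short enough to write out by hand. Under the null, each discordant question is a fair coin flip between "only A right" and "only B right", so the p-value is a two-sided binomial tail. This is a sketch using only the standard library, with hypothetical counts; it relies on the symmetry of the binomial at p = 0.5.

```python
from math import comb


def exact_mcnemar_p(n10: int, n01: int) -> float:
    """Two-sided exact McNemar p-value from the discordant counts alone."""
    n = n10 + n01
    if n == 0:
        return 1.0  # no disagreements, nothing to test
    k = min(n10, n01)
    # P(X <= k) for X ~ Binomial(n, 0.5), doubled for the two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


print(exact_mcnemar_p(8, 8))             # balanced disagreements
print(round(exact_mcnemar_p(15, 5), 4))  # lopsided disagreements
```

Note that the concordant counts n11 and n00 never enter the calculation: a benchmark with 10,000 questions and 20 disagreements has exactly as much evidence as one with 100 questions and 20 disagreements.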

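3.2 scipy implementation (exact binomial)

from scipy.stats import binomtest

def mcnemar(a_correct, b_correct):
    """Exact McNemar test. a_correct and b_correct are 0/1 lists over the same
    questions, equal in length and aligned item by item."""
    if len(a_correct) != len(b_correct):
        raise ValueError("paired lists must have the same length")
    # n01: A right, B wrong; n10: A wrong, B right
    n01 = sum(1 for a, b in zip(a_correct, b_correct) if a == 1 and b == 0)
    n10 = sum(1 for a, b in zip(a_correct, b_correct) if a == 0 and b == 1)
    if n01 + n10 == 0:
        return {"n10": 0, "n01": 0, "p": 1.0}  # no discordant pairs
    p = binomtest(min(n01, n10), n=n01 + n10, p=0.5,
                  alternative="two-sided").pvalue
    return {"n10": n10, "n01": n01, "p": p}

# Example: H1b from the module author's paper (A = C3, B = C4)
res = mcnemar(c3_correct_list, c4_correct_list)
print(f"n10={res['n10']} n01={res['n01']} p={res['p']:.3f}")
# Output: n10=8 n01=8 p=1.000  → no detectable difference between C4 and C3
# (claiming equivalence still requires TOST, see section 4)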

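3.3 Why not a paired t-test

A paired t-test assumes the differences are normally distributed. With a binary outcome the per-question difference can only be -1, 0, or +1, which is neither continuous nor normal. McNemar is the correct paired test for binary outcomes.

⭕ Audit tip: if a paper reports a "paired t-test" for a binary acc comparison, treat it as a red flag about the authors' statistical practice and read the p-values with caution.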

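4. TOST equivalence test

4.1 Purpose

McNemar addresses "H1: the two systems differ". A non-significant McNemar result does not license the claim that the two are the same. To claim equivalence you need a different test, TOST (Two One-Sided Tests), whose null hypothesis is that the difference is at least as large as the SESOI in one direction or the other; rejecting both one-sided nulls supports equivalence.

🧠 Key point: p > 0.05 does not mean "equivalent". It only means "no evidence of a difference". To assert equivalence you must run TOST.

4.2 How to do it (based on a paired bootstrap CI)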

import numpy as np

def tost_via_bootstrap(a_correct, b_correct, sesoi_pp=2.0, n_boot=2000, ci=0.90):
    """
    If the (1 - 2α) CI lies entirely inside [-SESOI, +SESOI], reject the
    non-equivalence null and claim equivalence within ±SESOI.
    TOST uses the (1 - 2α) CI, not the 95% CI: α = 0.05 corresponds to a 90% CI.
    """
    # per-question paired difference (B - A), in percentage points
    arr = np.array([(b - a) * 100 for a, b in zip(a_correct, b_correct)])
    n = len(arr); rng = np.random.default_rng(42)
    # resample questions with replacement; each draw keeps its (A, B) pair intact
    boots = [arr[rng.choice(n, n, replace=True)].mean() for _ in range(n_boot)]
    lo, hi = np.percentile(boots, [(1-ci)/2*100, (1-(1-ci)/2)*100])
    equivalent = (-sesoi_pp <= lo) and (hi <= sesoi_pp)
    return {"ci_lo": lo, "ci_hi": hi, "sesoi_pp": sesoi_pp, "equivalent": equivalent}

# Example
res = tost_via_bootstrap(c3_list, c4_list, sesoi_pp=2.0)
# If the 90% CI is [-1.50, +1.20], it lies inside ±2 → equivalent
# If the 90% CI is [-3.20, +0.80], it crosses -2 → not equivalent (a meaningful negative effect is possible)
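The CI-inclusion rule above has an equivalent p-value form, which some readers find easier to report. The sketch below uses a normal approximation to the mean paired difference instead of the bootstrap, which is a swapped-in technique and only reasonable for fairly large n; the function name and the synthetic data are hypothetical. Under this approximation, the TOST p-value is below α exactly when the (1 - 2α) CI lies inside ±SESOI.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def tost_normal(a_correct, b_correct, sesoi_pp=2.0, alpha=0.05):
    """TOST on paired 0/1 outcomes via a normal approximation (large-n assumption)."""
    diffs = [(b - a) * 100 for a, b in zip(a_correct, b_correct)]
    d = mean(diffs)
    se = stdev(diffs) / sqrt(len(diffs))
    if se == 0:
        # All differences identical: no sampling variation to model.
        inside = abs(d) < sesoi_pp
        return {"delta_pp": d, "p_tost": 0.0 if inside else 1.0, "equivalent": inside}
    z = NormalDist()
    p_lower = 1 - z.cdf((d + sesoi_pp) / se)  # one-sided test of H0: Δ <= -SESOI
    p_upper = z.cdf((d - sesoi_pp) / se)      # one-sided test of H0: Δ >= +SESOI
    p_tost = max(p_lower, p_upper)            # both one-sided nulls must be rejected
    return {"delta_pp": d, "p_tost": p_tost, "equivalent": p_tost < alpha}


# Synthetic data: 1000 questions, 20 won only by A, 20 won only by B.
a = [1] * 20 + [0] * 20 + [1] * 960
b = [0] * 20 + [1] * 20 + [1] * 960
res_normal = tost_normal(a, b)
print(res_normal)
```

With few questions or few discordant pairs the normal approximation is poor, and the bootstrap version is the safer choice.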

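4.3 How to write it up honestly

Do not write "the two are equivalent" without qualification: a TOST equivalence claim holds only relative to the stated SESOI.

✅ Correct:

"The 90% paired-bootstrap CI for Δ(C4-C3) is [-1.20, +1.50] pp, falling within the pre-registered ±2 pp SESOI; we therefore reject differences larger than ±2 pp in either direction (TOST, α = 0.05)."

❌ Incorrect:

"C4 and C3 are statistically equivalent (TOST p > 0.05)."

The incorrect version omits the SESOI and gets the direction wrong: TOST supports equivalence when its p-value is below α, not above it.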

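5. Conversation-clustered bootstrap

5.1 The problem: LoCoMo questions are not independent of each other

A LoCoMo conversation has ~199 questions, all sharing the same dialogue history. So when question 50 is answered wrong, the cause may well be not that the system is weak but that one session of this conversation is missing information, and that same gap will contaminate other questions from the same conversation.

An ordinary bootstrap (resampling over the n=764 questions) assumes the questions are independent, so the CI comes out artificially narrow.

5.2 The fix: cluster bootstrap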

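import numpy as np

def clustered_bootstrap(per_conv_results, n_boot=2000, ci=0.95):
    """
    per_conv_results: list of arrays, one per conversation, each holding the
    per-question paired differences within that conversation.
    Resample at the conversation level and carry along every question in
    each chosen conversation.
    """
    n_conv = len(per_conv_results)
    rng = np.random.default_rng(42)
    boots = []
    for _ in range(n_boot):
        # draw whole conversations with replacement
        chosen_convs = rng.choice(n_conv, n_conv, replace=True)
        all_diffs = np.concatenate([per_conv_results[c] for c in chosen_convs])
        boots.append(all_diffs.mean()*100)  # mean difference in pp
    lo, hi = np.percentile(boots, [(1-ci)/2*100, (1-(1-ci)/2)*100])
    # "mean" is the average of the bootstrap replicates, not the observed point estimate
    return {"ci_lo": lo, "ci_hi": hi, "mean": np.mean(boots)}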

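🌟 Effect: in the module author's paper, the ordinary bootstrap CI is [-2.1, +1.5] and the clustered version is [-5.84, +1.97], so the CI more than doubles in width. Reporting only the ordinary bootstrap overstates the precision of the estimate.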
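The widening can be reproduced on synthetic data. In the sketch below, five hypothetical conversations of 100 questions each have conversation-level mean differences ranging from -20 pp to +20 pp; all numbers are invented for illustration and the helper names are assumptions. It uses only the standard library so that the two resampling schemes can be compared side by side.

```python
import random


def percentile_ci(samples, ci=0.95):
    """Approximate percentile interval from a list of bootstrap replicates."""
    s = sorted(samples)
    lo_idx = int((1 - ci) / 2 * (len(s) - 1))
    hi_idx = int((1 + ci) / 2 * (len(s) - 1))
    return s[lo_idx], s[hi_idx]


def naive_ci(per_conv, n_boot=2000, seed=0):
    """Resample individual questions, ignoring which conversation they came from."""
    rng = random.Random(seed)
    flat = [d for conv in per_conv for d in conv]
    boots = [100 * sum(rng.choices(flat, k=len(flat))) / len(flat)
             for _ in range(n_boot)]
    return percentile_ci(boots)


def cluster_ci(per_conv, n_boot=2000, seed=0):
    """Resample whole conversations, keeping each one's questions together."""
    rng = random.Random(seed)
    boots = []
    for _ in range(n_boot):
        picked = rng.choices(per_conv, k=len(per_conv))
        flat = [d for conv in picked for d in conv]
        boots.append(100 * sum(flat) / len(flat))
    return percentile_ci(boots)


def make_conv(n_minus, n_plus, size=100):
    """Synthetic conversation: n_minus losses (-1), n_plus wins (+1), rest ties (0)."""
    return [-1] * n_minus + [1] * n_plus + [0] * (size - n_minus - n_plus)


per_conv = [make_conv(30, 10), make_conv(20, 10), make_conv(10, 10),
            make_conv(10, 20), make_conv(10, 30)]
n_lo, n_hi = naive_ci(per_conv)
c_lo, c_hi = cluster_ci(per_conv)
print(f"naive   95% CI: [{n_lo:+.1f}, {n_hi:+.1f}] pp")
print(f"cluster 95% CI: [{c_lo:+.1f}, {c_hi:+.1f}] pp")
```

With only a handful of clusters the bootstrap distribution is coarse, so the clustered interval should itself be read as approximate.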

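5.3 Does LongMemEval not need clustering?

LongMemEval has one query per conversation, so every question comes with its own independent conversation. Questions are independent by construction and no clustering is needed. H1a can therefore use an ordinary paired bootstrap, while H1b must be clustered.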

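6. Cumulative-effect plot: answering the optional-stopping charge

6.1 Why it is needed

The doubt in the reviewer's mind:

"You say H1a gives Δ = -1.5 pp at n=400. But I suspect you saw +5 pp at n=50 early on, which is why you were motivated to keep going; had it been -5 pp early on, you would have quietly stopped."

This is optional stopping: implicitly choosing a favorable moment to report. A cumulative-effect plot is the visual rebuttal: plot the Δ estimate as a function of cumulative n and let the reviewer see the trend directly.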
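The cost of peeking can be simulated directly. The sketch below generates paired data under a true null (discordant pairs split 50/50), then compares a single test at the pre-specified n against "test at every interim look and declare success if any look is significant". All parameters are hypothetical, and the exact two-sided binomial p-value is computed with the standard library.

```python
import random
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def exact_p(k: int, n: int) -> float:
    """Two-sided exact binomial p-value for k of n discordant pairs under p = 0.5."""
    if n == 0:
        return 1.0
    tail = sum(comb(n, i) for i in range(min(k, n - k) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def simulate(n_sims=300, n_total=400, look_every=20, p_discordant=0.3, seed=0):
    """Return (false-positive rate at fixed n, false-positive rate with peeking)."""
    rng = random.Random(seed)
    fixed_hits = peeking_hits = 0
    for _ in range(n_sims):
        n10 = n01 = 0
        rejected_at_some_look = False
        for i in range(1, n_total + 1):
            if rng.random() < p_discordant:  # the two systems disagree on this item
                if rng.random() < 0.5:       # the null is true: either side equally likely
                    n10 += 1
                else:
                    n01 += 1
            if i % look_every == 0 and exact_p(n10, n10 + n01) < 0.05:
                rejected_at_some_look = True
        peeking_hits += rejected_at_some_look
        fixed_hits += exact_p(n10, n10 + n01) < 0.05
    return fixed_hits / n_sims, peeking_hits / n_sims


fixed_rate, peeking_rate = simulate()
print(f"fixed n: {fixed_rate:.3f}   with peeking: {peeking_rate:.3f}")
```

Because the final look is one of the interim looks, the peeking rate can never be lower than the fixed-n rate; the simulation shows how much higher it gets.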

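6.2 A real case (P5 in the module author's paper)

Δ(C3-C1) (pp)
  ^
+5|     ●  (n=10, early noise Δ=+5)
+3|       ●●●
 0|----------●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●  (stable)
-2|                               ●●● (n=400 final)
  +-----------------------------------> n
  10   50    100   200   300   400

The plot shows clearly:

Early on, at n=10-50, variance is large and Δ drifts within [-7, +12]
From n≈100 onward, Δ settles near -1.5 pp
At the full preregistered n=400, Δ = -1.5 pp with CI [-4.25, +1.25]

🌟 The plot is itself evidence that the run was not stopped early: anyone inclined to stop early would have stopped at +5 pp and written the paper, not run on to -1.5 pp.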

6.3 Implementation

def plot_cumulative(c1_correct, c3_correct, n_boot=500, out_pdf="P5.pdf"):
    import numpy as np
    import matplotlib.pyplot as plt
    # per-question paired difference (C3 - C1), in pp
    d = (np.asarray(c3_correct) - np.asarray(c1_correct)) * 100.0
    ns = list(range(10, len(d) + 1))
    diffs, los, his = [], [], []
    rng = np.random.default_rng(42)
    for n in ns:
        diffs.append(d[:n].mean())
        # paired bootstrap: draw question indices once per replicate, so each
        # question keeps its own (C1, C3) pair; resampling the two systems
        # separately would break the pairing
        idx = rng.integers(0, n, size=(n_boot, n))
        boots = d[:n][idx].mean(axis=1)
        lo, hi = np.percentile(boots, [2.5, 97.5])
        los.append(lo); his.append(hi)

    fig, ax = plt.subplots(figsize=(10,5))
    ax.fill_between(ns, los, his, alpha=0.2, label="95% CI")
    ax.plot(ns, diffs, linewidth=2, label="paired Δ (cumulative)")
    ax.axhline(0, color="k", linewidth=0.7)
    ax.axhline(2, ls="--", color="r", alpha=0.7, label="preregistered +2 pp")
    ax.set_xlabel("paired n (cumulative)")
    ax.set_ylabel("Δ accuracy (pp)")
    ax.legend()  # the labels above are only shown if a legend is drawn
    fig.savefig(out_pdf, bbox_inches="tight")

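6.4 Who should look at this plot

Authors: put it in the Appendix or in a subsection of the experiments
Reviewers: one glance is enough to check for stopping bias
Later readers: compare it against their own cumulative curve when reproducing

7. DEVIATIONS.md: what to do when you deviate from the preregistration

Preregistration is a commitment, not a contract. Finding that something has to change once the experiment is running is normal. What matters is recording every deviation honestly.

7.1 Template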

# Deviations from EXPERIMENTS_PREREG.md

## Deviation 1: H1 → H1a + H1b split

**Original**: H1: persistent caching (C4) improves over zero-build (C1) by ≥ 2 pp.
**Actual**: Split into H1a (LongMemEval-S, single-query) and H1b (LoCoMo, multi-query).
**Reason**: LongMemEval-S is single-query per conversation; cache reuse is structurally
           impossible. We split before observing C3/C4 main results.
**Impact**: Conservative—both sub-hypotheses fail, so the split does not change conclusion direction.

## Deviation 2: τ recalibration on LoCoMo

**Original**: τ = 0.40 (LongMemEval-S p30) used across both benchmarks.
**Actual**: τ = 0.47 (LoCoMo p30) on LoCoMo runs.
**Reason**: LoCoMo's max_score distribution has higher central tendency than
           LongMemEval-S's; using LongMemEval-S τ would under-trigger.
**Impact**: Both τ values chosen on dev splits, not main eval. We additionally report
           τ-sensitivity sweep in Appendix B showing all τ ∈ [0.30, 0.55] give Δ ≤ 0.

## Deviation 3: H1a sample size 500 → 400

**Original**: n=500 paired.
**Actual**: n=400 paired (multi-session and temporal-reasoning categories truncated 133→83).
**Reason**: API budget; truncation removes items from precisely the categories where a
           cache primitive would be most expected to help, so this subset is
           CONSERVATIVE against our null finding.
**Impact**: Results would only get more null with full 500.

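7.2 Three iron rules for writing DEVIATIONS

Honesty first: give every deviation a "Reason", and do not dress up the motive
Direction of impact: state plainly whether the deviation makes the conclusion more conservative or more aggressive; the former is harmless, the latter must be justified
Keep the original: do not delete or edit EXPERIMENTS_PREREG.md in favor of DEVIATIONS.md; the two files coexist

7.3 How reviewers react to DEVIATIONS.md

Deviation type	Reviewer reaction
Sample size reduced, in the conservative direction	✅ Accept
τ recalibrated on a dev split, with a sensitivity plot attached	✅ Accept
Primary endpoint switched (acc → F1)	❌ Reject
"Non-significant backbones" removed	❌ Reject (cherry-picking)
H found too weak after the run and reformulated	⚠️ Major Revision, with a request to re-preregister and rerun

🌟 Core point: a conservative deviation plus honest disclosure earns credit; an aggressive deviation is fatal.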

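✅ Self-check list

Preregistration file: can write an acceptable EXPERIMENTS_PREREG.md (hypotheses / endpoints / sample size / decision rules / SESOI)
Primary endpoint: can choose between binary acc and F1 for your own project and explain why
McNemar: can compute a paired p-value with scipy.stats.binomtest and explain why a paired t-test is inappropriate
TOST: can implement a TOST equivalence test with a paired bootstrap 90% CI
Clustered bootstrap: can explain why LoCoMo evaluation must cluster at the conversation level
Cumulative plot: can draw the plot with matplotlib and explain how it answers an optional-stopping challenge
DEVIATIONS: can tell a conservative deviation from an aggressive one and write each up honestly
Trio compliance checklist: can score a paper on the "five pieces of evidence" within one minute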

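📚 References

Introductory concepts

Pre-registration: A Discussion (OSF / Center for Open Science): the preregistration workflow from psychology, which this chapter ports to NLP
Optional Stopping (Daniël Lakens, blog series): explains how deciding when to stop by looking at the results inflates type-I error under NHST

Key papers

Equivalence Testing with TOST (Lakens, 2017): the basis for carrying TOST over to NLP
EXPERIMENTS_PREREG.md / DEVIATIONS.md from the module author's paper: complete real templates to borrow from
LongMemEval (Wu et al., ICLR 2025): arXiv 2410.10813, source of the official judge prompt for the binary acc primary endpoint
LoCoMo (Maharana et al., 2024): arXiv 2402.17753, source of the official token-F1 metric used as the secondary endpoint

Community discussion

EMNLP Reproducibility Checklist: does not mandate preregistration, but encourages disclosure of all RDF (researcher degrees of freedom)
Sample reviewer concerns (OpenReview): search for "optional stopping" / "preregistration"; their frequency in NLP reviews has risen noticeably over the past two years

Framework documentation (where applicable)

scipy.stats.binomtest: official documentation
scipy.stats.bootstrap: official documentation; a ready-made bootstrap CI (no direct cluster support, so writing your own is recommended)
OSF (Open Science Framework): osf.io, a free hosting platform for preregistrations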
