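Agent Memory: Empirical Auditing and Negative-Result Methodology, May 11, 2026

Chapter 7: Five Failure Modes of Negative-Result Papers (Avoiding Pitfalls and Disclosing Honestly)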

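The five most common failure modes of negative-result papers: mock-vs-prod gap, optional stopping, reversal under scaling, hidden oracle, and cherry-picked baselines. Each comes with real cases and a matching prevention strategy, including an honest record from one of the author's own papers, where the P2 ablation showed +8 pp (p=0.011) on 1-conv n=100 but reversed to -1.81 pp on 3-conv n=387.

Failure modes, optional stopping, reversal, hidden oracle, cherry-picking, honest disclosure

Negative-result papers are harder to publish than positive-result papers, because reviewers assume by default that “maybe it just wasn't tuned well.” This chapter goes through the 5 failure modes reviewers catch most often. It gives real cases (including the honest record from the module author's paper, whose P2 ablation showed Δ=+8 pp, p=0.011 on 1-conv n=100 but reversed to -1.81 pp on 3-conv n=387), a prevention strategy for each mode, and a template for “how to write it up honestly once you have hit it.” Once you can identify, prevent, and disclose these 5 modes, your negative-result paper has the basic immunity it needs to survive tough review.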

1. Mode-1: Mock vs Prod Gap

1.1 Description

A conclusion obtained on a small sample, a simplified pipeline, or mock data does not hold on production, the full sample, or the real pipeline.

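1.2 Real cases

“We validated on a 50-question mock LongMemEval that method X improves by 5 pp”: after switching to the official 500 questions and the official judge, the improvement drops to 0
“Our custom GPT-4 judge gives acc 82%”: after switching to the LongMemEval official judge, acc is 54%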

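1.3 How it happens

The mock data distribution deviates from the real distribution
The simplified pipeline lacks a key confound control (e.g. hybrid retrieval)
There is a calibration gap between the custom judge and the official judge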

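1.4 Prevention

✅ Use the official benchmark and the official judge from day 1, even if you only run 50 questions
✅ Use stratified sampling for small-sample pilot runs to guarantee category coverage
✅ Keep the dev split and the test split strictly separate: tune hyperparameters on dev, run the main experiment on test

🌟 Lesson from the module author's paper: an early 1-conv n=100 run gave +8 pp for the P2 ablation, which made H look partially supported; after switching to 3-conv n=387 the effect reversed. Had only the 100-question data been published, it would have fallen straight into mock-vs-prod.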
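The stratified pilot sampling recommended in 1.4 can be sketched as follows. This is a minimal illustration, not the module's actual tooling: the function name `stratified_sample`, the `category` field, and the toy 500-question benchmark are all assumptions made for the example.

```python
import random
from collections import defaultdict


def stratified_sample(items, key, n, seed=0):
    """Draw roughly n items, giving each category a share proportional to its size."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for item in items:
        groups[key(item)].append(item)

    sample = []
    for category in sorted(groups):
        members = groups[category]
        # At least one item per category, so no category is silently dropped.
        quota = max(1, round(n * len(members) / len(items)))
        sample.extend(rng.sample(members, min(quota, len(members))))
    return sample


# Hypothetical 500-question benchmark with three categories of unequal size.
benchmark = (
    [{"id": i, "category": "single-session"} for i in range(250)]
    + [{"id": i, "category": "multi-session"} for i in range(250, 450)]
    + [{"id": i, "category": "temporal-reasoning"} for i in range(450, 500)]
)
pilot = stratified_sample(benchmark, key=lambda qa: qa["category"], n=50)
print(len(pilot), sorted({qa["category"] for qa in pilot}))
```

Because quotas are rounded per category, the result can differ from n by a few items when category sizes do not divide evenly. Fixing the seed keeps the pilot subset reproducible, so it can be listed in the preregistration.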

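2. Mode-2: Optional Stopping

2.1 Description

You stop the experiment as soon as you see a number you like and keep running (or restart) when you see one you do not, which implicitly picks the most favorable sample size.

2.2 Why it is a problem

NHST (null hypothesis significance testing) assumes the sample size is fixed in advance. If you decide when to stop by watching the p-value, the real probability of getting p < 0.05 under the null inflates from the nominal 5% to 20% or more as the number of peeks grows.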

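2.3 Real case (not from a paper, but common in the community)

Day 1: ran n=50, p=0.18, "let's run a bit more"
Day 2: ran n=100, p=0.06, "so close"
Day 3: ran n=150, p=0.03, "done, write the paper!"

→ This p=0.03 is “the most significant of 3 stopping points,” so the real type-I error is well above 0.05.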
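The inflation can be checked with a small simulation. The sketch below assumes a simplified setting that is not taken from the source: normally distributed per-question scores with known variance, a two-sided z-test, and a true effect of exactly zero. The function names are hypothetical.

```python
import math
import random


def two_sided_p(z):
    """Two-sided p-value for a standard-normal z statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def false_positive_rate(looks, n_sims=2000, alpha=0.05, seed=0):
    """Share of null simulations that reach p < alpha at any of the given looks."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        # The true effect is exactly zero: every "significant" run is a false positive.
        data = [rng.gauss(0.0, 1.0) for _ in range(max(looks))]
        for n in looks:
            z = sum(data[:n]) / math.sqrt(n)  # z-test with known sigma = 1
            if two_sided_p(z) < alpha:
                hits += 1
                break  # the researcher stops at the first "significant" look
    return hits / n_sims


fixed = false_positive_rate([150])             # sample size fixed in advance
peeking = false_positive_rate([50, 100, 150])  # stop at whichever look "works"
print(f"fixed n=150: {fixed:.3f}   peeking at 50/100/150: {peeking:.3f}")
```

With a fixed n the rate should land near the nominal 5%, while the three-look schedule should come out visibly higher. Adding more looks to the list pushes it higher still.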

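2.4 Prevention

Preregister the sample size (Chapter 4, §1)
Commit before running: “stop at N, whatever the result”
Report a cumulative-effect plot (Chapter 4, §6) so reviewers can see the trend for themselves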
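The data behind a cumulative-effect plot is just a running mean of paired differences. A minimal sketch, assuming each condition is scored 0/1 on the same questions in the same order (the function name and the toy scores are made up for illustration):

```python
def cumulative_effect(treated, control):
    """Running mean of paired per-question differences, in percentage points."""
    if len(treated) != len(control):
        raise ValueError("both conditions must be scored on the same questions")
    running, total = [], 0
    for i, (t, c) in enumerate(zip(treated, control), start=1):
        total += t - c
        running.append(100.0 * total / i)
    return running


# Toy per-question correctness (1 = correct) for two conditions.
treated = [1, 1, 0, 1, 0, 1, 1, 0]
control = [0, 1, 0, 0, 1, 1, 1, 0]
curve = cumulative_effect(treated, control)
print([round(x, 1) for x in curve])
```

Plotting `curve` against the question index shows how wildly the estimate swings at small n, which is the information a reviewer needs to judge whether the stopping point mattered.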

2.5 How to write it up after you hit it

If you really did stop midway, say so honestly:

## DEVIATIONS Deviation 4: Early termination

Original sample target: n=500.
Actual: n=300, terminated due to API budget.
At n=300, Δ = X pp, p = Y; cumulative-effect plot (Appendix A)
shows Δ stabilized after n≈150, suggesting that the conclusion is
not sensitive to the exact n.

🌟 Key point: stopping early is not fatal in itself; hiding it is.

3. Mode-3: Reversal under Scaling

3.1 Description

An effect that is positive and significant on a small sample reverses or vanishes on a large sample.

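3.2 Real case (the module author's paper, P2 ablation)

Sample size	Δ(C5 - C4)	p	The “story” at the time
1 conv, n=100	+8.0 pp	0.011	“Displacement is the problem! Separating the channels gives a significant +8 pp!”
3 conv, n=387	-1.81 pp	0.92	“No effect; the cluster CI fully straddles 0”

🌟 The direction reversed and the significance disappeared. Had the author published only the 100-question data, the paper would have drawn the wrong conclusion that “displacement is the cause of the H1b failure”; the 3-conv rerun flatly contradicted it.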
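One way to see why a single-conversation result is fragile is a cluster bootstrap, which resamples whole conversations instead of individual questions. The source does not say how its cluster CI was computed, so the sketch below is only one plausible construction, and the per-question differences are invented for illustration.

```python
import random


def cluster_bootstrap_ci(clusters, n_boot=2000, alpha=0.05, seed=0):
    """Pooled mean difference (pp) with a percentile CI from resampling whole clusters."""
    rng = random.Random(seed)
    names = list(clusters)
    pooled = [d for name in names for d in clusters[name]]
    point = 100.0 * sum(pooled) / len(pooled)

    means = []
    for _ in range(n_boot):
        # Resample conversations, not questions: questions inside one
        # conversation share context and are therefore correlated.
        drawn = rng.choices(names, k=len(names))
        diffs = [d for name in drawn for d in clusters[name]]
        means.append(100.0 * sum(diffs) / len(diffs))
    means.sort()
    lo = means[int(alpha / 2 * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return point, lo, hi


# Made-up per-question differences (treated minus control, each in {-1, 0, +1}).
clusters = {
    "conv_a": [1] * 8 + [0] * 92,    # +8 pp inside this conversation
    "conv_b": [-1] * 5 + [0] * 95,   # -5 pp
    "conv_c": [-1] * 6 + [0] * 94,   # -6 pp
}
point, lo, hi = cluster_bootstrap_ci(clusters)
print(f"pooled delta = {point:+.2f} pp, cluster CI = [{lo:+.2f}, {hi:+.2f}]")
```

With only three clusters the interval is very coarse, since there are few distinct resamples. The point of the sketch is the unit of resampling: one conversation showing +8 pp says little once conversation-level variation is counted.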

3.3 How to write it honestly (actual text from §4.7 of the module author's paper)

“P2 ablation reversal (transparent disclosure). A 1-conversation $n=100$ sanity run for the separate-channel ablation showed $\Delta(\text{C5}-\text{C4}) = +8$ pp with $p=0.011$, suggesting displacement was the harm mechanism. The pre-committed 3-conversation main run ($n=387$) reverses this to $-1.81$ pp ($p=0.92$). We report both transparently in §4.7; the reversal underscores the importance of pre-registered scaling targets over post-hoc small-sample mechanism stories. The conclusion at main scale---no cache configuration helps---is the result we treat as canonical.”

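Key points:

Report both the small-sample and the large-sample result
State explicitly that the large-sample result overrides the small-sample one
Do not delete the small-sample data: it is itself a teaching example of “why preregistration is needed”

🌟 Key point: reversal is a regular occurrence in empirical NLP research. Acknowledge it and write it up as part of the paper's methodological value.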

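3.4 Prevention

Preregister the sample size (same as Mode-2)
A small-sample sanity run is not a conclusion, only a test of “does the code run end to end”
Keep intermediate results from the scaling phase out of the abstract and intro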

4. Mode-4: Hidden Oracle

4.1 Description

Using oracle fields provided by the benchmark (question_type / answer_id / evidence_span) without declaring it, which makes the results look stronger than they really are.

4.2 Real code pattern (from a real 2026 paper)

# query entry point
def query(self, question, question_type=None):
    if question_type == "temporal-reasoning":
        chunks = self.temporal_index.retrieve(question, k=5)
    elif question_type == "multi-session":
        chunks = self.cross_session_summary.retrieve(question, k=3)
    else:
        chunks = self.default_retriever.retrieve(question, k=10)
    return self.llm.answer(chunks, question)

# evaluation script
for qa in benchmark:
    pred = system.query(qa["question"], question_type=qa["category"])  # ← oracle

→ The “adaptive routing” is actually oracle question_type routing. If the paper does not declare it, that amounts to an unearned 5-15 pp.

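4.3 Prevention (if you are the author)

When running experiments, have the evaluation script pass only the question, never the metadata
If the design really requires an oracle (like the artifact_type routing in the module author's paper), state explicitly that it is an oracle upper bound and do not present it as the final solution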
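Both rules can be enforced in the harness by making oracle access an explicit switch that is off by default. In the sketch below, `ToySystem`, the exact-match scoring, and the two-question benchmark are stand-ins invented for illustration; a real run would call the official judge instead.

```python
class ToySystem:
    """Stand-in memory system: it only answers correctly when handed the question type."""

    def query(self, question, question_type=None):
        if question_type == "temporal-reasoning":
            return "2024"
        return "unknown"


def evaluate(system, benchmark, use_oracle=False):
    """Exact-match accuracy; metadata reaches the system only if use_oracle is True."""
    correct = 0
    for qa in benchmark:
        if use_oracle:
            pred = system.query(qa["question"], question_type=qa["category"])
        else:
            pred = system.query(qa["question"])  # question text only
        correct += int(pred == qa["answer"])
    return correct / len(benchmark)


benchmark = [
    {"question": "When did I move?", "category": "temporal-reasoning", "answer": "2024"},
    {"question": "What is my dog's name?", "category": "single-session", "answer": "Rex"},
]
blind = evaluate(ToySystem(), benchmark)
oracle = evaluate(ToySystem(), benchmark, use_oracle=True)
print(f"no oracle: {blind:.2f}   oracle upper bound: {oracle:.2f}")
```

Reporting both numbers side by side makes the size of the oracle advantage visible, so the upper-bound configuration cannot be mistaken for the deployable one.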

4.4 Honest wording (§3 of the module author's paper)

“Artifact type is selected from the benchmark-provided question_type label, giving our trigger module oracle access to question type. This favors our condition; we report it as an upper-bound configuration.”

🌟 Key point: using an oracle is not the sin; hiding it is.

5. Mode-5: Cherry-Picked Baselines

5.1 Description

Deliberately choosing weak baselines so that your own method looks strong, or running several baselines and reporting only the weakest.

5.2 Real cases

“Our method improves on the OpenAI memory baseline by +20 pp”: but OpenAI memory is known as the weakest RAG baseline, and against a strong hybrid + bge-large baseline the gain is only +1 pp
“Our LoCoMo F1 = 65%”: but the retriever is dense-only MiniLM, and swapping in hybrid + bge-large brings the baseline itself to 60%

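5.3 Prevention

✅ Run at least 2-3 baselines, covering:

The acknowledged strongest zero-build retrieval (hybrid + bge-large + RRF)
The most SOTA contemporary memory systems (Mem0 / A-Mem / MemoryOS)
A minimal baseline (SimpleMem style)

✅ Run paired comparisons against every baseline, not just the weakest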
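A paired comparison against every baseline can be sketched as a loop over a question-level paired bootstrap. Everything here is illustrative: the function names, the baseline labels, and the 0/1 scores are made up. The sketch also resamples questions independently, which ignores conversation-level clustering.

```python
import random


def paired_bootstrap(ours, baseline, n_boot=2000, seed=0):
    """Mean paired difference (pp) with a 95% percentile bootstrap CI over questions."""
    if len(ours) != len(baseline):
        raise ValueError("both systems must be scored on the same questions")
    rng = random.Random(seed)
    diffs = [o - b for o, b in zip(ours, baseline)]
    n = len(diffs)
    point = 100.0 * sum(diffs) / n
    boots = sorted(100.0 * sum(rng.choices(diffs, k=n)) / n for _ in range(n_boot))
    return point, boots[int(0.025 * n_boot)], boots[int(0.975 * n_boot) - 1]


def compare_all(ours, baselines):
    """Run the same paired comparison against every baseline, weak or strong."""
    return {name: paired_bootstrap(ours, scores) for name, scores in baselines.items()}


# Toy per-question correctness on the same 100 questions.
ours = [1] * 60 + [0] * 40
baselines = {
    "weak": [1] * 40 + [0] * 60,
    "strong": [1] * 58 + [0] * 42,
}
results = compare_all(ours, baselines)
for name, (point, lo, hi) in results.items():
    print(f"vs {name}: {point:+.1f} pp, 95% CI [{lo:+.1f}, {hi:+.1f}]")
```

Putting all rows in one table is the point: a large gain over the weak baseline next to a near-zero gain over the strong one tells the reader which comparison carries the claim.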

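5.4 How to write it up after you hit it

If running the strong baseline shows that your own method brings no improvement, that is the real empirical finding. The module author's paper is exactly such a case: it found that “strong retrieval + LLM extract is actually worse than strong retrieval-only.”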

6. An “anti-failure-mode” self-check table

Before running experiments and before submitting, tick off each item:

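#	Check item	Guards against	Done?
1	Use the benchmark's official judge prompt	Mock-vs-prod	☐
2	dev / test split kept strictly separate	Mock-vs-prod	☐
3	EXPERIMENTS_PREREG.md committed before the main experiment runs	Optional stopping	☐
4	Main-experiment sample size fixed in advance	Optional stopping	☐
5	Cumulative-effect plot included in the paper	Optional stopping	☐
6	Small-sample sanity results kept out of the abstract	Reversal	☐
7	Every reversal recorded honestly in DEVIATIONS.md	Reversal	☐
8	Evaluation does not use question_type / metadata	Hidden oracle	☐
9	If an oracle is used, the paper states it is an upper bound	Hidden oracle	☐
10	At least 2 baselines, including a strong zero-build one	Cherry-pick	☐
11	Paired comparison against every baseline	Cherry-pick	☐
12	Positive control in §experiments	Combined (Mock + Cherry)	☐

🌟 Tick every item before submitting to a top venue. After ticking all 12, the module author's paper moved from 6.5/10 to 7.0/10 Accept.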

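7. How to write it up honestly after you hit one

Any of the 5 modes can show up while you are running experiments. What matters is how you write it up.

7.1 Golden rules

Do not delete data: keep every raw json you ever produced
Do not change the story: a preregistered H cannot be revised after the fact, it can only fail
Do not hide outliers: a single backbone going the other way, a single-sample reversal, write them all down honestly
Add a DEVIATIONS.md instead of editing PREREG.md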

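7.2 Template

## DEVIATION X: {mode name}

**Discovered**: {date} when {running experiment X}
**Original plan**: {the commitment in the original PREREG}
**Actual result**: {what actually came out}
**Direction**: {whether it is conservative or liberal for the conclusion}
**Action taken**: {what you did, usually "ran to completion under the original PREREG, no change of plan"}
**Disclosure**: stated explicitly in paper §X.Y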
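If deviations are logged from scripts, the template can be rendered from a small record type so that no field gets skipped. A minimal sketch, assuming a hypothetical `Deviation` dataclass; the example values reuse the early-termination case from 2.5, with placeholders wherever the source gives none.

```python
from dataclasses import dataclass


@dataclass
class Deviation:
    number: int
    mode: str
    discovered: str
    original_plan: str
    actual_result: str
    direction: str
    action_taken: str
    disclosure: str

    def to_markdown(self):
        """Render one DEVIATIONS.md entry with the fields of the 7.2 template."""
        return "\n".join([
            f"## DEVIATION {self.number}: {self.mode}",
            "",
            f"**Discovered**: {self.discovered}",
            f"**Original plan**: {self.original_plan}",
            f"**Actual result**: {self.actual_result}",
            f"**Direction**: {self.direction}",
            f"**Action taken**: {self.action_taken}",
            f"**Disclosure**: {self.disclosure}",
        ])


entry = Deviation(
    number=4,
    mode="Early termination",
    discovered="{date} when running the main experiment",  # placeholder, fill in
    original_plan="n=500",
    actual_result="n=300, terminated due to API budget",
    direction="{conservative or liberal}",                 # placeholder, fill in
    action_taken="reported n=300 with cumulative-effect plot",
    disclosure="paper §X.Y",
).to_markdown()
print(entry)
```

Since every field is a required constructor argument, leaving one out raises a `TypeError` at write time instead of producing a silently incomplete entry.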

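7.3 How reviewers react to DEVIATIONS

What you write	Reviewer reaction
“Hit X, ran to completion as preregistered, conservative direction”	✅ Counts in your favor
“Hit X, adjusted the plan and reran, still judged by the preregistered thresholds”	⚠️ May ask you to preregister again
“Hit X, did not disclose, used number Y”	❌ Rejection + academic-integrity investigation

🌟 Hitting a failure mode is nothing to be ashamed of; hiding it is.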

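✅ Self-check list

Naming the 5 modes: can accurately name mock-vs-prod / optional stopping / reversal / hidden oracle / cherry-pick
Real cases: can give one real case (your own or someone else's) for each failure mode
Prevention strategies: can give one concrete prevention strategy for each failure mode
Honest disclosure: when you hit a reversal or a hidden oracle yourself, you know how to write it in the paper without deleting data
Reviewer's perspective: can locate the 5 modes above when reviewing a negative-result paper
12-item self-check: can go through the §6 self-check table before submission
DEVIATIONS template: can write a compliant DEVIATIONS entry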

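📚 References

Introductory concepts

Replication Crisis (psychology, 2010s): where the 5 failure modes were first discussed systematically
Chapter 4 of this module: the preregistration three-piece toolkit for preventing Mode-2 / Mode-3

Key papers

Module author's paper, §4.7 “P2 ablation reversal”: a real sample of honest disclosure for a Mode-3 reversal
Module author's paper, DEVIATIONS.md: a complete deviation-record template
The Garden of Forking Paths (Gelman & Loken, 2013): the classic discussion of researcher degrees of freedom
A Manifesto for Reproducible Science (Munafò et al., 2017): the field consensus on preregistration and transparent disclosure

Industry discussion

OpenReview review discussions on “reproducibility”: a growing focus of NLP / ML reviewing over the past 2 years
OSF (Open Science Framework) case library: real preregistration samples from psychology / economics

Framework documentation (where applicable)

scipy.stats: used together with the statistical three-piece toolkit in §4
Module author's paper, figures/P5_cumulative.pdf: a visual template for Mode-2 defense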
