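Agent Memory: Empirical Auditing and Negative-Result Methodology, May 11, 2026

Chapter 5: Positive Control Design, the Negative-Result Paper's Line of Defense ⭐

Why a negative-result paper must include a positive control; three typical designs (gold-answer injection, extractive verbatim cache, oracle evidence); a complete C+oracle code template; how to report a result such as +20 pp, McNemar p=1.1e-5 in the paper; the value and limits of a positive control from the reviewer's side.

Keywords: positive control, negative results, gold answer, oracle, ARR, EMNLP

The most common reason a negative-result paper gets rejected is not “the results are wrong” but “maybe your pipeline never worked in the first place.” That doubt is hard to rebut: what if your retriever is too weak, your judge too strict, or your prompt poorly tuned, so that no method could have shown a positive effect? A positive control is evidence that the same pipeline does produce a clear positive effect, and it closes off that escape route. This chapter covers the three most common positive control designs in depth: which one fits which situation, how to implement one in about 50 lines of code, and how to write +20 pp into the abstract. In the module author's experience, this is the key variable that moves a negative-result paper from 6/10 to a 7/10 Accept.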

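1. Why a negative-result paper without a positive control does not get published

1.1 The reviewer's actual mental model

Positive-result paper: the reviewer assumes the authors cheated (cherry-picking, weakened baselines) and spends effort looking for evidence of cheating.

Negative-result paper: the reviewer assumes the authors are incompetent (the pipeline is too weak) and spends effort looking for evidence of incompetence.

🍎 Analogy: you say “we tried three new coffee beans and they all tasted like plain water.” The judge's first thought is “is something wrong with your tongue?”, not “the beans really are no different.” You first have to present a cup of coffee that definitely has flavor (the positive control) and show that your tongue works; only then is the claim that the three new beans taste the same credible.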

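1.2 The fate of a negative-result paper with no positive control

Reviewer 1:
  "The results are plausible, but the strength of the pipeline is not established. Major Revision;
   a direct comparison with SOTA memory systems is required."

Reviewer 2:
  "The authors acknowledge that their LoCoMo F1 of 24% is far below the 81.95% reported by Memori,
   which means the pipeline is weaker by at least 50 pp. The conclusion that
   'the caching mechanism is ineffective', drawn from such a weak pipeline, is unconvincing. I recommend rejection."

Meta-reviewer:
  "Reject. A negative-result paper needs a positive control to demonstrate test-bed sensitivity;
   the authors provide none."

🌟 Real data: the module author's paper moved from 6.5/10 to a 7.0/10 Accept, and the single most important change was adding C+oracle. All other changes combined (abstract rewrite, scope tightening, limitation updates) moved the score less than this one item.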

1.3 Definition of a positive control

A positive control is an intervention that, on the same pipeline / same backbone / same evaluation protocol, provably produces a substantively positive effect when fed an input that obviously should work.

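Key points:

Same pipeline, same backbone, same protocol: changing any of them invalidates the comparison
Positive by design: the goal is not to discover anything but to show that the pipeline can detect an effect
The effect must be large: at least 5× the SESOI (smallest effect size of interest), so the reviewer has no room to ask whether it is just noise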
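The size criterion above can be turned into a small check. This is a minimal sketch, not part of the author's code: the function name `control_is_convincing`, the SESOI value used in the example, and the extra requirement that the confidence interval exclude zero are all assumptions made for illustration.

```python
def control_is_convincing(delta_pp: float, ci_low_pp: float,
                          sesoi_pp: float, multiple: float = 5.0) -> bool:
    """Return True when the control effect clears the rule of thumb.

    delta_pp  : observed paired accuracy difference, in percentage points
    ci_low_pp : lower bound of its 95% confidence interval
    sesoi_pp  : smallest effect size of interest (assumed, set per study)
    multiple  : required ratio of delta to SESOI (5x per the rule above)
    """
    if sesoi_pp <= 0:
        raise ValueError("sesoi_pp must be positive")
    large_enough = delta_pp >= multiple * sesoi_pp
    # Assumption: also require the CI to exclude zero.
    clearly_positive = ci_low_pp > 0
    return large_enough and clearly_positive


# Example with the numbers reported in this chapter and a hypothetical SESOI
print(control_is_convincing(delta_pp=20.0, ci_low_pp=11.0, sesoi_pp=3.0))
```

With Δ = +20 pp and a CI lower bound of +11 pp, a hypothetical SESOI of 3 pp passes (20 ≥ 15), while an SESOI of 5 pp would not (20 < 25).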

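2. Three typical positive control designs

The designs differ in how artificial the injected content is, starting with the most artificial:

2.1 Type-1: Gold-answer injection (C+oracle) ⭐

Method: at query time, insert the benchmark's gold answer text into the retrieval context as a high-priority chunk labeled [ORACLE FACT].

Expected effect: accuracy should approach 100% if the LLM can read and use the gold answer, and should be at least 15 pp above the zero-build baseline.

Use cases:

Testing whether the cache delivery channel works
Testing whether the judge recognizes correct answers
Testing whether the prompt template lets the LLM use the retrieval context

Pros: simplest to implement (the core class is under 30 lines of code), strongest effect, cleanest comparison.

Cons: the injected content is not something the system itself can produce. It only shows that the pipeline can deliver gold-quality content if the system produces it, not that the system can actually produce gold-quality content.

Module author's data: LoCoMo, deepseek-v4-pro, n=100 → Δ = +20.00 pp, McNemar p=1.1e-5, 95% CI [+11, +29].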

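2.2 Type-2: Extractive verbatim cache (C+verbatim)

Method: store the retrieved chunks verbatim in the cache instead of LLM-extracted artifacts. Later queries consult the cache first.

Expected effect: because no information is lost, accuracy should at least match the zero-build baseline and should exceed the LLM-extracted artifact condition.

Use cases:

Testing whether the LLM extraction step is what causes the information loss
Turning the information-density hypothesis into a falsifiable claim

Pros: more “realistic” than the oracle, because this is something the system can actually do; it separates “extraction loss” from “mechanism loss”.

Cons: moderate implementation complexity; cache size has to be managed (verbatim text takes more space); the effect is unlikely to be as large as +20 pp (typically +3 to +8 pp).

Best for: papers whose core claim is that the LLM extraction step is the bottleneck. The verbatim control tests that claim directly.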
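The chapter gives no code for the Type-2 control, so the following is a minimal sketch under stated assumptions. `VerbatimCache`, `build_context`, the chunk shape (`chunk_id`, `content`), the word-overlap ranking, and the oldest-first eviction are all hypothetical choices for illustration; a real pipeline would reuse its own retriever scoring and cache policy.

```python
from typing import Callable, Dict, List

Chunk = Dict[str, str]  # assumed shape: {"chunk_id": ..., "content": ...}


class VerbatimCache:
    """Stores retrieved chunks unchanged, keyed by chunk_id, in insertion order."""

    def __init__(self, max_chunks: int = 256) -> None:
        self.max_chunks = max_chunks
        self._store: Dict[str, Chunk] = {}

    def add(self, chunks: List[Chunk]) -> None:
        for ch in chunks:
            # Re-inserting moves the chunk to the "most recent" end.
            self._store.pop(ch["chunk_id"], None)
            self._store[ch["chunk_id"]] = ch
        # Evict the oldest entries once the size cap is exceeded.
        while len(self._store) > self.max_chunks:
            del self._store[next(iter(self._store))]

    def lookup(self, question: str, top_k: int) -> List[Chunk]:
        """Rank cached chunks by naive word overlap with the question."""
        q_words = set(question.lower().split())
        scored = [(len(q_words & set(ch["content"].lower().split())), ch)
                  for ch in self._store.values()]
        scored = [(s, ch) for s, ch in scored if s > 0]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [ch for _, ch in scored[:top_k]]


def build_context(question: str, cache: VerbatimCache,
                  retrieve: Callable[[str, int], List[Chunk]],
                  top_k: int = 5) -> List[Chunk]:
    """Cache hits first, then fresh retrieval, de-duplicated by chunk_id."""
    cached = cache.lookup(question, top_k)
    fresh = retrieve(question, top_k)
    cache.add(fresh)  # store the raw text; no LLM extraction step
    seen, ctx = set(), []
    for ch in cached + fresh:
        if ch["chunk_id"] not in seen:
            seen.add(ch["chunk_id"])
            ctx.append(ch)
    return ctx[:top_k]
```

The point of the design is that the only difference from the LLM-extracted condition is what gets stored; the delivery path stays the same, so any accuracy gap can be attributed to the extraction step.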

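2.3 Type-3: Oracle evidence (C+evidence)

Method: benchmarks usually annotate each QA pair with evidence spans (the turns or sessions that contain the answer). Use those evidence spans directly as the retrieval result and skip the retriever.

Expected effect: accuracy equals the theoretical upper bound for the retriever. If your retriever is already close to that bound, retrieval is not the bottleneck; otherwise, improve the retriever first.

Use cases:

Auditing retrieval strength (how far your hybrid+rerank setup is from the oracle)
Ruling out retrieval entirely as a confounder

Pros: it reveals the retriever's upper bound rather than the cache's, and gives independent verification of the “retrieval explains 20 pp” prior.

Cons: the benchmark must provide evidence spans (LongMemEval does, LoCoMo does in part, MEMTRACK does not), and the implementation has to parse dataset metadata.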
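A minimal sketch of the Type-3 selection step follows. It assumes each chunk carries the `source_session_id` field that appears in the C+oracle template and that the benchmark's evidence annotation can be reduced to a list of session ids; both the function names and that reduction are assumptions, since real datasets encode evidence differently.

```python
from typing import Dict, Iterable, List

Chunk = Dict[str, str]


def oracle_evidence_context(chunks: List[Chunk],
                            evidence_ids: Iterable[str],
                            id_field: str = "source_session_id") -> List[Chunk]:
    """Return only the chunks listed in the gold evidence, bypassing the retriever."""
    wanted = set(evidence_ids)
    return [ch for ch in chunks if ch.get(id_field) in wanted]


def retrieval_gap_pp(acc_retriever: float, acc_oracle_evidence: float) -> float:
    """Headroom, in percentage points, between the real retriever and the evidence oracle."""
    return (acc_oracle_evidence - acc_retriever) * 100.0


# Illustrative data only
chunks = [
    {"chunk_id": "c1", "source_session_id": "s1", "content": "..."},
    {"chunk_id": "c2", "source_session_id": "s2", "content": "..."},
    {"chunk_id": "c3", "source_session_id": "s3", "content": "..."},
]
ctx = oracle_evidence_context(chunks, evidence_ids=["s2"])
print([ch["chunk_id"] for ch in ctx])
```

A small gap from `retrieval_gap_pp` suggests retrieval is not the bottleneck; a large one suggests fixing the retriever before drawing conclusions about the cache.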

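2.4 Decision table for the three designs

What you want to show	Which control to use
The pipeline, judge, and prompt template work	Type-1 oracle injection (simplest)
The LLM extraction step is the bottleneck	Type-2 verbatim cache
The retriever is not the bottleneck	Type-3 evidence span
All of the above	All three controls (the most rigorous option; recommended, but a lot of work)

🌟 Recommended starting point: do Type-1 first, because it is the simplest and does the most to answer reviewers. If you want more, add Type-3 and then Type-2, in order of paper benefit per unit of work.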

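3. Complete C+oracle code template

The template below is the roughly 50-line implementation used in the module author's paper, released as open source in motivation/locomo_oracle_control.py.

3.1 Subclassing the existing C1 baseline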

"""LoCoMo Positive Control (C+oracle)"""
import sys, json, time
sys.path.insert(0, "PATH_TO_motivation/")
from m2_trigger_auc import (
    TOP_K, BACKENDS, HybridRetriever, LLMClient, generate_answer,
    EMBEDDING_MODEL,
)
from persistence_ablation import C1_Raw
from locomo_ablation import load_locomo, chunk_locomo_conversation, judge_locomo


class CPlus_OracleArtifact(C1_Raw):
    """Inject the gold answer as an [ORACLE FACT] chunk, prepended to the C1 raw retrieval."""
    name = "C+oracle"

    def answer_with_oracle(self, question, gold):
        # 1. Same as C1: top-K hybrid retrieval
        top_idx, _, _, _ = self.retriever.retrieve(question, top_k=TOP_K)
        retrieved = [self.retriever.chunks[i] for i in top_idx]
        # 2. Prepend the gold answer as a high-priority chunk
        ctx = [{"content": f"[ORACLE FACT] {gold}",
                "chunk_id": "oracle",
                "source_session_id": "oracle"}] + retrieved
        # 3. Generate with the augmented context and count the call
        ans = generate_answer(self.llm, ctx, question)
        self.query_calls += 1
        return ans, {"oracle": True, "retrieved_k": len(retrieved)}

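⭕ Key design choice: the [ORACLE FACT] marker gives the LLM a label in the generation prompt that is visually distinct from the retrieved chunks. If you concatenate the bare text instead, the effect is likely to shrink, because the LLM cannot tell that this chunk is the gold answer.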
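To make the role of the marker concrete, here is a hedged sketch of how a context list might be rendered into a prompt. The source does not show the internals of `generate_answer`, so `render_context`, `build_prompt`, and `oracle_reaches_prompt` are hypothetical stand-ins, and the prompt wording is an assumption.

```python
from typing import Dict, List

ORACLE_TAG = "[ORACLE FACT]"  # label used by the C+oracle class


def render_context(ctx: List[Dict[str, str]]) -> str:
    """Number each chunk and keep its content, including any label, verbatim."""
    return "\n".join(f"({i}) {ch['content']}" for i, ch in enumerate(ctx, start=1))


def build_prompt(ctx: List[Dict[str, str]], question: str) -> str:
    """Hypothetical prompt template; the real generate_answer may differ."""
    return ("Answer the question using only the context below.\n\n"
            f"Context:\n{render_context(ctx)}\n\n"
            f"Question: {question}\nAnswer:")


def oracle_reaches_prompt(prompt: str) -> bool:
    """Sanity check: the oracle label must survive into the final prompt."""
    return ORACLE_TAG in prompt


# Illustrative data only
ctx = [{"content": f"{ORACLE_TAG} May 2023", "chunk_id": "oracle"},
       {"content": "Alice: I adopted a dog last spring.", "chunk_id": "c17"}]
prompt = build_prompt(ctx, "When did Alice adopt her dog?")
print(prompt)
```

A check like `oracle_reaches_prompt` is worth running once on a logged prompt: if truncation or templating strips the label, the control silently degrades into an unlabeled extra chunk.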

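3.2 Main loop

import argparse  # needed only for the entry point at the bottom


def main(args):
    llm = LLMClient(args.backend)
    retriever = HybridRetriever(EMBEDDING_MODEL)
    cond = CPlus_OracleArtifact(llm, retriever)

    convs = load_locomo(args.n_conv, args.max_qa_per_conv)
    results, n_correct, n_total, n_failed = [], 0, 0, 0
    for ci, conv in enumerate(convs):
        chunks = chunk_locomo_conversation(conv)
        cond.reset_for_new_conversation()
        cond.ingest_chunks(chunks)
        for qi, qa in enumerate(conv.get("qa", [])):
            q, gold = qa.get("question", ""), qa.get("answer", "")
            # Skip items with no question or no gold answer (numeric golds such as 0 are kept)
            if not q or gold is None or gold == "":
                continue
            try:
                pred, _ = cond.answer_with_oracle(q, str(gold))
                correct = judge_locomo(llm, q, gold, pred)
            except Exception as e:
                # Report failures instead of dropping them silently: skipped items shrink n
                n_failed += 1
                print(f"[skip] conv={ci} qa={qi}: {e!r}", file=sys.stderr)
                continue
            n_correct += int(correct); n_total += 1
            results.append({
                "conv_idx": ci, "qa_idx": qi,
                "question": q[:120], "gold": str(gold)[:120],
                "pred": pred[:120], "correct": int(correct),
            })

    with open(args.output, "w") as f:
        json.dump({"condition": "C+oracle", "n_total": n_total, "n_failed": n_failed,
                   "accuracy": n_correct / max(n_total, 1), "results": results},
                  f, indent=2)


if __name__ == "__main__":
    # Argument names match those read in main(); the default values are assumptions
    parser = argparse.ArgumentParser(description="LoCoMo positive control (C+oracle)")
    parser.add_argument("--backend", required=True)
    parser.add_argument("--n_conv", type=int, default=1)
    parser.add_argument("--max_qa_per_conv", type=int, default=100)
    parser.add_argument("--output", default="locomo_oracle_1conv.json")
    main(parser.parse_args())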

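3.3 Paired analysis (McNemar + bootstrap CI)

After the C+oracle run, pair its results with the C1 baseline. The script keys on qa_idx alone because both files cover the same single conversation; with several conversations, key on (conv_idx, qa_idx) instead: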

import json

import numpy as np
from scipy.stats import binomtest

with open("locomo_oracle_1conv.json") as f:
    oracle = json.load(f)["results"]
with open("locomo_C1_sanity.json") as f:
    c1 = json.load(f)["per_conversation"][0]["results"]

# Pair the two runs on the questions that both conditions answered
oracle_map = {r["qa_idx"]: r["correct"] for r in oracle}
c1_map     = {r["qa_idx"]: r["correct"] for r in c1}
common = sorted(set(oracle_map) & set(c1_map))

# 1. Paired Δ
o_acc = sum(oracle_map[i] for i in common) / len(common)
c_acc = sum(c1_map[i] for i in common)     / len(common)
print(f"Δ = {(o_acc-c_acc)*100:+.2f} pp on n={len(common)}")

# 2. Exact McNemar test: a two-sided binomial test on the discordant pairs
n01 = sum(1 for i in common if c1_map[i]==1 and oracle_map[i]==0)
n10 = sum(1 for i in common if c1_map[i]==0 and oracle_map[i]==1)
n_disc = n01 + n10
# Guard the no-discordant-pairs case, where the test is undefined
p = (binomtest(min(n01, n10), n=n_disc, p=0.5, alternative="two-sided").pvalue
     if n_disc > 0 else 1.0)
print(f"McNemar: n10={n10} (oracle wins) n01={n01} (c1 wins) p={p:.2e}")

# 3. Paired bootstrap 95% CI (percentile method, 2000 resamples)
rng = np.random.default_rng(42)
arr = np.array([oracle_map[i] - c1_map[i] for i in common])
boot = [arr[rng.choice(len(arr), len(arr), replace=True)].mean()*100
        for _ in range(2000)]
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"95% CI: [{lo:+.2f}, {hi:+.2f}] pp")

🌟 Actual output (module author's data):

Δ = +20.00 pp on n=100
McNemar: n10=21 (oracle wins) n01=1 (c1 wins) p=1.10e-05
95% CI: [+11.00, +29.00] pp
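The reported p-value can be checked by hand from the two discordant counts alone. The sketch below recomputes the exact McNemar p-value with the standard library; `exact_mcnemar_p` is a name introduced here for illustration. Because the null proportion is 0.5, the binomial distribution is symmetric, so doubling the smaller tail agrees with the two-sided binomial test used in the script.

```python
from math import comb


def exact_mcnemar_p(n10: int, n01: int) -> float:
    """Two-sided exact McNemar p-value from the discordant-pair counts."""
    n = n10 + n01
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of a difference
    k = min(n10, n01)
    # P(X <= k) for X ~ Binomial(n, 0.5), then doubled for the two-sided test
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


n10, n01, n_pairs = 21, 1, 100
p = exact_mcnemar_p(n10, n01)
delta_pp = 100 * (n10 - n01) / n_pairs
print(f"p = {p:.2e}, delta = {delta_pp:+.2f} pp")
# prints: p = 1.10e-05, delta = +20.00 pp
```

With 22 discordant pairs, the tail is (1 + 22) / 2^22, so p = 46 / 4194304 ≈ 1.10e-5, and Δ = (21 − 1) / 100 = +20 pp, matching the output above.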

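4. Reporting format: how to write +20 pp into the paper

A positive control should be reported in four places: abstract, introduction, experiments (its own subsection), and conclusion. A control that is not reported in all four delivers only a fraction of its value (about 30%, by the module author's rough estimate).

4.1 Abstract (one sentence, about 35 words)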

“A positive control that injects gold-answer text through the same cache channel lifts accuracy by $+20.00$ pp (McNemar $p = 1.1 \times 10^{-5}$, $n=100$), confirming the cache delivery mechanism works when content quality is high.”

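Key points:

Place it right after the “H1 fails” statement; do not give it its own paragraph
Use only one effect size (Δ), one p-value, and one n
Use conditional wording such as “confirming … when content quality is high”; do not write “validates our central claim”

4.2 Introduction (half a paragraph, placed before the contributions)

Present the positive control as the fourth of the paper's five pieces of evidence: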

“We additionally provide a positive control that injects benchmark-provided gold answers into the same cache channel; the resulting $+20$ pp lift establishes that the test bed can detect a positive cache effect when one exists, removing the most natural alternative explanation for the H1 null.”

4.3 Experiments §X.Y (its own subsection, about 200 words plus one paragraph of statistics)

Modeled on §4.8 of the module author's paper:

§4.8 Positive control: cache mechanism works when content is good

A natural concern with a negative-result paper is whether the test bed itself can detect any positive cache effect. We add a positive control (C+oracle) on the same LoCoMo pipeline (deepseek-v4-pro backbone): at query time, alongside the C1 raw retrieval, we inject the gold answer text as a synthetic high-priority artifact (labeled [ORACLE FACT]). This isolates the cache-delivery mechanism (artifact channel) from cache-content quality. On the first 100 paired QA from one conversation, $\Delta(\text{C+oracle} - \text{C1}) = +20.00$ pp (95% bootstrap CI $[+11.00, +29.00]$, McNemar $p = 1.1 \times 10^{-5}$, $n_{10}=21$, $n_{01}=1$). The artifact slot, when fed high-quality content, lifts accuracy decisively.
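Since the same statistics string appears in several places in the paper, it can help to generate it from the analysis output rather than retype it. The following is a small sketch; `latex_sci` and `control_report` are hypothetical helper names, and escaping the percent sign as `\%` is an assumption about the target LaTeX source.

```python
def latex_sci(p: float, digits: int = 1) -> str:
    """Format p in LaTeX scientific notation, e.g. 1.1 \\times 10^{-5}."""
    mantissa, exponent = f"{p:.{digits}e}".split("e")
    return f"{mantissa} \\times 10^{{{int(exponent)}}}"


def control_report(delta_pp: float, ci_lo: float, ci_hi: float,
                   p: float, n10: int, n01: int) -> str:
    """Build the statistics string used in the experiments subsection."""
    return (f"$\\Delta = {delta_pp:+.2f}$ pp "
            f"(95\\% bootstrap CI $[{ci_lo:+.2f}, {ci_hi:+.2f}]$, "
            f"McNemar $p = {latex_sci(p)}$, "
            f"$n_{{10}}={n10}$, $n_{{01}}={n01}$)")


# Numbers taken from the output reported in this chapter
print(control_report(20.0, 11.0, 29.0, 1.0967e-5, n10=21, n01=1))
```

Generating the string in one place keeps the abstract, experiments, and conclusion numerically consistent when the analysis is rerun.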

4.4 Conclusion (one closing sentence)

“…a positive control injecting gold-answer text through the same cache channel lifts accuracy by $+20.00$ pp, establishing that the artifact slot is sensitive when content quality is high.”

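🌟 Reporting format in one sentence: the point of a positive control is not to “show that we ran an experiment” but to make sure the reviewer sees this evidence immediately wherever they are reading, whether abstract, intro, experiments, or conclusion. Write it redundantly in all four places; each omission weakens it.

5. Limits of a positive control and how to write about them honestly

⭕ Do not draw conclusions on the reviewer's behalf: if the positive control cannot establish something, do not claim it.

5.1 What it can establish

The test bed can detect a positive effect (the core point)
The prompt template lets the LLM use gold-answer information in the retrieval context
The judge recognizes correct answers

5.2 What it cannot establish

❌ That the system can produce gold-quality content
❌ That artifact content is the only bottleneck (there may be other confounders)
❌ That the effect holds on every backbone (you tested only one)
❌ That the effect holds on the main H1b backbone (if the control was run on a secondary backbone)

5.3 Honest wording (closing of §4.8 in the module author's paper)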

“Combined with the H1a/H1b null on LLM-synthesized artifacts, this is consistent with artifact-content quality being a---and likely the---binding constraint within the tested family, though we caution this single-conversation control does not exhaustively isolate the full persistence/retrieval path or rule out interactions with cache-mechanism design.”

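Key points:

“consistent with” rather than “proves”
“within the tested family” limits the scope
“we caution” volunteers the limitation

6. Relationship to ablation: complementary, not a substitute

Positive controls and ablations are often conflated. They serve entirely different functions:

Dimension	Ablation	Positive Control
Purpose	Decompose component contributions	Show the pipeline is sensitive
Direction	Subtractive (remove X, observe the effect)	Additive (inject X, observe the effect)
Expected effect	May be positive or negative	Must be positive and large
What the reviewer checks	“Is the component design sound?”	“Are the results credible?”
Location in paper	§experiments main table	§experiments separate subsection

🧠 Core insight: an ablation is “looking at your own muscles”; a positive control is “proving you have a pulse”. For a positive-result paper, ablations are enough; a negative-result paper needs both.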

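✅ Self-check list

Necessity: can explain why a negative-result paper needs a positive control while a positive-result paper usually does not
Three designs: can tell apart the use cases for gold-answer injection, extractive verbatim cache, and oracle evidence
Implementation: can write a C+oracle control for your own pipeline within one hour, starting from the §3 template
Statistics trio: can compute the paired Δ, run McNemar with scipy.stats.binomtest, and compute a paired bootstrap CI with numpy
Four reporting sites: can write the positive control result into the abstract, intro, experiments, and conclusion
Limits: can list the four things a positive control cannot establish
Relationship to ablation: can state the functional difference between the two in one sentence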

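📚 References

Concept introduction

Positive vs Negative Controls in Experimental Design (biomedical fundamentals): the concept is only beginning to spread in NLP, but biomedicine has practiced it for decades

Key papers

Module author's paper, §4.8 “Positive control: cache mechanism works when content is good”: a complete reporting sample with the +20 pp Δ and McNemar p=1.1e-5
Diagnosing Retrieval vs Utilization (Yuan et al., 2026), arXiv 2603.02473: provides the “retrieval explains 20 pp” prior and is the theoretical support for the Type-3 evidence control

Community discussion

OpenReview reviewer concerns (search for how often “positive control” appears in NLP / ML paper reviews): per the module author, mentions have risen sharply since 2024 and reflect a current reviewer consensus
EMNLP / ACL reproducibility checklist: it does not explicitly require a positive control, but “test for sensitivity of method to alternative configurations” is effectively equivalent

Framework documentation (where applicable)

scipy.stats.binomtest, official documentation: the exact binomial test, applied here to the discordant pairs to carry out the exact McNemar paired significance test
Module open-source code, motivation/locomo_oracle_control.py: can be copied and renamed to run on your own pipeline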
