Agent Memory Empirical Auditing and Negative-Result Methodology: A Learning Route

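Between 2024 and 2026 the Agent Memory field produced 30+ systems that each claim SOTA: MemGPT, A-Mem, Mem0, MemoryOS, LightMem, EMem, Memori, LiCoMemory, Nemori, SimpleMem, D-MEM, and more. Every paper reports "acc +X pp on LongMemEval" or "F1 +Y pp on LoCoMo", yet the absolute score on the same LongMemEval-S split can differ by as much as 30 percentage points from one paper to the next. Readers are lost, reviewers are lost, and the engineers who have to ship this in industry are the most lost of all: whose method actually works, and whose numbers were assembled from cherry-picking, suppressed strong baselines, and hidden oracle routing?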

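This route does not teach you to "build a new Memory system"; that is the job of Module 5, Agent Memory. It teaches you how to take someone else's Memory system, demystify its design choices, construction triggers, and evaluation protocol, and then write your own empirical Memory paper that can withstand rigorous review.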

The author wrote this series from the full experience of producing a preregistered Agent Memory negative-result paper (11-system atlas + 4 LLM backbones + C+oracle positive control).

🌟 Overview: Why the Memory Field Urgently Needs an Audit Methodology

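After Module 5 (Agent Memory) you already know what Memory is, which kinds exist, and how to build one with mainstream frameworks. But when you open arXiv and a new 2026 paper says "our system reaches 82% accuracy on LongMemEval-S, 38 pp above the baseline", can you answer the following: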

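Was its baseline deliberately tuned to be weak?
Is its "improvement" still significant under a paired McNemar test?
Is its build trigger input-driven or output-driven, and how many cells of this design space remain unexplored?
Does it disclose oracle routing, or was question_type quietly passed to the system?
Does it include a positive control showing that its gain is not simply a retrieval boost?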

🌟 In one sentence: only someone who can ask the 5 questions above, and has the tools to answer them, has truly entered empirical Agent Memory research.

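🍎 Analogy: Module 5 teaches you to cook; this module teaches you to be a food critic. Not the all-talk kind, but the kind who brings a scale, a thermometer, and a double-blind control group.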
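To make the paired McNemar question concrete, here is a minimal sketch of the exact two-sided McNemar test on per-question correctness flags. It is an illustration, not the author's code: the function name `mcnemar_exact` and the toy data are hypothetical, and it uses only the standard library. The same p-value should be obtainable from `scipy.stats.binomtest(min(b, c), b + c, 0.5)`.

```python
from math import comb

def mcnemar_exact(baseline_correct, system_correct):
    """Exact two-sided McNemar test on paired binary outcomes.

    Returns (b, c, p_value), where b counts questions only the baseline
    answered correctly and c counts questions only the system answered correctly.
    """
    if len(baseline_correct) != len(system_correct):
        raise ValueError("paired lists must have the same length")
    pairs = list(zip(baseline_correct, system_correct))
    b = sum(1 for base, sys_ in pairs if base and not sys_)
    c = sum(1 for base, sys_ in pairs if not base and sys_)
    n = b + c
    if n == 0:
        return b, c, 1.0  # no discordant pairs: no evidence either way
    k = min(b, c)
    # Under H0 each discordant pair is a fair coin flip, so the smaller
    # count follows Binomial(n, 0.5); double the tail for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)

# Hypothetical toy data: 20 questions, same questions for both systems.
baseline = [1] * 10 + [0] * 10
system = [1] * 9 + [0] * 1 + [1] * 8 + [0] * 2
print(mcnemar_exact(baseline, system))  # (1, 8, 0.0390625)
```

Only the discordant pairs carry information here, which is why an unpaired comparison of two accuracy numbers can look significant (or not) for reasons unrelated to the method.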

🧠 Three Core Insights of This Route

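Insight	In one sentence	Chapter that covers it
Insight 1	Every Agent Memory "innovation" can be placed in a finite grid spanned by two axes, `write-trigger × read-behavior`	Chapter 2
Insight 2	A trigger primitive is judged not by how fancy it is but by whether it yields a significant paired binary acc gain of ≥ +2 pp in your pipeline	Chapter 3 + Chapter 4
Insight 3	For a negative-result paper to get published, a positive control matters ten times more than ablations: it directly closes off the objection that "your pipeline was broken to begin with"	Chapter 5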
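Insight 1 can be pictured as a small lookup over a two-axis grid. The sketch below is only an illustration under stated assumptions: using the six trigger primitives from the chapter guide as the write-trigger axis is an assumption, the read-behavior values are invented placeholders, and `system_a` through `system_c` are hypothetical systems, not placements of any real paper.

```python
from itertools import product

# Assumption: the six trigger primitives double as write-trigger axis values.
WRITE_TRIGGERS = ["input-driven", "output-driven", "failure-driven",
                  "scheduled", "hybrid", "oracle"]
# Hypothetical read-behavior values, chosen only for illustration.
READ_BEHAVIORS = ["always-retrieve", "gated-retrieve"]

def empty_cells(placements, write_axis=WRITE_TRIGGERS, read_axis=READ_BEHAVIORS):
    """Return the (write-trigger, read-behavior) cells no system occupies."""
    grid = list(product(write_axis, read_axis))
    occupied = set(placements.values())
    unknown = occupied - set(grid)
    if unknown:
        raise ValueError(f"placements outside the grid: {sorted(unknown)}")
    return [cell for cell in grid if cell not in occupied]

# Hypothetical placements; real ones would come from reading each system's code.
placements = {
    "system_a": ("input-driven", "always-retrieve"),
    "system_b": ("input-driven", "always-retrieve"),
    "system_c": ("scheduled", "gated-retrieve"),
}
for cell in empty_cells(placements):
    print(cell)
```

Several systems landing in the same cell is the interesting case: it suggests their claimed novelty lies elsewhere than in the trigger or read design.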

📖 Chapter Guide

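Chapter	Topic	Core question	Key tools / papers
1	Why Memory systems need an "audit"	What causes a 30 pp gap on the same benchmark	Cross-paper comparison on LongMemEval / LoCoMo, Pareto plots
2 ⭐	Three years of Agent Memory evolution and the 11-system atlas	Three generations + 12 milestones + code-level reading of 11 systems + 5 genuinely open problems + the evolution of methodology	Mem0, A-Mem, MemoryOS, LightMem, EMem, Memori, LiCoMemory, Nemori, SimpleMem, D-MEM, Selective Memory
3	The Construction Trigger Primitive landscape	Falsifiable hypotheses and matching experimental designs for 6 classes of trigger primitive	input-driven, output-driven, failure-driven, scheduled, hybrid, oracle
4	Preregistration + paired evaluation: how to avoid cherry-picking	TOST equivalence, paired McNemar, conversation-clustered bootstrap, cumulative-effect plot	EXPERIMENTS_PREREG.md template, scipy.stats
5 ⭐	Positive control design	The line of defense against rebuttal in a negative-result paper: gold-answer injection, extractive verbatim, oracle evidence	C+oracle template code
6	Backbone robustness and cross-family validation	An empirical standard for a directionally consistent null across 4 LLM families (Anthropic / OpenAI / DeepSeek / Qwen)	OpenAI-compatible proxy, USTC LLM relay
7	Failure modes of negative-result papers	mock-vs-prod gap, optional stopping, reversal under scaling, hidden oracle	DEVIATIONS.md template
8	End-to-end practice: reproducing an Agent Memory negative-result audit	A one-command script that runs the atlas + 4-backbone H1a + LoCoMo H1b + C+oracle	locomo_oracle_control.py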
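As an illustration of the conversation-clustered bootstrap named in Chapter 4, the sketch below resamples whole conversations (not individual questions) to get a percentile 95% CI for the paired accuracy delta. This matters on LoCoMo-style data, where many questions share one conversation and are therefore not independent. The function name, record layout, and defaults are assumptions made for this example, not the author's template.

```python
import random

def cluster_bootstrap_delta(records, n_boot=2000, seed=0):
    """Percentile 95% CI for the paired accuracy delta (system - baseline).

    records: list of (conversation_id, baseline_correct, system_correct).
    Whole conversations are resampled with replacement, so within-conversation
    correlation is preserved in every bootstrap replicate.
    """
    by_conv = {}
    for conv_id, base_ok, sys_ok in records:
        by_conv.setdefault(conv_id, []).append(int(bool(sys_ok)) - int(bool(base_ok)))
    if not by_conv:
        raise ValueError("no records")
    conv_ids = list(by_conv)
    n_items = sum(len(d) for d in by_conv.values())
    point = sum(sum(d) for d in by_conv.values()) / n_items
    rng = random.Random(seed)  # fixed seed so the audit is reproducible
    boot = []
    for _ in range(n_boot):
        drawn = [by_conv[rng.choice(conv_ids)] for _ in conv_ids]
        boot.append(sum(sum(d) for d in drawn) / sum(len(d) for d in drawn))
    boot.sort()
    lo = boot[int(0.025 * n_boot)]
    hi = boot[min(n_boot - 1, int(0.975 * n_boot))]
    return point, lo, hi

# Hypothetical toy records: two conversations, three questions each.
toy = [("conv1", 1, 1), ("conv1", 0, 1), ("conv1", 1, 0),
       ("conv2", 0, 1), ("conv2", 0, 0), ("conv2", 1, 1)]
point, lo, hi = cluster_bootstrap_delta(toy)
print(f"delta={point:+.3f}, 95% CI=[{lo:+.3f}, {hi:+.3f}]")
```

With only a handful of conversations the interval is coarse, which is itself worth reporting: a 10-conversation benchmark supports far fewer independent draws than its question count suggests.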

🔧 Methodology Quick Reference

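You are handed an Agent Memory paper
  │
  ├─ Chapter 2: locate its cell in the atlas
  │     └→ two axes: write-trigger × read-behavior
  │
  ├─ Chapter 3: identify its trigger primitive
  │     └→ translate the "innovation" into a falsifiable hypothesis H: Δ paired acc ≥ +2 pp, with the 95% CI excluding 0
  │
  ├─ Chapter 4: check statistical discipline
  │     ├─ Paired or unpaired?
  │     ├─ Is the primary endpoint binary acc or F1? Was it preregistered?
  │     ├─ Are TOST / McNemar / cluster bootstrap all reported?
  │     └─ Is there a cumulative-effect plot to guard against optional stopping?
  │
  ├─ Chapter 5: look for a positive control
  │     └→ without one, "can the testbed detect a positive effect at all" remains an open question
  │
  ├─ Chapter 6: is it consistent across backbones?
  │     └→ significant on a single backbone ≠ methodologically significant
  │
  └─ Chapter 7: look for failure modes
        ├─ Is there oracle routing?
        ├─ How far are the absolute benchmark scores from the official ones?
        └─ Did the conclusion ever reverse at small sample sizes?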
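The TOST check in the Chapter 4 branch can be sketched as two one-sided z-tests on the per-question paired differences. This is a simplified illustration with loud assumptions: the ±2 pp equivalence margin merely mirrors the +2 pp threshold in hypothesis H, a normal approximation is used, and questions are treated as independent (so it ignores conversation clustering). The function name `tost_paired` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def tost_paired(diffs, margin=0.02, alpha=0.05):
    """Two one-sided tests for equivalence of paired accuracies.

    diffs: per-question values of system_correct - baseline_correct (-1, 0, or 1).
    Returns (mean_delta, p_value, equivalent), where equivalent is True when
    both one-sided nulls (delta <= -margin, delta >= +margin) are rejected.
    """
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two paired observations")
    d_bar = mean(diffs)
    se = stdev(diffs) / sqrt(n)
    if se == 0:
        # Degenerate case: every difference is identical.
        p = 0.0 if abs(d_bar) < margin else 1.0
        return d_bar, p, p < alpha
    norm = NormalDist()
    p_lower = 1 - norm.cdf((d_bar + margin) / se)  # H0: delta <= -margin
    p_upper = norm.cdf((d_bar - margin) / se)      # H0: delta >= +margin
    p = max(p_lower, p_upper)
    return d_bar, p, p < alpha

# Hypothetical example: 100 questions with a balanced mix of wins and losses.
d_bar, p, equivalent = tost_paired([1, -1] * 50)
print(d_bar, round(p, 3), equivalent)
```

In this toy case the mean delta is exactly zero, yet 100 questions are too few to claim equivalence within ±2 pp. That is the point of TOST: a non-significant difference is not evidence of a null, and a negative-result paper needs enough paired items to close the interval.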

📅 Required Reading Timeline

Date	Work	Contribution to this route
2024.02	LoCoMo	Multi-query, long-conversation memory benchmark; where the H1b cache-hit-rate analysis lives
2024.10	LongMemEval (ICLR'25)	De facto standard for single-query long-term memory evaluation
2025-2026	Mem0 / A-Mem / MemoryOS / LightMem / Memori / LiCoMemory / Nemori / D-MEM / Selective Memory / SimpleMem / EMem	Raw material for the 11-system atlas
2025.07	MemOS (Memory OS for AI)	Parallel taxonomy work on "abstracting memory into a system"
2026.02	Anatomy of Agentic Memory	Empirical analysis of evaluation limitations
2026.03	Diagnosing Retrieval vs Utilization	Shows that retrieval accounts for 20 pp while write strategy accounts for only 3-8 pp

🚪 Newcomer's Guide: Where to Start

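If you are a graduate student, or preparing a Memory paper for EMNLP-ARR / NeurIPS: read in order 1→2→3→4→5→6→7→8, and get the companion code for Chapters 4-5 running on a small dataset first.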

If you are an engineer deploying in industry: skim 1→2→4→6→8. In Chapter 4, read only paired binary acc + McNemar; in Chapter 6, focus on the OpenAI-compatible proxy part.

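If you are a reviewer or part of a reading group: read only 1→3→5→7. These four chapters give you a quick checklist for "spotting in 5 minutes whether a Memory paper's claims are inflated".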

📚 References

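Concept primers

Agent Memory learning route (Module 5 of this repository), Agent Memory学习路线.md: get clear on what Memory is first, then come back to audit it
Anatomy of Agentic Memory (Anonymous, 2026), arXiv 2602.19320: the empirical-analysis survey most complementary to this route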

Key papers

LongMemEval (Wu et al., ICLR 2025), arXiv 2410.10813: 500 questions, 6 question types, single-query; the main battleground for H1a in this route
LoCoMo (Maharana et al., 2024), arXiv 2402.17753: 10 conversations, ~199 QA per conversation, multi-query; the main battleground for H1b in this route
Diagnosing Retrieval vs Utilization (Yuan et al., 2026), arXiv 2603.02473: the key prior that retrieval explains 20 pp while write strategy explains 3-8 pp
MemOS (Li et al., 2025), arXiv 2507.03724: parallel memory taxonomy work

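Industry discussion

Pre-registration in NLP (Annual Meeting workshops): ongoing discussion of why and how to preregister empirical NLP research
OpenReview public review records: look up reviews of LongMemEval / LoCoMo-related papers to see what reviewers actually care about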

Framework documentation (where applicable)

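scipy.stats.binomtest / statsmodels.stats.contingency_tables.mcnemar: official APIs for paired significance testing (SciPy itself has no mcnemar function; an exact McNemar test can be run with binomtest on the discordant pairs)
OpenAI-compatible API (DeepSeek / USTC LLM relay / Volcengine ARK): a low-cost option for cross-backbone experiments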

This route serves readers with serious needs in writing and reviewing Agent Memory papers. If you only want to run a Memory demo, go back to Module 5.
