Chapter 6: Close Reading of 4 Benchmark Papers (LongMemEval / LoCoMo / MemBench / MEMTRACK)

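A common beginner mistake: reading the Mem0 / A-Mem papers ("SOTA on LongMemEval") without ever reading LongMemEval itself, and then being lost when the pitfalls show up: oracle use of question_type, custom judges, imbalanced ability classes. This chapter treats the 4 mainstream memory benchmarks as first-class citizens and reads each in depth: design intent, data construction, ability classes / question types, the official judge implementation, and known limitations. Only once you understand these can you audit whether a G3 paper's "benchmark usage" is sound, and pick the right benchmark combination for your own paper.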

1. LongMemEval (Wu et al., ICLR 2025)

1.1 Basic information

Authors: Wu, Wang, Yu, Zhang, Chang, Yu
Date: 2024.10
arXiv: 2410.10813
Status: accepted at ICLR 2025
GitHub: xiaowu0162/LongMemEval

1.2 Design intent

"Existing long-term memory benchmarks are too few and too uniform; we build a standard benchmark that is single-query, cross-session, and covers 6 ability classes."

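1.3 Data construction

500 questions; LongMemEval-S and LongMemEval-M use the same 500 questions and differ in haystack length (M is much longer)
Each question maps to exactly 1 conversation (this is the key point: single-query)
In LongMemEval-S, each conversation contains ~50-300 turns and a ~115K-token haystack
6 ability classes (each question is labeled with a question_type):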

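Ability	n in -S	Description
single-session-user	~70	The answer is in something the user said (within one session)
single-session-assistant	~56	The answer is in one of the assistant's replies
single-session-preference	~30	User preferences (e.g. "dislikes cilantro")
multi-session	~133	The answer must be synthesized across sessions
temporal-reasoning	~133	Involves reasoning about time
knowledge-update	~78	Facts the user has updated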
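Because the classes are this uneven, a single overall accuracy can hide weak classes. The following is a minimal sketch, not part of the official toolkit, that contrasts micro accuracy (per question) with macro accuracy (per question_type). The counts are the approximate ones from the table above, and the "hypothetical system" is invented purely to show the gap.

```python
from collections import defaultdict

# Approximate per-type question counts in LongMemEval-S, from the table above.
TYPE_COUNTS = {
    "single-session-user": 70,
    "single-session-assistant": 56,
    "single-session-preference": 30,
    "multi-session": 133,
    "temporal-reasoning": 133,
    "knowledge-update": 78,
}


def micro_macro_accuracy(results):
    """results: iterable of (question_type, is_correct) pairs."""
    per_type = defaultdict(list)
    for qtype, correct in results:
        per_type[qtype].append(bool(correct))
    total = sum(len(v) for v in per_type.values())
    # Micro: every question counts equally, so large classes dominate.
    micro = sum(sum(v) for v in per_type.values()) / total
    # Macro: every class counts equally, regardless of its size.
    per_type_acc = {t: sum(v) / len(v) for t, v in per_type.items()}
    macro = sum(per_type_acc.values()) / len(per_type_acc)
    return micro, macro, per_type_acc


# Hypothetical system: perfect on the two largest types, wrong on everything else.
results = []
for qtype, n in TYPE_COUNTS.items():
    strong = qtype in ("multi-session", "temporal-reasoning")
    results.extend((qtype, strong) for _ in range(n))

micro, macro, per_type_acc = micro_macro_accuracy(results)
print(f"micro={micro:.3f} macro={macro:.3f}")  # micro=0.532 macro=0.333
```

Reporting the per-type breakdown next to the overall number is the simplest way to keep a 133-question class from masking a 30-question one.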

1.4 Official judge

Each ability class ships with its own judge prompt:

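# Simplified sketch: JUDGE_TEMPLATES and llm_yes_no are placeholders, not the official API
def judge(question: str, gold: str, pred: str, question_type: str) -> bool:
    # Pick the prompt template written for this ability class
    prompt = JUDGE_TEMPLATES[question_type].format(
        question=question, gold=gold, pred=pred)
    # Normalize the reply so "Yes" or "yes." still counts as a yes
    return llm_yes_no(prompt).strip().lower().startswith("yes")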

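🌟 Key point: each of the 6 abilities has a different judge prompt (adapted to what "correct" means for that class). Many papers substitute a single judge for all classes, which is the source of the "5 pp judge gap, not comparable" problem described in Anatomy of Agentic Memory.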

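1.5 Data characteristics

Single-query → cache reuse is structurally impossible (the root cause of the H1 splitting in this module author's paper)
The 6 abilities are imbalanced (multi-session and temporal-reasoning have 133 each; single-session-preference has only 30)
The question_type label can be misused as an oracle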
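One way to guard against the oracle problem is to strip label fields from each record before it reaches the system under test, and hand them only to the evaluator. This is a minimal sketch under assumptions: `question_type` is the field named above, while `answer`, `question_id`, and the sample record are hypothetical names chosen for illustration, not a confirmed dataset schema.

```python
# Fields the system under test must never see. "question_type" is named in
# the text above; "answer" is an assumed name for the gold-answer field.
ORACLE_FIELDS = {"question_type", "answer"}


def split_record(record):
    """Split one benchmark record into (system_input, eval_only)."""
    system_input = {k: v for k, v in record.items() if k not in ORACLE_FIELDS}
    eval_only = {k: v for k, v in record.items() if k in ORACLE_FIELDS}
    return system_input, eval_only


# Hypothetical record, for illustration only.
record = {
    "question_id": "q-001",
    "question": "Which herb does the user dislike?",
    "question_type": "single-session-preference",
    "answer": "cilantro",
}

system_input, eval_only = split_record(record)
print(sorted(system_input))  # ['question', 'question_id']
print(sorted(eval_only))     # ['answer', 'question_type']
```

When auditing a paper's code, the equivalent check is to trace whether anything in `eval_only` is readable from the retrieval or prompting path, for example to route questions to type-specific prompts.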

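1.6 Known Limitations

The single-query design is unsuitable for testing caching / persistence (→ LoCoMo fills this gap)
The haystack is synthetic (not real conversation history)
The boundaries between the 6 ability classes are subjective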

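1.7 30-second takeaway

LongMemEval-S = 500 questions + 6 abilities + single-query + official judge: the standard of the G3 era, and every G3 paper compares on it. Reading this paper means knowing that the question_type field can be misused as an oracle and that the 6 abilities are imbalanced.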

2. LoCoMo (Maharana et al., 2024)

2.1 Basic information

Authors: Maharana, Lee, Tulyakov, Bansal, Barbieri, Fang
Date: 2024.02
arXiv: 2402.17753
GitHub: snap-research/LoCoMo

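2.2 Design intent

Read alongside LongMemEval, LoCoMo (which was published earlier, in 2024.02) covers the single-query gap: long conversations (~54 sessions per conv) × multiple queries (~199 QA per conv), which makes real cache hits measurable.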

2.3 Data construction

10 conversations (synthetic role-play dialogues)
Per conversation: ~54 sessions × ~18 turns per session
Per conversation: ~199 QA pairs, all targeting the same conversation history
5 categories: single-hop / multi-hop / temporal / adversarial / commonsense

2.4 Official metric

Token-level F1 (SQuAD-style):

F1 = 2 × precision × recall / (precision + recall)
     computed over normalized tokens

normalize: lowercase + strip punctuation + remove stopwords
Multiple answers: split, then take the highest F1
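The metric described above can be sketched in a few lines. This is an illustrative reimplementation, not the official LoCoMo script: the stopword list (articles only) and the `;` separator for multi-answer golds are assumptions, and these are exactly the details that tend to differ between papers.

```python
import string
from collections import Counter

# Illustrative stopword list (an assumption); the official script may differ.
STOPWORDS = {"a", "an", "the"}


def normalize(text):
    """Lowercase, strip punctuation, drop stopwords, return the token list."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return [tok for tok in text.split() if tok not in STOPWORDS]


def token_f1(pred, gold):
    """SQuAD-style token-level F1 between one prediction and one gold answer."""
    pred_toks, gold_toks = normalize(pred), normalize(gold)
    if not pred_toks or not gold_toks:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_toks == gold_toks)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_toks) & Counter(gold_toks)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)


def best_f1(pred, gold, sep=";"):
    """Multi-answer handling: split the gold on `sep` (assumed) and keep the max."""
    return max(token_f1(pred, g) for g in gold.split(sep))


print(token_f1("The user moved to Berlin.", "Berlin"))  # 0.4
print(best_f1("blue", "red; blue"))                     # 1.0
```

The first call shows why verbose but correct answers score low: one matching token out of four predicted tokens gives precision 0.25, recall 1.0, and F1 0.4.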

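2.5 Data characteristics

Multi-query → a cache hit rate above 70% is inevitable (questions within the same conversation share context)
But a cache hit ≠ a more accurate answer (the core finding of H1b in this module author's paper)
F1 is not comparable across papers: normalize implementations differ, and so does multi-answer handling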
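The structural difference between the two benchmarks can be made concrete with a back-of-the-envelope calculation. The sketch below assumes an idealized cache keyed by conversation, where the first query on a conversation is a miss and every later one is a hit; real systems will land below this bound, and the function name is hypothetical.

```python
def conversation_cache_hit_rate(queries_per_conversation):
    """Hit rate of an idealized cache keyed by conversation: the first query
    on each conversation is a miss, every later query on it is a hit."""
    total = sum(queries_per_conversation)
    misses = sum(1 for n in queries_per_conversation if n > 0)
    return (total - misses) / total


longmemeval_s = [1] * 500   # 500 questions, one conversation each
locomo = [199] * 10         # 10 conversations, ~199 QA pairs each

print(conversation_cache_hit_rate(longmemeval_s))     # 0.0
print(round(conversation_cache_hit_rate(locomo), 3))  # 0.995
```

Under this assumption LongMemEval cannot produce a single hit, while LoCoMo's upper bound is close to 100%, which is why a high hit rate on LoCoMo says little by itself.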

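2.6 Known Limitations

Absolute F1 ranges from 24% to 82% across papers, so it is essentially not comparable (see Module 19, Ch1 §2)
Judge implementations differ (token-F1 vs. in-house LLM judges)
Retrieval pipeline strength varies enormously on long conversations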

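2.7 30-second takeaway

LoCoMo = a multi-query cache testbed. But absolute F1 numbers are not comparable; the paired Δ is the real signal. Reading this paper means knowing why "our F1 is 81%" on its own carries no information.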
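A paired Δ means scoring the baseline and the new system on the same questions with the same metric implementation, then looking at per-question differences. The following is a minimal sketch with a percentile bootstrap interval; the scores are invented for illustration, and the function name and defaults are assumptions rather than anything prescribed by the benchmark.

```python
import random
from statistics import mean


def paired_delta(base_scores, new_scores, n_boot=2000, seed=0):
    """Mean per-question difference (new - base) with a 95% bootstrap interval.

    Both lists must be aligned: index i is the same question in both runs.
    """
    if len(base_scores) != len(new_scores):
        raise ValueError("scores must be aligned on the same questions")
    deltas = [n - b for b, n in zip(base_scores, new_scores)]
    rng = random.Random(seed)
    # Resample questions with replacement and recompute the mean difference.
    boots = sorted(
        mean(rng.choices(deltas, k=len(deltas))) for _ in range(n_boot)
    )
    lo = boots[int(0.025 * n_boot)]
    hi = boots[int(0.975 * n_boot) - 1]
    return mean(deltas), (lo, hi)


# Hypothetical per-question F1 scores for two systems on the same 6 questions.
base = [0.40, 0.10, 0.75, 0.30, 0.55, 0.20]
new = [0.50, 0.10, 0.80, 0.45, 0.50, 0.35]

delta, ci = paired_delta(base, new)
print(f"paired delta = {delta:+.3f}, 95% CI = [{ci[0]:+.3f}, {ci[1]:+.3f}]")
```

Because both systems share the normalize step and the multi-answer rule, implementation quirks largely cancel in the difference, which is what makes Δ more informative than either absolute number.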

3. MemBench (2025)

3.1 Basic information

Type: a unified benchmark spanning multiple memory abilities
Status: proposed in 2025, arXiv preprint

3.2 Design intent

"LongMemEval is single-query and LoCoMo is multi-query, but both lean toward dialogue; MemBench adds three new abilities: structured fact recall / preference consistency / temporal precision."

3.3 Data construction

~2000 questions covering 8 ability classes
Each question comes with a ground-truth fact list (rather than a natural-language answer)
Structured evaluation: exact match on fact-ID

3.4 Official judge

Structured exact match (no LLM judge needed)
Resists score inflation from answers that are semantically similar but not equal
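Exact match on fact-IDs reduces to set equality, which is why no LLM judge is needed. The sketch below illustrates the idea only; the fact-ID format and function names are hypothetical, not MemBench's actual schema.

```python
def fact_exact_match(pred_ids, gold_ids):
    """True only if the predicted fact-ID set equals the gold set exactly."""
    return set(pred_ids) == set(gold_ids)


def exact_match_accuracy(examples):
    """examples: list of (pred_ids, gold_ids) pairs."""
    hits = sum(fact_exact_match(p, g) for p, g in examples)
    return hits / len(examples)


# Hypothetical fact IDs, for illustration only.
examples = [
    (["f3", "f1"], ["f1", "f3"]),        # same set, order ignored: match
    (["f1", "f3", "f9"], ["f1", "f3"]),  # one extra fact: no partial credit
]

print(exact_match_accuracy(examples))  # 0.5
```

The second example shows the strictness at work: a prediction that contains every gold fact plus one extra scores zero, just like a completely wrong one.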

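3.5 Data characteristics

Stricter evaluation (exact match)
But it departs substantially from natural-language dialogue → a system that converses fluently may still fail to score on MemBench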

3.6 Known Limitations

Relatively small dataset (2000 questions)
Structured evaluation over-penalizes flexible phrasing
Lower industry adoption than LongMemEval / LoCoMo

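3.7 30-second takeaway

MemBench = structured fact eval: strict, but removed from dialogue. If you are measuring "factual memory precision", use MemBench; to measure "conversational coherence", use LongMemEval / LoCoMo.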

4. MEMTRACK (2025)

4.1 Basic information

Type: long-running agent memory tracking benchmark
Status: proposed in 2025

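4.2 Design intent

"Existing benchmarks test 'does the agent remember' but not 'does the agent remember its own state changes' (e.g. 'I promised a meeting next week' → one week later, should it send a reminder?). MEMTRACK treats the agent's own state trajectory as the memory content."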

4.3 Data construction

Multi-turn task-oriented dialogue; in each turn the agent makes a commitment / state change
Queries test whether the agent can trace its own commitments
~500 multi-turn tasks

4.4 Official judge

LLM judge + state-aware metric (whether the agent state matches the ground-truth trace).
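To make "agent state matches the ground-truth trace" concrete, here is one possible reading as a step-wise comparison. This is a hypothetical sketch, not MEMTRACK's actual metric: the trace representation (one dict per step), the state keys, and the function name are all assumptions for illustration.

```python
def state_trace_agreement(pred_trace, gold_trace):
    """Fraction of steps at which the predicted state equals the gold state."""
    if len(pred_trace) != len(gold_trace):
        raise ValueError("traces must cover the same steps")
    if not gold_trace:
        return 1.0
    matches = sum(p == g for p, g in zip(pred_trace, gold_trace))
    return matches / len(gold_trace)


# Hypothetical traces: the agent promises a meeting, then should remind a week later.
gold = [
    {"meeting": None},
    {"meeting": "promised for next week"},
    {"meeting": "reminder due"},
]
pred = [
    {"meeting": None},
    {"meeting": "promised for next week"},
    {"meeting": "promised for next week"},  # the agent missed the reminder
]

score = state_trace_agreement(pred, gold)
print(round(score, 3))  # 0.667
```

Even in this toy form, the metric needs a labeled state at every step, which hints at why such data is expensive to produce.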

4.5 Data characteristics

Stricter than dialogue memory: the agent must track its own state
Evaluation is complex (agent view + state view)

4.6 Known Limitations

High data generation cost (every example needs a labeled state trace)
High evaluation compute cost (state diff)
Not comparable with dialogue benchmarks

4.7 30-second takeaway

MEMTRACK = agent self-state memory, the benchmark closest to the industry's "long-term assistant". If you are building a real assistant product, it is required reading.

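5. Side-by-side comparison of the 4 benchmarks + how to choose

Dimension	LongMemEval	LoCoMo	MemBench	MEMTRACK
Question count	500 (same questions in S and M)	~2000 (10 conv)	~2000	~500 tasks
Query mode	single-query / conv	multi-query / conv	single-query	multi-turn task
Haystack	synthetic, ~115K tokens	synthetic, ~30 sessions	short context + fact list	task dialogue
Judge	per-ability LLM	token-F1	exact match	state-aware
Main ability tested	long-term memory	multi-query cache + long dialogue	structured fact	agent self-state
Cross-paper comparability	★★	★ (absolute values drift heavily)	★★★	★★
Industry adoption	★★★	★★★	★	★
Reading priority	P0	P0	P1	P1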

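5.1 How to choose a benchmark

What you want to measure	Which to use
The limits of long-haystack retrieval	LongMemEval
Cache hit rate / multi-query reuse	LoCoMo
Precise factual memory	MemBench
Agent tracking its own state	MEMTRACK
Combination: writing an ARR paper	LongMemEval + LoCoMo as a dual benchmark (the approach of this module author's paper)

🌟 Recommendation: a dual benchmark strengthens the generality claim and covers both kinds of trigger, single-query and multi-query.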

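6. 5 reasons benchmark papers are required reading

Understand the judge: knowing the official judge prompt tells you how far a G3 paper's "in-house judge" deviates
Spot oracles: fields such as question_type / category are where oracle misuse starts
Understand limitations: know the benchmark's own noise (e.g. LoCoMo's absolute F1 drift)
Choose benchmarks: pick a combination for your own paper
Withstand reviewers: when a reviewer asks "why not test on benchmark X", you can explain why X is not a fit

⭕ Audit tip: if a G3 paper cites only "papers that use the benchmark" and never the benchmark paper itself, treat it as a red flag: the authors may not have read the benchmark's design document.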

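✅ Self-check list

Can explain LongMemEval-S: 500 questions + 6 abilities + the question_type field
Can explain the LoCoMo structure (10 conv × ~199 QA × 54 sessions) + the token-F1 normalize pipeline
Can explain the distinguishing design intent of MemBench and of MEMTRACK
Can list 2 known limitations for each of the 4 benchmarks
Can explain how the LongMemEval official judge differs from token-F1
Can pick a suitable benchmark combination for your own paper
Can spot the 5 red flags of benchmark usage in a G3 paper (custom judge / oracle / single benchmark / F1 only / no paired comparison)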

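📚 References

Conceptual introduction

Chapter 1 of this roadmap, the priority matrix: all 4 benchmarks are P0/P1
Module 19, Chapter 1: empirical discussion of benchmark non-comparability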

Key papers (the 4 protagonists of this chapter)

LongMemEval (Wu et al., ICLR 2025): arXiv 2410.10813
LoCoMo (Maharana et al., 2024): arXiv 2402.17753
MemBench (2025): arXiv preprint
MEMTRACK (2025): arXiv preprint

Industry discussion

HuggingFace datasets xiaowu0162/longmemeval: download entry point
OpenReview LongMemEval ICLR'25 reviews: a sample of reviewer concerns

Framework documentation (if applicable)

xiaowu0162/LongMemEval GitHub: includes the official judge implementation
snap-research/LoCoMo GitHub: includes the official token-F1 metric
