Chapter 8: End-to-End Walkthrough, Reproducing an Agent Memory Negative-Result Audit

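This chapter assembles the tools from Chapters 1-7 into one executable pipeline: cloning the baseline systems, downloading the benchmark data, paired evaluation on 4 backbones, positive-control injection, the cumulative-effect plot, and a LaTeX paper skeleton. Every step comes with a copy-paste script and its expected artifact (JSON / PDF / table). With a budget of about $5 (DeepSeek direct + USTC relay + ARK trial) and 1 week of time, you can go from an empty directory to a negative-result empirical paper of EMNLP / ARR quality. Treat it as the "weekend project" that puts the methodology of the previous 7 chapters into practice.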

📑 Contents

1. Environment setup (Day 0)
2. Cloning the 11 baseline systems and building the atlas (Day 1)
3. Dataset download and splits (Day 1)
4. M2: trigger AUC and risk-coverage (Day 2)
5. H1a: LongMemEval-S, 4 backbones, paired (Day 3)
6. H1b: LoCoMo, primary backbone full run + sanity (Day 4)
7. C+oracle positive control (Day 5)
8. P5 cumulative-effect plot (Day 5)
9. Paper skeleton (LaTeX ACL template) (Day 6-7)
10. One-week schedule and budget
✅ Self-check list
📚 References

1. Environment setup (Day 0)

1.1 Directory layout

agent_mem_audit/
├── README.md
├── EXPERIMENTS_PREREG.md
├── DEVIATIONS.md
├── motivation/
│   ├── m2_trigger_auc.py
│   ├── persistence_ablation.py
│   ├── locomo_ablation.py
│   ├── locomo_oracle_control.py
│   └── figures/
├── baselines/                ← clone 11 SOTA repos
├── datasets/
│   ├── LongMemEval/
│   └── LoCoMo/
└── paper/
    ├── main.tex
    └── custom.bib

1.2 Dependencies

# Python 3.10+, no GPU required
pip install openai sentence-transformers rank_bm25 \
            scikit-learn scipy pandas matplotlib \
            "httpx[socks]"

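Model cache (downloaded automatically on first use; later commands set HF_HUB_OFFLINE=1, so run once with network access to populate the cache):

Model	Size	Purpose
`BAAI/bge-large-en-v1.5`	1.3 GB	strong-baseline embedder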
`BAAI/bge-reranker-base`	~1.1 GB (about 280M parameters)	reranker
`sentence-transformers/all-MiniLM-L6-v2`	90 MB	default embedder

1.3 LLM API setup

Three OpenAI-compatible endpoints that act as fallbacks for one another:

# 1. DeepSeek direct (flagship + flash)
export DEEPSEEK_API_KEY=sk-...
export DEEPSEEK_BASE_URL=https://api.deepseek.com

# 2. USTC LLM relay (free; covers deepseek-v4-pro / qwen3.6-chat)
export USTC_API_KEY=sk-...
export USTC_BASE_URL=https://api.llm.ustc.edu.cn/v1

# 3. Volcano Engine ARK (claude-3-5-sonnet)
export ARK_API_KEY=...
export ARK_BASE_URL=https://ark.cn-beijing.volces.com/api/coding/v3
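
A minimal sketch of how the three endpoints above could be resolved into an ordered fallback list. The helper name `configured_endpoints` and the fallback order are assumptions for illustration; only the environment variable names come from the exports above.

```python
import os

# (label, key variable, base-URL variable), in the fallback order assumed here.
ENDPOINT_VARS = [
    ("deepseek", "DEEPSEEK_API_KEY", "DEEPSEEK_BASE_URL"),
    ("ustc", "USTC_API_KEY", "USTC_BASE_URL"),
    ("ark", "ARK_API_KEY", "ARK_BASE_URL"),
]


def configured_endpoints(env=None):
    """Return the endpoints whose key and base URL are both set, in order."""
    env = os.environ if env is None else env
    found = []
    for label, key_var, url_var in ENDPOINT_VARS:
        key, url = env.get(key_var), env.get(url_var)
        if key and url:
            found.append({"name": label, "api_key": key, "base_url": url})
    return found


if __name__ == "__main__":
    for ep in configured_endpoints():
        # Print only the base URL; never echo the key itself.
        print(ep["name"], ep["base_url"])
```

Each entry can be handed to an OpenAI-compatible client as `api_key` and `base_url`; on a 503 or a rate limit, the caller moves on to the next entry in the list.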

🌟 Budget rule of thumb: one paired H1a run (n=400) ≈ 800 LLM calls ≈ $0.5; running all 4 backbones ≈ $4. LoCoMo H1b full (n=764) ≈ $1. **A total budget of about $5 covers all main experiments.**
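
The rule of thumb can be turned into a small estimator. This is a sketch under two assumptions that are not stated in the text: every question costs one LLM call per condition (which is what "n=400 ≈ 800 calls" suggests), and the per-call cost is flat across backbones.

```python
# Per-call cost implied by the estimate above: 800 calls for about $0.5.
COST_PER_CALL_USD = 0.5 / 800


def paired_run_cost(n_pairs, conditions=2, cost_per_call=COST_PER_CALL_USD):
    """Estimated API cost of a paired run: one call per question per condition."""
    return n_pairs * conditions * cost_per_call


if __name__ == "__main__":
    print(f"H1a, n=400: ${paired_run_cost(400):.3f}")
    print(f"H1b, n=764: ${paired_run_cost(764):.3f}")
```

With these assumptions the H1b full run comes out just under $1, in line with the estimate above. Real costs depend on prompt length and on each provider's pricing, so treat the output as an order-of-magnitude check.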

2. Cloning the 11 baseline systems and building the atlas (Day 1)

2.1 One-shot clone script

mkdir baselines && cd baselines
for repo in \
  "mem0ai/mem0" \
  "wujiangxu/A-Mem" \
  "BAI-LAB/MemoryOS" \
  "..."; do
  git clone --depth=1 "https://github.com/$repo.git"
done

⚠️ Take the actual repo names from the GitHub URLs on the atlas cards in Chapter 2 §4; some systems are not open source, so skip them.

2.2 Atlas card template

Write one baselines/{system}_atlas.md per repo:

# {System Name}

## Write trigger
- Type: [input-driven / output-driven / failure-driven / scheduled / hybrid / oracle]
- Key code: `path/to/file.py:function()`
- Excerpt:
  ```python
  # the 5-10 key lines that decide when to write
  ```

## Read behavior
- Type: [retrieval-only / adaptive / generative-on-read]
- Key code: `path/to/retrieve.py:retrieve()`

## Build cost
- Per conversation: ~XK tokens (measured / reported in the paper)
- Uses oracle fields: yes/no (which one)

## Paired-evaluation status
- Paired McNemar: yes/no
- Zero-build C1 baseline: yes/no
- Positive control: yes/no

2.3 Output of the atlas step

The 11 cards are merged into `motivation/mechanism_atlas.md`, which serves as the source material for the paper's §2 Related Work.
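
One way to do the merge is a few lines of Python. This is a sketch, not the author's script: the function name `merge_atlas_cards` and the `---` separator between cards are assumptions; the paths follow the directory layout in §1.1.

```python
from pathlib import Path


def merge_atlas_cards(baselines_dir, output_path):
    """Concatenate every *_atlas.md card (sorted by file name) into one Markdown file."""
    cards = sorted(Path(baselines_dir).glob("*_atlas.md"))
    parts = [card.read_text(encoding="utf-8").strip() for card in cards]
    output = Path(output_path)
    output.parent.mkdir(parents=True, exist_ok=True)
    # A horizontal rule between cards keeps them visually separate in the merged file.
    output.write_text("\n\n---\n\n".join(parts) + "\n", encoding="utf-8")
    return [card.name for card in cards]


if __name__ == "__main__":
    merged = merge_atlas_cards("baselines", "motivation/mechanism_atlas.md")
    print(f"merged {len(merged)} cards")
```

Sorting by file name makes the merged file reproducible, so a diff of `mechanism_atlas.md` only changes when a card changes.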

3. Dataset download and splits (Day 1)

3.1 LongMemEval-S

cd datasets
git clone https://github.com/xiaowu0162/LongMemEval.git
cd LongMemEval && bash download.sh  # fetches the 500 questions + ~115K-token haystack

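The 500 questions already come with 6 ability-category labels. There is no need to physically split the files into dev/test beforehand; instead, pre-register the rule: τ is selected on dev only, and the main experiments are run on test.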
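
A pre-registered split is easiest to defend when it is a deterministic function of the question IDs. The sketch below is one way to do that; `dev_fraction=0.2` and the seed are illustrative assumptions (the text does not state the split sizes), and whatever values you choose should be written into EXPERIMENTS_PREREG.md before any run.

```python
import random
from collections import defaultdict


def preregistered_split(question_ids, question_types, dev_fraction=0.2, seed=20240101):
    """Deterministic dev/test split, stratified by question type.

    dev_fraction and seed are placeholders, not values from the chapter.
    """
    by_type = defaultdict(list)
    for qid, qtype in zip(question_ids, question_types):
        by_type[qtype].append(qid)

    rng = random.Random(seed)
    dev, test = [], []
    for qtype in sorted(by_type):
        ids = sorted(by_type[qtype])  # sort first so input order cannot change the split
        rng.shuffle(ids)
        n_dev = round(len(ids) * dev_fraction)
        dev.extend(ids[:n_dev])
        test.extend(ids[n_dev:])
    return sorted(dev), sorted(test)


if __name__ == "__main__":
    ids = [f"q{i:03d}" for i in range(500)]
    types = [f"type{i % 5}" for i in range(500)]  # synthetic labels for the demo
    dev, test = preregistered_split(ids, types)
    print(len(dev), len(test))
```

Because the split depends only on the IDs, the labels, and the seed, anyone can regenerate it and confirm that τ was never tuned on test questions.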

3.2 LoCoMo

git clone https://github.com/snap-research/LoCoMo.git
# data/locomo10.json contains 10 conv × ~199 QA

Pre-registered splits:

H1b primary: 5 conv × all QA = 764 paired
H1b sanity: 1 conv × 100 QA on secondary backbone
C+oracle: 1 conv × 100 paired QA on tertiary backbone

4. M2: Trigger AUC and risk-coverage (Day 2)

4.1 Run the M2 baseline

cd motivation
HF_HUB_OFFLINE=1 python3 m2_trigger_auc.py \
  --n_questions 500 \
  --backend deepseek_pro_fast \
  --output figures/M2_full_500.json

Output: for each question, the dense / RRF scores, whether the answer was correct, and the question_type label.

4.2 Plot ROC and risk-coverage

python3 figures/plot_M2_roc.py --input figures/M2_full_500.json
python3 figures/plot_M2_riskcov.py --input figures/M2_full_500.json

Expected: max_score AUC ≈ 0.64 overall; per-ability AUC ranges from 0.78 (saturated) down to 0.53 (multi-session).
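
For readers who want to see what the risk-coverage plot contains, here is a minimal sketch of the curve itself: questions are ranked by score, and for each coverage level the risk is the error rate among the questions answered so far. This follows the usual selective-prediction definition and is not taken from `plot_M2_riskcov.py`.

```python
def risk_coverage_curve(scores, correct):
    """Return (coverage, risk) points, taking the highest-scoring questions first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    points, errors = [], 0
    for k, i in enumerate(order, start=1):
        errors += 0 if correct[i] else 1
        points.append((k / len(order), errors / k))
    return points


if __name__ == "__main__":
    # Tiny synthetic example: four questions, two of them answered correctly.
    for coverage, risk in risk_coverage_curve([0.9, 0.8, 0.3, 0.2], [1, 0, 1, 0]):
        print(f"coverage={coverage:.2f} risk={risk:.2f}")
```

A useful trigger signal gives a curve that starts near zero risk and rises with coverage; a curve that is flat from the start says the score carries little information about correctness.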

🌟 Why it matters for the audit: AUC 0.64 is below the pre-registered viability threshold of 0.65, so "the trigger signal is weak" becomes the prior for the later H1 failure.

5. H1a: LongMemEval-S, 4 backbones, paired (Day 3)

5.1 Primary backbone, full run

python3 persistence_ablation.py \
  --condition C3 --n 500 --backend ark_claude \
  --tau 0.40 --tau_signal max_score \
  --output figures/ablation_C3_500.json --stratified

5.2 Three sanity backbones

for bk in lab_gpt54 deepseek_pro_fast ustc_qwen36; do
  python3 persistence_ablation.py \
    --condition C3 --n 100 --backend $bk \
    --tau 0.40 --output figures/ablation_${bk}_C3_100.json
done

5.3 Paired analysis

Run the McNemar script from Chapter 4 §3 and the paired-bootstrap script from Chapter 4 §5; they write figures/H1a_4backbones_summary.json:

{
  "claude":   {"n": 400, "delta_pp": -1.50, "p": 0.88, "ci": [-4.25, 1.25]},
  "gpt54":    {"n":  96, "delta_pp": -5.21, "p": 0.99, "ci": [-9.38, -1.04]},
  "deepseek": {"n":  96, "delta_pp": -1.04, "p": 0.75, "ci": [-6.25,  4.17]},
  "qwen36":   {"n":  96, "delta_pp": -1.04, "p": 0.77, "ci": [-6.25,  4.17]}
}
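
The `delta_pp` and `ci` fields can be reproduced from two aligned correctness vectors. The sketch below uses a plain percentile bootstrap over question pairs; the Chapter 4 script may differ (for example, it may resample clusters), and the discordant counts in the demo are synthetic, chosen only so that Δ equals -1.50 pp on n=400.

```python
import random


def paired_bootstrap_ci(base_correct, treat_correct, n_boot=10000, alpha=0.05, seed=0):
    """Accuracy difference (treatment - baseline) in pp, with a percentile bootstrap CI."""
    if len(base_correct) != len(treat_correct):
        raise ValueError("paired vectors must have the same length")
    diffs = [int(t) - int(b) for b, t in zip(base_correct, treat_correct)]
    n = len(diffs)
    delta_pp = 100.0 * sum(diffs) / n
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        sample = rng.choices(diffs, k=n)  # resample question pairs with replacement
        stats.append(100.0 * sum(sample) / n)
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return delta_pp, (lo, hi)


if __name__ == "__main__":
    # Synthetic demo: 20 pairs only the baseline gets right, 14 only the treatment.
    base = [1] * 20 + [0] * 14 + [1] * 366
    treat = [0] * 20 + [1] * 14 + [1] * 366
    delta, (lo, hi) = paired_bootstrap_ci(base, treat, n_boot=2000)
    print(f"delta = {delta:+.2f} pp, 95% CI [{lo:+.2f}, {hi:+.2f}]")
```

Resampling pairs rather than the two vectors independently is what makes the interval "paired": questions that both conditions get right or wrong contribute zero to every resample.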

🌟 Expected result: all 4 backbones point the same way (null). Every Δ is negative, in the range [-5.21, -1.04] pp, and none reaches the +2 pp threshold.
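
The threshold check can be applied mechanically to the summary file. The sketch below hard-codes the four entries from the JSON above; the function name `classify` and the three flags are illustrative choices, not part of the chapter's scripts.

```python
H1A_SUMMARY = {
    "claude":   {"n": 400, "delta_pp": -1.50, "ci": [-4.25, 1.25]},
    "gpt54":    {"n": 96,  "delta_pp": -5.21, "ci": [-9.38, -1.04]},
    "deepseek": {"n": 96,  "delta_pp": -1.04, "ci": [-6.25, 4.17]},
    "qwen36":   {"n": 96,  "delta_pp": -1.04, "ci": [-6.25, 4.17]},
}
THRESHOLD_PP = 2.0  # the +2 pp threshold named in the text


def classify(entry, threshold_pp=THRESHOLD_PP):
    """Flag whether the point estimate meets the threshold and what the CI rules out."""
    lo, hi = entry["ci"]
    return {
        "meets_threshold": entry["delta_pp"] >= threshold_pp,
        "ci_excludes_zero": lo > 0 or hi < 0,
        "ci_below_threshold": hi < threshold_pp,
    }


if __name__ == "__main__":
    for name, entry in H1A_SUMMARY.items():
        print(name, classify(entry))
```

On these numbers no backbone meets the threshold. The flags also separate two cases worth reporting: the gpt54 interval excludes zero on the negative side, while the two n=96 runs with an upper bound of 4.17 pp cannot by themselves rule out a +2 pp gain.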

6. H1b: LoCoMo, primary backbone full run + sanity (Day 4)

6.1 Primary backbone, full run (C1 / C3 / C4)

for cond in C1 C3 C4; do
  python3 locomo_ablation.py \
    --condition $cond --n_conv 5 --backend ark_claude \
    --tau 0.47 --tau_signal max_score \
    --output figures/locomo_${cond}_full5.json
done

6.2 P2 ablation (C5 / C6)

# C5: separate channels (k_raw=10 + k_artifact=2 = 12)
python3 locomo_ablation.py --condition C5 --n_conv 3 \
  --backend deepseek_pro_fast --k_raw 10 --k_artifact 2 \
  --output figures/locomo_C5_3conv.json

# C6: budget-matched (k_raw=8 + k_artifact=2 = 10, same as C4); note the command still passes --condition C5
python3 locomo_ablation.py --condition C5 --n_conv 3 \
  --backend deepseek_pro_fast --k_raw 8 --k_artifact 2 \
  --output figures/locomo_C6_budget_3conv.json

6.3 Cross-backbone H1b sanity

for bk in lab_gpt54 deepseek_pro_fast; do
  for cond in C1 C3 C4; do
    python3 locomo_ablation.py --condition $cond --n_conv 1 \
      --max_qa_per_conv 100 --backend $bk \
      --output figures/locomo_${bk}_${cond}_sanity.json
  done
done

🌟 Expected: on the primary backbone, H1b primary gives Δ(C4-C3) ≈ +0.00 pp on the binary metric; on the sanity backbones the direction is inconsistent.

7. C+oracle positive control (Day 5)

See Chapter 5 §3 for the full template:

HF_HUB_OFFLINE=1 python3 locomo_oracle_control.py \
  --n_conv 1 --backend ustc_dspro \
  --output figures/locomo_oracle_1conv.json

7.1 Paired analysis

Using the script from Chapter 5 §3.3, the expected output is:

Δ = +20.00 pp on n=100
McNemar: n10=21 n01=1 p=1.10e-05
95% CI: [+11.00, +29.00] pp
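
The Δ and p-value in this output follow directly from the two discordant counts. The sketch below computes a two-sided exact McNemar test with the standard library; it assumes `n10` counts the questions that only the oracle condition answers correctly and `n01` the questions that only the baseline answers correctly.

```python
from math import comb


def mcnemar_exact(n10, n01):
    """Two-sided exact McNemar p-value from the two discordant counts."""
    n = n10 + n01
    if n == 0:
        return 1.0  # no discordant pairs: no evidence either way
    k = min(n10, n01)
    # Under H0 each discordant pair falls on either side with probability 0.5.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    n10, n01, n_total = 21, 1, 100
    delta_pp = 100.0 * (n10 - n01) / n_total
    print(f"delta = {delta_pp:+.2f} pp on n={n_total}")
    print(f"McNemar: n10={n10} n01={n01} p={mcnemar_exact(n10, n01):.2e}")
    # prints p=1.10e-05, matching the expected output above
```

Only the 22 discordant pairs enter the test; the 78 questions on which both conditions agree affect Δ through the denominator but not the p-value.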

🌟 Why it matters for the audit: putting this line in the abstract moves the paper from "6/10 Almost" to "7/10 Accept".

8. P5 cumulative-effect plot (Day 5)

See Chapter 4 §6 for the full code:

python3 figures/plot_P5_cumulative.py \
  --c1 figures/M2_full_500.json \
  --c3 figures/ablation_C3_500.json \
  --output figures/P5_cumulative.pdf

Expected: Δ drifts within [-7, +12] pp for n = 10-50 and settles at about -1.5 pp for n ≥ 100.
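
The quantity being plotted is the running accuracy difference after the first n question pairs. A minimal sketch of that computation, assuming two correctness vectors aligned by question (the plotting script itself is in Chapter 4 §6 and is not reproduced here):

```python
def cumulative_delta_pp(base_correct, treat_correct):
    """Running accuracy difference (treatment - baseline) in pp after the first n pairs."""
    curve, running = [], 0
    for n, (b, t) in enumerate(zip(base_correct, treat_correct), start=1):
        running += int(t) - int(b)
        curve.append(100.0 * running / n)
    return curve


if __name__ == "__main__":
    # Tiny synthetic example; real input would be the per-question results of C1 and C3.
    base = [1, 0, 1, 1]
    treat = [0, 0, 1, 0]
    print(cumulative_delta_pp(base, treat))
```

Plotting `curve` against n (for instance with `matplotlib.pyplot.plot`) shows why small-n results are unreliable: a single discordant pair moves the curve by 100/n pp, which is 10 pp at n=10 and only 1 pp at n=100.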

9. Paper skeleton (LaTeX ACL template) (Day 6-7)

9.1 Start from the ACL Rolling Review template

git clone https://github.com/acl-org/acl-style-files.git
cd paper && cp ../acl-style-files/latex/* .

9.2 Section template (8-page main body)

\documentclass[11pt]{article}
\usepackage[review]{acl}

\title{Cache Hits Don't Save Agent Memory: A Pre-Registered Audit of Construction Trigger Primitives}

\begin{document}
\maketitle

\begin{abstract}
% Abstract: ~300 words, must include
% 1. Mechanism atlas (11 systems)
% 2. H1a: 4 backbones, all null
% 3. H1b: full-scale primary + 2 sanity, null
% 4. P2 ablation: all 4 cache variants cluster within 2.1 pp
% 5. Positive control: +20 pp McNemar p=1.1e-5
\end{abstract}

\section{Introduction}                         % 1 page
\section{Related Work: A Mechanism Atlas}     % 1.5 pages (from the Chapter 2 atlas)
\section{Method: A Test Bed for Construction Triggers}  % 1.5 pages
\section{Experiments}                         % 3 pages
  \subsection{Datasets and protocol}
  \subsection{M2: Trigger AUC and risk-coverage}
  \subsection{H1a: ephemeral trigger fails on all four backbones}
  \subsection{H1b: persistent cache amortizes but does not transfer}
  \subsection{Strong-pipeline replication}
  \subsection{Backbone robustness}
  \subsection{P2 ablation: no cache configuration helps}
  \subsection{Positive control: cache mechanism works when content is good}
  \subsection{Cumulative-effect analysis}
\section{Discussion}                          % 0.5 page
\section{Limitations}                         % 0.5 page
\section{Conclusion}                          % 0.3 page
\bibliography{custom}
\appendix
  \section{Cumulative Effect Analysis}
  \section{$\tau$ Sensitivity Sweep}
\end{document}

9.3 The 5 pieces of evidence you must have

#	Evidence	Where it goes
1	Pre-registration	EXPERIMENTS_PREREG.md + footnote in paper §3
2	4-backbone H1a paired	§4.3 table + figure
3	C+oracle positive control	§4.8 as its own subsection + abstract
4	Cumulative-effect plot	Appendix A
5	DEVIATIONS log	DEVIATIONS.md + footnote in paper §3

10. One-week schedule and budget

Day	Task	API spend	Cumulative
0	Environment + API keys	$0	$0
1	11 baseline atlas cards + data download	$0	$0
2	M2 AUC + risk-coverage	$0.5	$0.5
3	H1a × 4 backbones	$1.5	$2.0
4	H1b primary 5 conv + sanity	$1.5	$3.5
5	C+oracle + P5 plot	$0.3	$3.8
6	Paper skeleton + abstract	$0	$3.8
7	Round 1 GPT-5.4 review + revision	$0.5	$4.3
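
The cumulative column is a running sum of the daily spend, which makes it easy to track against a hard cap while the week is in progress. A small sketch, using the daily figures from the table and the $10 ceiling from the self-check list; the function name is an illustrative choice.

```python
from itertools import accumulate

DAILY_SPEND_USD = [0.0, 0.0, 0.5, 1.5, 1.5, 0.3, 0.0, 0.5]  # Day 0..7, from the table
BUDGET_CAP_USD = 10.0  # ceiling named in the self-check list


def running_total(spend):
    """Cumulative spend per day, rounded to cents."""
    return [round(total, 2) for total in accumulate(spend)]


if __name__ == "__main__":
    totals = running_total(DAILY_SPEND_USD)
    for day, total in enumerate(totals):
        status = "ok" if total <= BUDGET_CAP_USD else "OVER BUDGET"
        print(f"Day {day}: ${total:.2f} ({status})")
```

Replacing the planned figures with the amounts actually billed each day gives an early warning if a run is retried more often than expected.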

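🌟 Key milestones:

End of Day 3: H1a is complete on all 4 backbones → you know whether to continue (if you see a +5 pp effect in the opposite direction, stop and re-register before going on)
End of Day 5: the positive-control numbers are in → decide how to write the abstract
End of Day 7: the first review round is done → ready to submit to ARR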

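10.1 Failure fallbacks

What you hit	What to do
API 503 / rate limiting	Switch to the USTC relay or DeepSeek flash
LongMemEval download fails	Use the HuggingFace mirror `xiaowu0162/longmemeval`
LoCoMo F1 comes out at 24%, far below SOTA	Leave it alone; this is exactly where your "paired-internal" claim comes from
H1a on the first backbone comes out at +5 pp	Pause, log it in DEVIATIONS.md, and re-register before continuing; do not quietly stop early just because it "looks like a win"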

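✅ Self-check list

Environment: can install all dependencies on a fresh machine within 30 minutes
Atlas: can add an atlas entry for a new memory system within 1 hour
Paired: can run H1a paired on 4 backbones and report McNemar / TOST / cluster bootstrap in full
Positive control: can run C+oracle to completion and write it into the abstract
The 5 figures: can produce Pareto + ROC + risk-coverage + cumulative + 4-backbone summary
Paper skeleton: can write an 8-page main body + Limitations + Appendix on the ACL template
Budget: can keep total API spend under $10
Honest failure handling: knows not to stop early when H1 looks like it is about to win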

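📚 References

Conceptual background

Chapters 1-7 of this module: the methodology behind each step in this chapter
ACL Style Files: acl-org/acl-style-files (paper template)

Key papers

All scripts under motivation/ from the module author's paper: the implementation reused directly in §4-§8 of this chapter
LongMemEval (Wu et al., ICLR 2025): arXiv 2410.10813
LoCoMo (Maharana et al., 2024): arXiv 2402.17753

Community discussion

ARR (ACL Rolling Review) submission process: where this chapter's 7-day project ends up
Public ARR review records on OpenReview: find reviews of papers on similar topics and rehearse the questions you are likely to get

Framework documentation (where applicable)

HuggingFace Hub: huggingface.co (model / dataset hosting)
OSF (Open Science Framework): hosting for pre-registration files
DeepSeek API official site: platform.deepseek.com
USTC LLM relay: internal resource (see the global configuration in ~/.claude/CLAUDE.md)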
