Chapter 8: Evaluation Methodology for v27: 5-cell vs 16-cell, drift workload design, and rep/CI/SSH engineering pitfalls

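Chapter 7 covered the negative results in full. How the positive data in paper §6 is run, computed, and presented is a separate methodology question. The core structure of v27 §6 is: RQ1 static 5-cell ablation + RQ2 drift adaptation + RQ3 tuner ablation + an honest account of the RQ4/RQ5 gaps. Each RQ that has data maps to one smoke harness script (bench/aura/run_v27_*.sh) and one results directory. The evaluation methodology itself (why 5-cell rather than 16-cell, how to pick a fair drift_period_ms, how many reps to run, how to compute CIs, how to handle SSH disconnects) is not written into the v27 paper outline, yet all of these are real pitfalls when running the experiments. This chapter fills in that methodology: after reading it you should be able to reproduce all the positive data in §6 and know exactly what is missing for the two gaps, the §6.5 LOTUS reference and the §6.6 12-CN sweep.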

📑 Contents

1. §6 evaluation structure: mapping RQ1-RQ5 to harnesses
2. How to read the RQ1 static 5-cell ablation
3. RQ2 drift workload design: choosing a fair drift_period_ms
4. How to present the RQ3 tuner ablation
5. RQ4 gap: how to fill in the LOTUS reference
6. RQ5 gap: what the 12-CN sweep needs
7. Engineering pitfalls: rep / CI / SSH / Drift::Stop

1. §6 evaluation structure: mapping RQ1-RQ5 to harnesses

The core structure of v27 §6:

        v27 §6 Evaluation
              │
   ┌──────────┼──────────────┬──────────────┐
   │          │              │              │
   ▼          ▼              ▼              ▼
  RQ1       RQ2             RQ3           RQ4 + RQ5
 ablation  drift           tuner         missing
 (static)  (dynamic)       (component)   (caveat)
   │          │              │              │
   ▼          ▼              ▼              ▼
 run_v27   run_v27         run_v27       §6.5 caveat
 _step6    _drift          _a1b          §6.6 future
 .sh       _smoke.sh       _smoke.sh

RQ	Claim	Data status	Harness script / blocker
RQ1	levers do not hurt throughput on static TPC-C	✓ 5-cell (commit d617704)	`run_v27_step6.sh`
RQ2	adaptation is visible under a drift workload	✓ drift_5sec/2sec (43a3adb + 159381a)	`run_v27_drift_smoke.sh`
RQ3	fixed/rule/sgd tuners differ	✓ a1b (8f49ef6)	`run_v27_a1b_smoke.sh`
RQ4	cost of MN-only vs CN-only locks	✗ MISSING	W11 finish OR LOTUS port
RQ5	CN scale-out trend	✗ MISSING	12-node cluster

→ Three RQs have data; two are left as a caveat / future work. The honesty principle (Chapter 3 §8) shows up again.

2. How to read the RQ1 static 5-cell ablation

2.1 Experiment design

Five cells:

baseline (fixed weights, A2 off, A3 off)
+A1 sgd (tuner fixed → sgd, A2/A3 off)
+A2 affinity (A2 on, A1/A3 off)
+A3 split (A3 on, A1/A2 off)
full_v27 (A1 sgd + A2 + A3 all on)

5 cells × 3 reps × 60 s = 900 s of total experiment time.

2.2 Data (commit d617704)

cell	KOPS commit (median)	atomic/txn	commit_rate
baseline	13.64	12.16	99.92%
+A1 sgd	13.94	12.10	99.93%
+A2 affinity	13.11	12.00	99.67%
+A3 split	13.59	12.16	99.91%
full_v27	13.70	12.10	99.94%

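2.3 The "correct reading"

Naive reading: "adaptive is useless on a static workload!"

Correct reading: "a static workload does not exercise the adaptive machinery, so levers being throughput-neutral under static load is the expected result."

Reasons:

Under a static workload the hot key set does not change, so the planner has no cohort to migrate
Cohort shapes need no adjustment either, so A3 split fires very rarely
So the levers "do not move": they sit on standby waiting for a phase shift

→ The value of the static ablation is inverted: it shows that the levers do not damage throughput on a static workload. If any one of +A1 / +A2 / +A3 cut static throughput by 10%, the paper would be dead. The largest drop in the table is +A2 affinity at about -3.9% (13.11 vs 13.64 KOPS).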
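The neutrality check above can be written down as a few lines of arithmetic. The sketch below is only an illustration: the KOPS values are copied from the table in 2.2, while the function names and the choice to treat the 10% figure as a hard threshold are assumptions, not part of the v27 harness.

```python
# Median committed KOPS per cell, copied from the table in 2.2.
MEDIAN_KOPS = {
    "baseline": 13.64,
    "+A1 sgd": 13.94,
    "+A2 affinity": 13.11,
    "+A3 split": 13.59,
    "full_v27": 13.70,
}

# Assumption: treat the "10% drop kills the paper" remark as a hard limit.
REGRESSION_LIMIT = -0.10


def relative_deltas(kops, reference="baseline"):
    """Relative throughput change of every cell against the reference cell."""
    base = kops[reference]
    return {cell: (v - base) / base for cell, v in kops.items() if cell != reference}


def regressions(kops, limit=REGRESSION_LIMIT):
    """Cells whose relative drop reaches the limit (hypothetical helper)."""
    return [cell for cell, d in relative_deltas(kops).items() if d <= limit]


deltas = relative_deltas(MEDIAN_KOPS)
for cell, delta in deltas.items():
    print(f"{cell:<14} {delta:+.1%}")
print("regressions:", regressions(MEDIAN_KOPS))
```

Every cell stays within a few percent of baseline, which is the "throughput-neutral" outcome the static ablation is meant to establish.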

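2.4 The 5-cell vs 16-cell trade-off

Why not run a full-factorial 16-cell design?

16-cell = 2 (A1 fixed/sgd) × 2 (A2 on/off) × 2 (A3 on/off) × 2 (rule/sgd, which overlaps with A1) = 16 combinations
16 cells × 3 reps × 60 s = 2880 s = 48 min (doable but unnecessary)
5-cell is a one-lever-at-a-time design plus the all-on cell (the cheapest option), and it already shows each lever's individual contribution
The extra "two levers on together" combinations that 16-cell would cover are subsumed by full_v27 (all three levers on), although full_v27 cannot attribute an interaction to a specific pair

→ 5-cell is the "minimal interpretable" design; 16-cell is the "exhaustive" one. The former is sufficient, the latter is redundant.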
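To make the trade-off concrete, the sketch below enumerates both designs and their wall-clock cost. It is an illustration under stated assumptions: the tuple encoding of a cell and all names are hypothetical, and the 16-cell grid is modelled exactly as the text counts it (four binary factors).

```python
from itertools import product

REPS, RUN_SECONDS = 3, 60

# The five cells from 2.1, encoded as (A1 tuner, A2 affinity on, A3 split on).
FIVE_CELL = {
    "baseline": ("fixed", False, False),
    "+A1 sgd": ("sgd", False, False),
    "+A2 affinity": ("fixed", True, False),
    "+A3 split": ("fixed", False, True),
    "full_v27": ("sgd", True, True),
}

# The 16-cell grid as counted in the text: four binary factors.
SIXTEEN_CELL = list(product([False, True], repeat=4))


def wall_clock_seconds(n_cells, reps=REPS, run_seconds=RUN_SECONDS):
    """Total run time for a design, ignoring setup and teardown."""
    return n_cells * reps * run_seconds


# Distinct lever settings once the overlapping fourth factor is dropped.
distinct = set(product(["fixed", "sgd"], [False, True], [False, True]))
not_covered = sorted(distinct - set(FIVE_CELL.values()))

print("5-cell :", wall_clock_seconds(len(FIVE_CELL)), "s")
print("16-cell:", wall_clock_seconds(len(SIXTEEN_CELL)), "s")
print("pairwise settings the 5-cell design does not run:", not_covered)
```

The last line lists the three two-lever settings that only a larger grid would measure directly, which is the price paid for the shorter run.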

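2.5 A counterintuitive finding: full_v27 < +A1 alone

Chapter 4 §7 covers this in detail: the levers interact, so their effects do not simply add up (full_v27 reaches 13.70 KOPS against 13.94 for +A1 sgd in the table in 2.2). Reporting this honestly is a plus for the paper.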

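3. RQ2 drift workload design: choosing a fair drift_period_ms

3.1 Drift implementation

Drift::CurrentPhase() switches the district_id bias every drift_period_ms:

phase A: districts [0, n/2) are hot (80% of txns access them)
phase B: districts [n/2, n) are hot

The switch is instantaneous, not gradual. Within a phase the workload is fully stable; between phases it is a step change.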
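A small Python model of this two-phase step function may help when reasoning about the workload. It is a sketch, not the C++ code in benchmark/Client/Drift.cc: the function names are hypothetical, it assumes the phases simply alternate A, B, A, B, that a period of 0 means "no drift", and that n is even and at least 2.

```python
import random

HOT_FRACTION = 0.80  # share of txns aimed at the hot half (from the text)


def current_phase(elapsed_ms, drift_period_ms):
    """Return 0 for phase A and 1 for phase B; period 0 disables drift."""
    if drift_period_ms <= 0:
        return 0
    return (elapsed_ms // drift_period_ms) % 2


def hot_range(phase, n_districts):
    """Half-open range of hot district ids for the given phase."""
    half = n_districts // 2
    return (0, half) if phase == 0 else (half, n_districts)


def pick_district(elapsed_ms, drift_period_ms, n_districts, rng):
    """Draw a district id: hot half with probability HOT_FRACTION."""
    lo, hi = hot_range(current_phase(elapsed_ms, drift_period_ms), n_districts)
    if rng.random() < HOT_FRACTION:
        return rng.randrange(lo, hi)
    cold = [d for d in range(n_districts) if not lo <= d < hi]
    return rng.choice(cold)


rng = random.Random(0)
for t_ms in (0, 4_999, 5_000, 9_999, 10_000):
    phase = "AB"[current_phase(t_ms, 5_000)]
    print(f"t={t_ms:>6} ms  phase {phase}  sample={pick_district(t_ms, 5_000, 10, rng)}")
```

The phase flips exactly at multiples of the period, which is the step change described above; nothing in the model ramps gradually.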

3.2 Choosing drift_period_ms

v27 uses three values:

drift_period_ms	switches per 60 s	pressure on planner
0	0 (control)	static baseline
5000	12	medium
2000	30	high

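3.3 Why not 100 ms / 30000 ms

100 ms is too fast: the planner tick itself is 100 ms (OwnershipPlanner re-evaluates every 100 ms). With drift running at the same frequency as the planner, the planner is always chasing and never catches up. That is equivalent to "no adaptive", which makes for an unfair experiment.

30000 ms is too slow: the drift period is 300× the planner tick. The whole 60 s run contains only one switch, i.e. "run steadily for 30 s after startup, switch once, then run another 30 s", which is not materially different from the static experiment.

5000 ms is a reasonable low-pressure end: the drift period is 50× the planner tick, so the planner has ample time to converge on the new phase. 2000 ms is a reasonable high-pressure end: the drift period is 20× the planner tick, so the planner can keep up, but only just.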
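The ratios quoted in this section can be tabulated with a few lines of Python. This is a sketch with hypothetical names; the 100 ms tick and 60 s run length come from the text. Note that it counts 60 s / period phase slots, which is how the table in 3.2 counts; the number of switches strictly inside the run is one fewer.

```python
PLANNER_TICK_MS = 100  # OwnershipPlanner re-evaluation interval (from the text)
RUN_MS = 60_000        # one 60 s run


def describe(drift_period_ms):
    """Planner ticks available per phase and phase slots per run."""
    if drift_period_ms == 0:  # control: a single phase, never switches
        return {"ticks_per_phase": None, "phase_slots": 1}
    return {
        "ticks_per_phase": drift_period_ms / PLANNER_TICK_MS,
        "phase_slots": RUN_MS // drift_period_ms,
    }


for period in (0, 100, 1000, 2000, 5000, 10_000, 30_000):
    info = describe(period)
    ticks = "n/a" if info["ticks_per_phase"] is None else f'{info["ticks_per_phase"]:.0f}x'
    print(f"{period:>6} ms  ticks/phase={ticks:>5}  phase slots={info['phase_slots']}")
```

Reading the output top to bottom shows the two failure modes: at 100 ms the planner gets a single tick per phase, and at 30000 ms the run holds only two phase slots.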

3.4 The obligation to justify the sanity check

In the §6.3 setup the paper must pre-empt this reviewer question:

"Why drift_period_ms = 5000 and 2000? Why not 1000 (= 10× planner tick) or 10000 (= 100× planner tick)?"

Answer template:

We chose drift_period_ms ∈ {5000, 2000} to span the regime where the planner is comfortable (5000 = 50× planner tick) to the regime where it must adapt quickly (2000 = 20× planner tick). drift_period_ms < 1000 would make each phase shorter than the planner's convergence time (~10 ticks × 100 ms), which is outside the adaptive regime; drift_period_ms > 10000 reduces to a near-static workload with at most 6 transitions in our 60 s experiment.

⭐ Teaching point: the paper must explain the choice of drift_period_ms as "slower than the planner tick, faster than the experiment duration". This is a sanity check reviewers will ask about.

3.5 Experiment data (drift_5sec + drift_2sec)

cell	KOPS commit (valid reps)	commit_rate	adaptation signals
no_drift (n=2)	15.39	99.95%	migrations≈0, splits≈11/tick
drift_5sec (n=1)	16.26	94.56%	migrations 0→21/tick, splits 11→64, p_local osc
drift_2sec (n=1)	16.90	98.50%	similar pattern

Data commits: 43a3adb + 159381a.

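3.6 Reading the data

Throughput does not collapse: KOPS under drift is of the same order as no_drift (slightly higher, in fact, but with only 1-2 valid reps per cell that difference is noise)
commit_rate drops to 94.56% (drift_5sec): a spike in aborts at the moment of a phase switch is expected
The three signals (migrations / splits / p_local) all jump, which shows the levers are actually moving

→ "throughput does not collapse + signals jump" = adaptation visible.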

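3.7 Four lines of evidence in the drift experiment

The robust evidence for paper §6.3:

migrations 0 → 21/tick (A2 and planner working together)
splits 11 → 64/tick (A3 firing)
p_local oscillating between 0.22 and 0.10 (A4 tracking)
SGD weights jump with the phase (Chapter 4 §6)

The four corroborate one another: if a reviewer attacks any single one, the other three still stand.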

4. How to present the RQ3 tuner ablation

4.1 Experiment design

3 tuner modes × 3 reps × {static + drift} = 18 runs of 60 s each.

4.2 Data (already shown in Chapter 4 §5)

Static:

tuner	KOPS	atomic/txn	SGD weights @ t=20s
fixed	13.70	12.18	—
rule	13.74	12.34	—
sgd	13.98	12.28	atomic 0.485 / rpc 0.985 / move 2.667 / load 0.179

Drift (to be added; the log data already exists): a weight-evolution time-series plot with five traces (4 weights + commit KOPS), one point every 100 ms, and a dashed line at each phase switch.

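4.3 Why both views are needed

The static view says: the tuner does not hurt static throughput (SGD's +2% is a bonus). The drift view says: the tuner really is adapting (weights jump with the phase).

Drop either one and a reviewer will push back:

Static only → "Is this +2% just noise? Please show drift."
Drift only → "Is the +N% under drift there because the tuner hurt the static case?"

Only the two together hold up: an attack on one is backstopped by the other data set.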

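4.4 The SGD headline

As Chapter 4 §6 stresses:

The §6.4 headline is not "+2%"; it is the time-series plot showing "weights jump in step with phase switches".

The reader's takeaway should be "the learned cost coefficients really do capture workload change", not "SGD added 2% throughput".

4.5 The insight behind SGD move=2.667

The weights @ t=20s show that SGD learned move = 2.667, about 5× the ~0.5 typically used in manual calibration. This is the learned-weight insight that paper §6.4 wants:

SGD finds that "moving a cohort costs more than people expect": manual calibration tends to underestimate move cost.

→ The real selling point of §6.4 is that "learned weights yield a finding that goes beyond intuition".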

5. RQ4 gap: LOTUS reference + apples-to-apples measurement

5.1 The original gap

Paper §6.5 needs to cover "the RPC fan-in cost of CN-only locks".

AURA full v27: 43% LOCAL takeover + 57% MN-CAS fallback
Full CN-only baseline: 100% LOCAL takeover (LOTUS-style)
Data: missing. A v25/v26-era CREST baseline of 232 KOPS @ 3CN × 40wh × 24t × 2c × cross_wh=0 exists (PROGRESS.md 2026-04-18), but it came from a different binary.

5.2 Apples-to-apples measurement (2026-05-14)

For a fair comparison, the same v27 binary was used to run 3 cells × 3 reps:

Cell	`--aura_mode`	W11 takeover	tried (KOPS)	commit (KOPS, median)	rate	atomic/txn	P50	P99
crest_baseline	baseline	off	20.8	0 ❌	0%	—	—	—
aura_local_only	full	off	13.7	13.4 ✅	97.7%	12.18	220 µs	8.6 ms
aura_v27_full	full	on	12.5	~0 ❌	0%	—	1.9 ms	4.7 ms

Configuration: 3 CN (amd118/107/112) × 1 MN (amd103) × 40 warehouses × 24 threads × 2 coros × cross_wh_ratio=0 × 60 s, exactly matching the configuration of the 232K baseline in PROGRESS.md.

5.3 Finding 1: the baseline path has bit-rotted on more than one binary

The crest_baseline cell produced 3 reps × 3 CN = 9 segfaults: the v27 binary (82cf079) crashes 100% of the time in baseline mode:

bash: line 1: 592365 Segmentation fault  bench_runner --type=cn --id=0 ...
    --aura_mode=baseline ...

To rule out the "v27 introduced a regression" hypothesis, a second experiment was run: git worktree add /tmp/crest-v26 0d415f9 created a worktree from the v26 Phase E era, which was rsynced to the 4 nodes, rebuilt, and run with the same baseline harness:

Experiment	binary commit	Result
apples-to-apples crest_baseline	v27 (82cf079)	CN segfault
v26 baseline reproduction	v26 (0d415f9)	same CN segfault
apples-to-apples aura_local_only	v27 (82cf079)	✅ 13.4 KOPS, runs cleanly

Corrected diagnosis:

The earlier hypothesis, "v27 introduced a baseline regression", is wrong
What actually happened: the baseline path had bit-rotted before v26. Somewhere in the 165 commits between edf3dc2 and 0d415f9, a change broke the baseline path's runtime compatibility. The code still compiles, but a CN that enters the baseline path cannot complete a run.
The binary that produced 232K on 2026-04-18 (PROGRESS.md) was commit edf3dc2 ("init: baseline before Candor M1"), a more vanilla version from before v25
The AURA full path is the only maintained path that runs on the current codebase

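Implications for the paper:

Paper §6.5 cannot use a "v27 13K vs CREST baseline 232K" framing ("AURA 17× slower than CREST" is misleading): 232K is historical data from 4 weeks earlier, on a different binary and a different cluster state, and no baseline-path data can be reproduced on the current cluster
§6.5 should cite 232K (PROGRESS.md 2026-04-18) with wording such as "Previously reported on the same hardware, prior to AURA integration", making clear that it is a historical reference rather than a fresh baseline
AURA's 13.4 KOPS is an absolute number, not a number relative to a baseline. The paper §6 story should rest on "what AURA itself can do", not on "AURA relative to CREST"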

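5.4 Finding 2: the AURA control plane is "pinned" at ~13 KOPS (independent of parameters)

Compare aura_local_only's 13.4 KOPS with the earlier 4-CN × 4-wh experiment:

Experiment @ configuration	Cell	KOPS
5-cell ablation @ 4CN × 4wh × 8t × 2c × cross_wh=15	full_v27	13.70
apples-to-apples @ 3CN × 40wh × 24t × 2c × cross_wh=0	aura_local_only	13.40

Nearly identical, even though the warehouse count (10×), the threads per CN (3×), and cross_wh (15 → 0) all changed.

→ The real bottleneck is the AURA control plane itself, not the parameter choice. Across both configurations measured so far, AURA stays "stuck" at around 13 KOPS.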

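5.5 Finding 3: an extremely heavy tail → the worker pool is not truly parallel

Latency distribution for aura_local_only rep0:

Quantile	latency	multiple of P50
P50	220 µs	1×
P90	2,060 µs	9×
P99	8,653 µs	39×
P99.9	41,142 µs	187×
mean	2,744 µs	12×

A P50 of 220 µs is actually not slow: of that, avg exec = 125 µs + validate = 30 µs + commit = 15 µs = 170 µs, so the AURA control plane adds only ~50 µs.

But the long tail pulls the mean up to 2.7 ms. Theoretical maximum throughput:

144 coords / 2,744 µs ≈ 52 KOPS  (if the worker pool really ran 144-way parallel; 144 = 3 CN × 24 threads × 2 coros)

Measured 13.4 KOPS = about 26% of the theoretical figure.

→ The worker pool is nowhere near 144-way parallel: long-tail txns (fallback / REMOTE / W11 path) hold worker slots. This is the same class of problem as the W11 backlog in Chapter 7: worker park-on-tail.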
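The bound above is a closed-loop estimate: each coordinator can finish at most one transaction per mean latency. The sketch below redoes that arithmetic from the numbers in this section; the function names are hypothetical, and the model assumes every coordinator always has a transaction in flight.

```python
# Numbers copied from 5.2 and the latency table above.
CNS, THREADS_PER_CN, COROS_PER_THREAD = 3, 24, 2
MEAN_LATENCY_US = 2_744
MEASURED_KOPS = 13.4
PERCENTILES_US = {"P50": 220, "P90": 2_060, "P99": 8_653, "P99.9": 41_142}


def coordinators():
    """Number of concurrent transaction coordinators in the cluster."""
    return CNS * THREADS_PER_CN * COROS_PER_THREAD


def ideal_kops(n_coords, mean_latency_us):
    """Closed-loop bound: n coordinators, one txn per mean latency each."""
    txn_per_second = n_coords / (mean_latency_us * 1e-6)
    return txn_per_second / 1e3


bound = ideal_kops(coordinators(), MEAN_LATENCY_US)
print(f"coordinators     : {coordinators()}")
print(f"ideal throughput : {bound:.1f} KOPS")
print(f"achieved fraction: {MEASURED_KOPS / bound:.1%}")
for name, value in PERCENTILES_US.items():
    print(f"{name:<6} = {value / PERCENTILES_US['P50']:.0f}x P50")
```

Without rounding, the bound is about 52.5 KOPS, so the achieved fraction comes out slightly under the 26% quoted above; either way the pool delivers roughly a quarter of its nominal parallelism.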

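5.6 Options for the missing §6.5 baseline (updated: the original three plus option D)

Option	Effort	v27 status
A. W11 finish	2-3 days, SI not guaranteed	punt (Chapter 7 §3)
B. LOTUS port	2 days (if open source)	to be evaluated
C. honest caveat row	0 days	current v27 choice
D. check out a historical binary and re-run 232K	1 hour	new option, planned

Option D is the part the apples-to-apples experiment could not complete: git checkout the last commit on which the baseline path still runs, rebuild, and run the same harness. Since the v26 worktree (0d415f9) already segfaults in baseline mode, this means going back toward edf3dc2, the commit behind the 232K number. The expected result is ~200K+, which would show that 232K is achievable on this hardware.

📊 Data: the apples-to-apples experiment data is in /tmp/aura_v27_apples_20260514_154124/summary.csv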

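5.7 Feasibility of a LOTUS port

Option B would require:

Obtaining the LOTUS code (academic release / open-source release?)
Reworking it to run the same TPC-C workload on the same hardware (CloudLab c6525-25g)
Aligning configuration parameters (partition count, CN count, txn mix)
Calibrating the results (making sure the port has not broken LOTUS)

Risks:

The LOTUS paper is published, but open-source code may not exist
Even with an open-source release, reworking it to CREST's schema requires data migration
Fairness doubts: LOTUS's optimizations may be tuned to the original authors' hardware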

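5.8 Caveat template for the current choice (C)

Chapter 7 §7.1 already gives it. The core points:

List the commits (so reviewers can verify)
State the root cause in one sentence (so reviewers know we understand it)
Make the caveat status explicit (not "we forgot" but "we know but cannot complete in time")

5.9 Decision point before submission

Two months before the deadline: decide between B and C
If the two-month window is enough for a LOTUS port: choose B
If not: choose C

v27 is currently on C, because the submission timeline is tight and B is high-risk.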

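6. RQ5 gap: what the 12-CN sweep needs

6.1 The gap itself

Paper §6.6 should show "how AURA's throughput trends as CNs scale out", typically as a 1 / 2 / 4 / 6 / 8 / 12 CN sweep.

What can run today: only the CloudLab 5-node experiment with 1 MN + 4 CN.

6.2 Blockers

Needed:

A 12-node cluster: a "new 7-node experiment" request exists (new.intelisys-pg0), but OFED is not installed yet
Config files: regenerate cloudlab_new_*.json (IPs and RDMA device numbers have changed)
Harness sweep support: parameterize run_v27_step6.sh by CN count

6.3 Time estimate

Cluster setup: 1 week (OFED + IOMMU + verifying atomic IOPS)
Harness changes: 1 day
Running experiments: 2-3 days (multiple rounds + multiple reps)

About 10 days of engineering time in total.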

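6.4 Is it a hard blocker for the paper deadline?

No. Reasons:

The main v27 contributions (5-dimension framing, A1-A4 levers, I1-I3 protocol) are already well supported by §6.1-§6.4
§6.6 is a "completeness" bonus, not a "necessary condition"
The submission can go out without RQ5 and add it at the rebuttal stage

But the §6 setup section must state clearly: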

“Our current evaluation uses a 1 MN + 4 CN configuration. CN scale-out trends (RQ5) are deferred to a 12-node setup in future work due to cluster setup timing.”

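🧠 Key insight: leaving the scale-out RQ as future work does not affect the paper's main line, as long as the §6.1 setup states "1 MN + 4 CN" explicitly. Reviewers are unlikely to reject outright over a missing 12-CN sweep.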

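7. Engineering pitfalls: rep / CI / SSH / Drift::Stop

Running the experiments surfaced five engineering pitfalls, each tied to a real case of "looked like it ran, but the data was suspect".

7.1 Number of reps

Pitfall: a single rep is noisy. An early single drift_5sec rep showed +33%, apparently "taking off" relative to baseline, but the 3-rep median showed that noise dominated.

Symptom: the number was reported to a collaborator, who pointed out that "a single rep is not robust enough".

Fix: run at least 3 reps per cell and take the median. After re-running drift_5sec with 3 reps, the median became "slightly above baseline but within noise".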
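One way to enforce the "at least 3 reps, report the median" rule is to make the summary function refuse anything less. The sketch below is a hypothetical helper, not part of the v27 harness; the sample values are the three baseline reps used in the bootstrap example in 7.2.

```python
import statistics

MIN_REPS = 3  # rule from the text: never report a cell from fewer reps


def summarize(kops_per_rep):
    """Return (median, relative spread) for one cell, or raise if under-sampled."""
    if len(kops_per_rep) < MIN_REPS:
        raise ValueError(
            f"need at least {MIN_REPS} reps, got {len(kops_per_rep)}"
        )
    median = statistics.median(kops_per_rep)
    # (max - min) / median is a crude noise indicator, not a confidence interval.
    spread = (max(kops_per_rep) - min(kops_per_rep)) / median
    return median, spread


median, spread = summarize([13.64, 13.71, 13.58])
print(f"median={median:.2f} KOPS, spread={spread:.1%}")
```

Comparing a candidate cell's gain against the baseline spread is a quick first filter before any formal interval is computed.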

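7.2 bootstrap CI

Pitfall: 3 reps are too few for a stable standard deviation, and reviewers will ask for a 95% CI.

Symptom: Reviewer 1 writes: "Your error bars are mean ± std with n=3. Can you provide proper 95% confidence intervals?"

Fix: use bootstrap resampling (1000 samples with replacement, taking the 2.5/97.5 percentiles as the CI).

Code (numpy):

import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the interval is reproducible
reps = np.array([13.64, 13.71, 13.58])  # KOPS, one value per rep
# Resample the reps with replacement 1000 times and keep each resample's median.
bootstrap = np.array([
    np.median(rng.choice(reps, size=reps.size, replace=True))
    for _ in range(1000)
])
ci_lo, ci_hi = np.percentile(bootstrap, [2.5, 97.5])
print(f"{np.median(reps):.2f} [{ci_lo:.2f}, {ci_hi:.2f}]")

→ Output: 13.64 [13.58, 13.71], which is easier to defend than mean ± std. Note that with n=3 the resampled median can only take one of the three observed values, so this interval is simply [min, max]; a meaningful 95% CI needs more reps per cell.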
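With only three reps the bootstrap distribution is small enough to enumerate exactly, which shows what the interval really contains. The sketch below lists all 27 equally likely resamples of the three values from the snippet above and counts the medians; it is an illustration of the method's limits at n=3, not an alternative procedure from the paper.

```python
import statistics
from collections import Counter
from itertools import product

reps = [13.64, 13.71, 13.58]  # same three reps as in the bootstrap snippet

# Every ordered resample-with-replacement of size 3: 3**3 = 27 cases.
medians = Counter(
    statistics.median(sample) for sample in product(reps, repeat=len(reps))
)
total = sum(medians.values())

for value in sorted(medians):
    print(f"median={value:.2f}  probability={medians[value] / total:.3f}")
```

Each extreme value appears as the median in 7 of 27 resamples, far more than 2.5%, so the 2.5 and 97.5 percentiles necessarily land on the minimum and maximum rep.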

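7.3 SSH disconnects

Pitfall: ssh drops midway through a 60 s run, the client log is truncated, and the summary line (which contains the throughput figure) is lost.

Symptom: when collecting data after the run, "part of the summary is missing"; ~50% of reps are wasted.

Fix:

ssh -i $CL_KEY \
    -o ServerAliveInterval=20 \
    -o ServerAliveCountMax=15 \
    -o TCPKeepAlive=yes \
    chaomei@128.110.219.14 ...

With ServerAliveInterval=20 and ServerAliveCountMax=15, the client probes the server after 20 s of silence and only drops the session after 15 unanswered probes, so the link can be unresponsive for up to 300 s, comfortably more than a 60 s run.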
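If the harness is driven from Python, the same options can be assembled in one place so that every remote command gets them. This is a minimal sketch: the function names, host, key path, and remote command are hypothetical, and only the three ssh options themselves come from the text.

```python
import shlex


def keepalive_budget_s(interval_s, count_max):
    """Seconds of unresponsiveness tolerated before ssh gives up."""
    return interval_s * count_max


def build_ssh_cmd(host, key_path, remote_cmd, interval_s=20, count_max=15):
    """Argument list for ssh with the keepalive options from the text."""
    return [
        "ssh", "-i", key_path,
        "-o", f"ServerAliveInterval={interval_s}",
        "-o", f"ServerAliveCountMax={count_max}",
        "-o", "TCPKeepAlive=yes",
        host, remote_cmd,
    ]


# Hypothetical host, key, and command, for illustration only.
cmd = build_ssh_cmd("user@cn0.example.org", "~/.ssh/id_key", "echo ok")
print(shlex.join(cmd))
print("keepalive budget:", keepalive_budget_s(20, 15), "s")
```

Passing the list to subprocess.run avoids shell quoting problems; the printed form is only for logging.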

7.4 Drift::Stop skipped on early return

Pitfall: some early-return paths in the client branch of BenchRunner.cc did not call Drift::Stop(), so the drift thread was never joined and the process crashed with "terminate called without an active exception".

Symptom: ~10% of reps ended with a client crash, after which the SSH session exited abnormally.

Fix: add Drift::Stop() at every early-return point in the client branch of BenchRunner.cc:

// BenchRunner.cc
if (config.type == "client") {
    Drift::Start(args);
    int rc = RunClient();
    Drift::Stop();   // must run before every return so the drift thread is joined
    return rc;
}

Fixed in commit 159381a.

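7.5 Core dumps filling the disk

Pitfall: the binary occasionally crashes during experiments → core dump (11 GB each) → 44 GB across 4 CNs → disk full → every later experiment fails.

Symptom: rsync deployment of the binary fails with "no space left".

First reaction: rm core/*. But core is a single file (not a directory), so the glob matches nothing and nothing is deleted.

Correct fix:

sudo rm -f /users/chaomei/CREST-Opensource-0007/core /tmp/core.*

Or, more cleanly, disable core dumps:

ulimit -c 0  # add at the top of the bench harness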
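For a launcher written in Python, the standard resource module can set the same limit, and shutil.disk_usage gives a cheap free-space check before deploying. This is a sketch for Unix hosts: the function names and the 20 GiB threshold are hypothetical, and it lowers only the soft limit, which is enough to stop core files from being written.

```python
import resource
import shutil

MIN_FREE_GIB = 20  # hypothetical threshold; pick one that fits the node's disk


def disable_core_dumps():
    """Set the soft core-size limit to 0 for this process and its children."""
    _, hard = resource.getrlimit(resource.RLIMIT_CORE)
    resource.setrlimit(resource.RLIMIT_CORE, (0, hard))


def free_gib(path="/"):
    """Free space on the filesystem holding `path`, in GiB."""
    return shutil.disk_usage(path).free / 2**30


def has_room(path="/", min_free_gib=MIN_FREE_GIB):
    """True when the filesystem has at least `min_free_gib` GiB free."""
    return free_gib(path) >= min_free_gib


disable_core_dumps()
print("core soft limit:", resource.getrlimit(resource.RLIMIT_CORE)[0])
print(f"free space on /: {free_gib('/'):.1f} GiB, enough: {has_room('/')}")
```

The limit is inherited only by processes started from this Python process on the same machine; binaries launched on remote nodes over ssh still need ulimit -c 0 in the remote shell.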

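🌟 Conclusion: all five pitfalls are cases of "looked like it ran, but the data was suspect", and each one silently degrades data quality.

⭐ Teaching point: these pitfalls all came out of iterating on negative results. Writing them into the §6.1 setup section keeps others from hitting the same ones when they reproduce the work.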

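✅ Self-check list

RQ1-RQ5 claims: can state from memory each RQ's claim and status (done / missing)
5-cell vs 16-cell: can explain why v27 chose 5-cell, the compute cost of 16-cell, and the "minimal interpretable vs exhaustive" trade-off
Correct reading of the static ablation: can explain why throughput-neutral levers are expected rather than a defect
drift_period_ms values: can explain why 5000 ms / 2000 ms are reasonable choices and 100 ms / 30000 ms are not (multiples of the planner tick), with 0 as the control
RQ3 dual presentation: can explain why the tuner ablation must show static and drift together
RQ4 options: can name the W11 finish / LOTUS port / caveat options (plus re-running a historical binary), with the effort and risk of each
RQ5 blocker: can recognize that RQ5 is not a hard blocker for the paper deadline and give the §6.1 setup disclaimer
Engineering pitfalls: can state from memory at least 3 pitfalls and their fixes (rep count / SSH keepalive / Drift::Stop / core dump / bootstrap)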

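📚 References

Introductory

v27 paper outline §6 + "What's missing for submission": the source of truth for evaluation gaps
"How to Write a Great Research Paper" (Simon Peyton Jones): a writing framework for the evaluation section

Key papers

DrTM+H (OSDI'18): an evaluation paradigm for DM transaction systems
LOTUS: candidate for the missing RQ4 baseline
FORD (FAST'22): reference for writing the §6 evaluation section of a DM transaction paper
"Statistical Methods in System Research": bootstrap CI methodology

Industry discussion

Module 0 "AI Systems Performance Engineering Methodology": general methodology for system performance evaluation
Module 23 "AURA Paper Deep Dive", Chapter 9 (Evaluation Methodology and End-to-End Reproduction): v25-era methodology
Module 15 "Dynamic Lock Ownership for Disaggregated Transactions", Chapter 8 (engineering patterns that make the experimental story hold together)

Framework docs (code anchors)

bench/aura/run_v27_step6.sh: RQ1 5-cell ablation
bench/aura/run_v27_a1b_smoke.sh: RQ3 tuner ablation
bench/aura/run_v27_drift_smoke.sh: RQ2 drift adaptation
bench/aura/run_v27_rdma_dispatch_smoke.sh: RDMA dispatch smoke (analyzed in Chapter 7)
benchmark/Client/Drift.cc: drift workload implementation

📎 v25 comparison: Module 23 "AURA Paper Deep Dive", Chapter 9 (Evaluation Methodology and End-to-End Reproduction). The v25-era methodology focused on end-to-end reproduction; v27 splits methodology and reproduction into Chapters 8 and 9 and adds the 5-cell vs 16-cell trade-off and the bootstrap CI pitfall.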
