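Chapter 8: Experimental Methodology, or the Engineering Patterns That Make the Story Hold Together

A beautiful design is worthless without credible experiments. This chapter distills the engineering patterns behind experiment design in the AURA research pipeline: how to choose baselines, how to compute bootstrap CIs, how to write "negative regimes" that show your trade-offs honestly, how to make a cross-hardware portability claim, and how to guarantee reproducibility. It does not teach statistics; it teaches how to turn the data you already have into an evaluation section that reviewers cannot pick apart. After reading it, you should be able to take your own data and independently design the experimental skeleton of paper §6.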

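📑 Contents

1. Baseline selection: staking out three points in the design space
2. Workload design: stable / drifting / unknown
3. Bootstrap CI: the minimum bar for reproducibility
4. Negative regimes: showing trade-offs honestly
5. Cross-hardware portability claims
6. Reproducibility package: scripts + configs + seeds, the full set
7. Writing pattern: every figure serves one claim
8. Eight questions reviewers will ask
Self-check list
References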

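1. Baseline selection: staking out three points in the design space

1.1 A baseline is not an "opponent"; it is an anchor in the design space

Many papers frame baselines as "the targets we have to beat", which is the wrong methodology. A baseline's real job is to anchor the design space: it lets reviewers see which corner AURA occupies and what sits in the other corners.

AURA's evaluation must cover at least four points in the design space: three baselines plus AURA itself, with an optional omniscient upper bound.

Baseline	Represents	Position in design space	Role
MN-only (FORD/CREST baseline)	All atomics go to the MN	Bottom left	Baseline bounded by the physical limit
routing-only (CREST routing)	Atomics still go to the MN, but with smart routing	Bottom middle	Shows that routing alone is not enough
static-owner (LOTUS reimplementation)	Atomics moved to the CN, using application priors	Upper middle	Shows that dynamic beats static
AURA	Atomics moved to the CN, learned online	Top right ⭐	Main system
(optional) omniscient owner	Optimal placement with knowledge of the future	Upper bound	"Theoretical ceiling"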

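1.2 Why you need an omniscient upper bound

The omniscient ("knows the future") baseline uses an oracle that sees the next 5 ms of accesses in advance and computes the optimal placement.

Role	Explanation
Puts a number on the "theoretical ceiling"	How far AURA is from omniscient
Exposes the cost of in-band learning	Omniscient pays no profiling overhead
Supports the "AURA is already near optimal" claim	If AURA reaches 95% of omniscient, the story is strong

🌟 Writing tip: in paper §6, "AURA achieves 92% of omniscient throughput" is more convincing than "AURA outperforms LOTUS by 2.3×".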
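
A trace-driven oracle of this kind can be sketched in a few lines. The version below is a minimal illustration under stated assumptions: the trace format `(timestamp_ms, cn_id, key)` and both function names are hypothetical, and the "majority owner per window" rule is a simple greedy stand-in for the oracle, not necessarily the true optimum and not AURA's actual implementation.

```python
from collections import Counter, defaultdict


def omniscient_placement(trace, window_ms=5):
    """Assign each key, per window, to the CN that will access it most.

    trace: iterable of (timestamp_ms, cn_id, key) tuples taken from a
           recorded run (hypothetical format).
    Returns {window_index: {key: cn_id}}.
    """
    # counts[window][key] is a Counter of accesses per CN in that window.
    counts = defaultdict(lambda: defaultdict(Counter))
    for ts_ms, cn_id, key in trace:
        counts[int(ts_ms // window_ms)][key][cn_id] += 1

    placement = {}
    for window, per_key in counts.items():
        placement[window] = {
            # Highest access count wins; ties go to the smallest cn_id
            # so the result is deterministic.
            key: min(cn_counts, key=lambda cn: (-cn_counts[cn], cn))
            for key, cn_counts in per_key.items()
        }
    return placement


def fraction_of_omniscient(system_ktps, omniscient_ktps):
    """The ratio reported as 'X% of omniscient throughput'."""
    return system_ktps / omniscient_ktps
```

Because the oracle reads the whole trace up front, it pays no profiling cost, which is exactly what makes it usable as an upper bound.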

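1.3 Baselines must not be "badly implemented by us"

⚠️ Common pitfall: reviewers will ask whether your LOTUS reimplementation was poorly tuned. AURA has to give the LOTUS reimplementation a fair implementation:

Use the critical-field configuration from the LOTUS paper (do not pick the hardest one)
The same OCC protocol and the same RDMA optimizations
The same hardware and the same number of client threads

🌟 Key principle: let the baseline beat AURA by a little on the workload it is good at. This actually makes the story more credible ("AURA does not win everywhere by default").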

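1.4 Implementation complexity of each baseline

Baseline	Implementation cost
MN-only (original CREST)	None (use as is)
routing-only	Low (add a routing layer to CREST)
static-owner (LOTUS)	Medium (implement critical-field hash + CN-side lock table)
AURA	High (the main body of this line of work)
omniscient	Medium (trace-driven offline optimal solution)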

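1.5 Baseline naming conventions

In the paper, name baselines by function, not by the paper they come from:

Wrong	Right
FORD-baseline	MN-only
LOTUS+	static-owner
our-system	AURA (or AURA-X for a specific ablation)

🍎 Intuition: a functional name tells the reviewer where the system sits in the design space from the name alone.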

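2. Workload design: stable / drifting / unknown

2.1 Three workload classes for three kinds of claim

AURA's core claim is that it "remains effective under weaker assumptions", so the workloads must cover:

Class	Workload characteristics	Expected AURA behavior
stable	Critical field known, no drift	On par with LOTUS (may trail by 0–3%)
drifting	Critical field flips within 1 s	Large lead (2–5×)
unknown	No obvious critical field (cross-table join)	Clear lead over LOTUS (LOTUS is not applicable)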

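2.2 Stable workload: standard TPC-C

TPC-C is the classic stable workload: wid is the critical field and its distribution is stable.

Setting	Value
Warehouses	40
Client threads	28
Coroutines per thread	3
Isolation level	Serializable (SR)
Measurements per configuration	Median of 3 + bootstrap CI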

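2.3 Drifting workload: a custom script

You have to build the drifting workload yourself. Design pattern:

# drifting_workload.py
import random

def gen_drift_workload(duration_s, drift_period_ms=500, avg_txn_duration_ms=0.1, seed=None):
    """
    Yield one warehouse id per transaction, flipping the hot set every
    drift_period_ms (default 500 ms):
    Phase 0: warehouses [1-10] hot (90% of accesses)
    Phase 1: warehouses [11-20] hot
    Phase 0: warehouses [1-10] hot again
    ...
    avg_txn_duration_ms advances the simulated clock per transaction; its
    default is a placeholder, so set it from your measured latency.
    """
    rng = random.Random(seed)  # private RNG so a fixed seed reproduces the trace
    t = 0.0
    while t < duration_s * 1000:
        # Derive the phase from the clock. Comparing t // drift_period_ms
        # against a 0/1 flag would flip the phase on every transaction after 1 s.
        phase = int(t // drift_period_ms) % 2
        hot = range(1, 11) if phase == 0 else range(11, 21)
        if rng.random() < 0.9:
            yield rng.choice(hot)
        else:
            yield rng.randint(1, 40)
        t += avg_txn_duration_ms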

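2.4 Unknown workload: cross-table TPC-C

Have AURA handle a scenario LOTUS cannot: transactions that share wid across tables.

Design	Explanation
NewOrder at 100% (the most complex TPC-C transaction)	Accesses all 7 tables
Clients do not declare a critical field	LOTUS degrades to routing-only
AURA discovers online that wid is the clustering key	Self-taught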

2.5 Exact mapping from workload to claim

Figure	Workload	Baseline	Claim
Fig 2	stable TPC-C	LOTUS, MN-only	AURA ≈ LOTUS, both ≫ MN-only
Fig 3	drifting	LOTUS, MN-only	AURA ≫ LOTUS (LOTUS degrades)
Fig 4	cross-table NewOrder-100%	LOTUS, MN-only	AURA ≫ LOTUS (LOTUS not applicable)
Fig 5 (ablation)	drifting	AURA without Loop A/B/C	Contribution of each component
Fig 6 (negative)	uniform random	LOTUS, MN-only	AURA = MN-only (when there are no hot keys)
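
One way to keep this mapping honest is to store it as data and check it mechanically before running anything. The sketch below is an illustration only: the dictionary layout and the `check_matrix` helper are assumptions, not part of the AURA repository, and the entries simply restate rows of the table above.

```python
# Hypothetical experiment matrix mirroring the figure/workload/baseline/claim table.
EXPERIMENTS = {
    "fig2": {"workload": "stable TPC-C", "baselines": ["LOTUS", "MN-only"],
             "claim": "AURA ~ LOTUS, both >> MN-only"},
    "fig3": {"workload": "drifting", "baselines": ["LOTUS", "MN-only"],
             "claim": "AURA >> LOTUS (LOTUS degrades)"},
    "fig4": {"workload": "cross-table NewOrder-100%", "baselines": ["LOTUS", "MN-only"],
             "claim": "AURA >> LOTUS (LOTUS not applicable)"},
    "fig6": {"workload": "uniform random", "baselines": ["LOTUS", "MN-only"],
             "claim": "AURA = MN-only when there are no hot keys"},
}


def check_matrix(experiments):
    """Return a list of problems: missing fields, or one claim reused by two figures."""
    problems = []
    claim_owner = {}
    for fig, spec in experiments.items():
        for field in ("workload", "baselines", "claim"):
            if not spec.get(field):
                problems.append(f"{fig}: missing {field}")
        claim = spec.get("claim")
        if claim:
            if claim in claim_owner:
                problems.append(f"{fig}: claim already used by {claim_owner[claim]}")
            else:
                claim_owner[claim] = fig
    return problems
```

An empty list from `check_matrix(EXPERIMENTS)` means every figure has a workload, at least one baseline, and a claim that no other figure repeats.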

3. Bootstrap CI: the minimum bar for reproducibility

3.1 Why a CI is required

Without CI	With CI
`AURA achieves 250 KTPS, LOTUS achieves 200 KTPS`	`AURA: 248±5 KTPS, LOTUS: 201±7 KTPS (95% CI)`
Reviewers wonder "was this number from a single run?"	Gap versus noise is visible at a glance

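3.2 The bootstrap algorithm

import numpy as np

def bootstrap_ci(samples, n_resample=10000, ci=0.95, seed=0):
    """
    samples: results of repeated runs (one median per run)
    returns (mean, low, high), where [low, high] is the percentile
    bootstrap CI of the mean
    """
    samples = np.asarray(samples, dtype=float)
    rng = np.random.default_rng(seed)  # fixed seed so the CI itself is reproducible
    # Resample with replacement and record the mean of each resample
    means = np.array([
        np.mean(rng.choice(samples, size=len(samples), replace=True))
        for _ in range(n_resample)
    ])
    low = np.percentile(means, (1 - ci) / 2 * 100)
    high = np.percentile(means, (1 + ci) / 2 * 100)
    return float(np.mean(samples)), float(low), float(high)


# Usage example
runs = [248.5, 251.2, 246.8, 250.1, 247.9]  # 5 runs
mean, low, high = bootstrap_ci(runs)
print(f"{mean:.1f} [{low:.1f}, {high:.1f}]")
# → 248.9 [low, high]; the exact bounds depend on the resampling seed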

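3.3 How many repetitions are enough

Repetitions	CI width	Use
3	Wide (noisy across data points)	Exploratory
5	Medium (recommended for the AURA paper)	Production
10	Narrow	Key figures
30	Very narrow	Overkill

🌟 AURA experiments default to 5 runs + bootstrap CI, which should be enough for USENIX reviewers.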
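
To see how the CI width in the table responds to the number of repetitions, you can run the bootstrap on synthetic runs. This is a standard-library sketch under loud assumptions: the runs are drawn from a made-up normal distribution (mean 250, stddev 3), not from AURA measurements, and both function names are hypothetical.

```python
import random
import statistics


def bootstrap_halfwidth(samples, n_resample=2000, ci=0.95, seed=0):
    """Half-width of the percentile bootstrap CI of the mean."""
    rng = random.Random(seed)
    means = sorted(
        statistics.fmean(rng.choices(samples, k=len(samples)))
        for _ in range(n_resample)
    )
    low = means[int((1 - ci) / 2 * n_resample)]
    high = means[int((1 + ci) / 2 * n_resample) - 1]
    return (high - low) / 2


def halfwidth_by_repeats(repeat_counts=(3, 5, 10, 30), mean=250.0, stddev=3.0, seed=42):
    """CI half-width for synthetic experiments with different repetition counts."""
    rng = random.Random(seed)
    widths = {}
    for n in repeat_counts:
        runs = [rng.gauss(mean, stddev) for _ in range(n)]  # synthetic, not measured
        widths[n] = bootstrap_halfwidth(runs, seed=seed)
    return widths


if __name__ == "__main__":
    for n, width in halfwidth_by_repeats().items():
        print(f"{n:>2} runs: half-width {width:.2f} KTPS")
```

The half-width tends to shrink roughly with 1/√n, but a single synthetic draw can break the trend, especially at 3 runs, which is one more reason to treat 3 repetitions as exploratory.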

3.4 How to draw error bars

import matplotlib.pyplot as plt

fig, ax = plt.subplots()
xs = ['MN-only', 'routing', 'LOTUS', 'AURA']
means = [180, 195, 220, 290]
errs = [10, 12, 8, 15]   # 95% CI half-widths

ax.bar(xs, means, yerr=errs, capsize=5)
ax.set_ylabel('Throughput (KTPS)')
plt.savefig('throughput.pdf')

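⭐ Writing convention: error bars show the 95% CI half-width, not the stddev. The half-width is roughly 2× the standard error of the mean (stddev/√n), so with 5 runs it lands close to the stddev itself (about 0.9× under a normal approximation, about 1.2× with Student's t), not at 2× the stddev.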
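
A percentile bootstrap interval is generally not symmetric around the mean, so a single "half-width" per bar can misstate it. The helper below is a small sketch (the function name is hypothetical) that converts `(mean, low, high)` triples, such as those returned by `bootstrap_ci`, into the two-row form that matplotlib's `yerr` argument accepts: lower distances in the first row, upper distances in the second.

```python
def asymmetric_yerr(cis):
    """Convert (mean, low, high) triples into a 2 x N yerr structure.

    Row 0 holds mean - low and row 1 holds high - mean, which is the
    layout matplotlib expects for asymmetric error bars.
    """
    lower = [mean - low for mean, low, high in cis]
    upper = [high - mean for mean, low, high in cis]
    return [lower, upper]


# Example with made-up intervals (not measured data):
cis = [(180.0, 171.0, 190.0), (290.0, 276.0, 303.0)]
yerr = asymmetric_yerr(cis)
# Pass it as ax.bar(xs, [c[0] for c in cis], yerr=yerr, capsize=5)
```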

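3.5 Using a statistical test to decide "significant"

When CIs overlap, you cannot simply say "AURA is significantly better than the baseline". Use a two-sample t-test. The code below uses Welch's variant, because independent runs of two systems are not paired and need not share a variance; a paired test such as stats.ttest_rel fits only when run i of both systems uses the same seed and conditions.

from scipy import stats

aura_runs = [248.5, 251.2, 246.8, 250.1, 247.9]
lotus_runs = [200.1, 202.5, 198.9, 201.7, 200.3]

# equal_var=False selects Welch's t-test for two independent samples
t, p = stats.ttest_ind(aura_runs, lotus_runs, equal_var=False)
print(f"p-value = {p}")  # a strict bar such as p < 0.001 is the safest basis for writing "significantly"

🧠 Key: when p > 0.05, do not write "AURA significantly outperforms"; write "AURA performs comparably to" instead.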
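
With only 5 runs per system, the normality assumption behind a t-test is hard to check. A permutation test is a common assumption-light alternative; note that this is a swapped-in technique, not the test the chapter prescribes. The sketch below uses only the standard library, and both function names are hypothetical.

```python
from itertools import combinations
from statistics import fmean


def permutation_p_value(a, b):
    """Exact two-sided permutation test on the difference of means."""
    pooled = list(a) + list(b)
    observed = abs(fmean(a) - fmean(b))
    extreme = total = 0
    # Enumerate every way to relabel len(a) of the pooled runs as "system A".
    for idx in combinations(range(len(pooled)), len(a)):
        chosen = set(idx)
        group_a = [pooled[i] for i in idx]
        group_b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        if abs(fmean(group_a) - fmean(group_b)) >= observed - 1e-12:
            extreme += 1
        total += 1
    return extreme / total


def describe_difference(p_value, alpha=0.05):
    """Pick the wording from the rule above based on the p-value."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must be in [0, 1]")
    return "significantly outperforms" if p_value < alpha else "performs comparably to"


aura_runs = [248.5, 251.2, 246.8, 250.1, 247.9]
lotus_runs = [200.1, 202.5, 198.9, 201.7, 200.3]
p = permutation_p_value(aura_runs, lotus_runs)
print(f"p = {p:.4f}: AURA {describe_difference(p)} LOTUS")
```

One design consequence worth knowing: with 5 runs per system there are only 252 relabelings, so the smallest two-sided p-value this test can produce is 2/252 ≈ 0.008. If you want to clear a p < 0.001 bar with a permutation test, you need more runs.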

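4. Negative regimes: showing trade-offs honestly

4.1 Why writing up negative regimes earns you points

Without negative regimes	With negative regimes
Reviewers think of them on their own → doubt	You raise them first → you come across as honest
"Why didn't you report this case?"	"The authors acknowledge that they do not win in scenario X, for reason Y"
Multiple rounds of review	Through review in one round

4.2 At least 3 negative regimes for AURA

Negative regime	Behavior	Explanation
Extremely uniform load	AURA = MN-only (no gain)	No hotspot to learn
Critical field known and never drifting	AURA slightly behind LOTUS (profiling overhead)	~0–3%
More than 50% cross-cohort transactions	AURA degrades to routing-only	OwnerRpc cost outweighs the gain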

4.3 How to phrase a negative result "softly"

Do not write "AURA fails on uniform workload". Write:

“On uniform workloads with no detectable affinity (Fig. 6), AURA correctly recognizes the absence of locality and falls back to MN-only path, achieving parity with the baseline. The profile overhead is bounded at 1.4% CPU.”

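🌟 The three-step method:

Acknowledge the scenario: state clearly which class of workload AURA does not win on
Explain the cause: explain it from the design (it is not a bug)
Quantify the cost: say exactly how large the profiling overhead is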

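4.4 Using ablation experiments to show component contributions

Configuration	Expected gap	Purpose
AURA-full	100%	Reference
AURA-no-LoopA	-5%	Value of Loop A
AURA-no-LoopB	-10%	Value of Loop B
AURA-static-cohort	-25%	Value of online learning
AURA-no-cooldown	-15% (jitter)	Value of debouncing

⭐ Each row maps to one paper claim, which makes ablations some of the most useful material for paper figures.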
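
The "expected gap" column is a relative change against AURA-full. A minimal sketch of that arithmetic follows; the function name and the throughput numbers are placeholders chosen to reproduce the table's percentages, not measured results.

```python
def ablation_deltas(throughput_ktps, reference="AURA-full"):
    """Percentage change of each ablated configuration relative to the reference."""
    ref = throughput_ktps[reference]
    return {
        name: (value - ref) * 100.0 / ref
        for name, value in throughput_ktps.items()
        if name != reference
    }


# Placeholder throughputs (KTPS), picked only to illustrate the calculation.
example = {
    "AURA-full": 200.0,
    "AURA-no-LoopA": 190.0,
    "AURA-static-cohort": 150.0,
}
for name, delta in ablation_deltas(example).items():
    print(f"{name}: {delta:+.0f}%")
```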

4.5 Packaging a negative regime as future work

   Negative regime: degradation when cross-cohort transactions exceed 50%
   ────────────────────────────────────────
   Possible remedies:
     - Introduce "soft sharing" cooperative locking between cohorts
     - Or learn hierarchical cohorts (cohort of cohorts)
   
   How to write it in paper §7:
     "While AURA degrades when cross-cohort transactions exceed 50%, 
      this points to an interesting direction: hierarchical cohort 
      learning. We leave this to future work."

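5. Cross-hardware portability claims

5.1 Three strengths of portability claim

Strength	Content	Effort
Weak ("the code is portable")	The same binary runs on both ConnectX-3 and ConnectX-6	Low
Medium ("performance is comparable")	AURA beats the baseline on every hardware platform	Medium
Strong ("scales with hardware generation")	AURA's gain grows with the atomic IOPS ratio	High

5.2 What to run on ConnectX-3

As noted in Chapter 2, CREST/Motor cannot run at all on ConnectX-3 (masked CAS is not supported). But you do not need to run the full stack. You can run:

Experiment	What to run
Measured atomic IOPS	`ib_atomic_bw` (no application stack needed)
FORD end to end	FORD works because it uses standard CAS
AURA emulation	FORD baseline + simulated owner table on ConnectX-3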

5.3 A concise way to write the portability experiment

   Section 6.X: Cross-hardware Portability
   ────────────────────────────────────────
   "We verify AURA's design generalizes across NIC generations.
    Table X reports atomic IOPS measured on:
      - ConnectX-3 (APT cluster)    : 2.6 Mpps
      - ConnectX-6 Dx (CloudLab)    : 5.8 Mpps
    
    AURA on ConnectX-3 (using FORD baseline + simulated owner)
    achieves 1.8× over MN-only, vs 2.5× on ConnectX-6 Dx.
    The relative gain scales with atomic IOPS pressure, 
    confirming the bottleneck identification in §3."

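🌟 Core writing technique: a portability claim does not require running the full stack on every hardware platform. Showing that the same phenomenon follows the same trend across generations is enough.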

5.4 Stating the limitations of cross-hardware experiments

⭕ Complement: the limitation must be stated explicitly:

"Limitations: Due to ConnectX-3's lack of masked atomic primitives, 
 we cannot run the full CREST/Motor protocol on the APT cluster. 
 Our cross-hardware claims are based on FORD as a baseline, 
 which uses standard CAS only."

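5.5 Why portability is useful

Value	Explanation
Strengthens the generality claim	"AURA does not depend on specific hardware"
Shows design rigor	The authors have considered hardware differences
Paves the way for future work	Heterogeneous NIC clusters are an open problem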

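6. Reproducibility package: scripts + configs + seeds, the full set

6.1 The USENIX Artifact Evaluation standard

In recent years the USENIX conferences (OSDI / NSDI / USENIX ATC) have all run Artifact Evaluation:

Level	Requirement
Available	Code can be downloaded
Functional	At least one experiment runs end to end
Reproducible	The paper's main figures and tables can be reproduced

AURA targets at least Functional, with Reproducible as a bonus.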

6.2 What the reproducibility package must contain

   aura-artifact/
   ├── README.md              # quickstart that runs within 5 minutes
   ├── docker/
   │   └── Dockerfile         # pinned environment (OFED 4.9 / boost 1.83 / ...)
   ├── scripts/
   │   ├── bootstrap.sh       # cluster fan-out
   │   ├── run_fig2.sh        # produces the Fig 2 data
   │   ├── run_fig3.sh        # produces the Fig 3 data
   │   └── plot.py            # renders figures from CSV
   ├── configs/
   │   ├── tpcc_stable.json
   │   ├── tpcc_drift.json
   │   └── tpcc_unknown.json
   ├── seeds/
   │   ├── workload_a.seed
   │   └── workload_b.seed
   └── data/                  # raw data used in our paper
       ├── fig2.csv
       └── ...
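
Before submitting the artifact, it is cheap to check that the layout above is actually present. The sketch below is a hypothetical helper, not a script from the repository; the `REQUIRED` list copies a subset of the paths shown in the tree.

```python
from pathlib import Path

# Paths taken from the directory tree above; extend as the artifact grows.
REQUIRED = [
    "README.md",
    "docker/Dockerfile",
    "scripts/bootstrap.sh",
    "scripts/plot.py",
    "configs",
    "seeds",
    "data",
]


def missing_artifact_entries(root, required=REQUIRED):
    """Return the required paths that do not exist under the artifact root."""
    root = Path(root)
    return [rel for rel in required if not (root / rel).exists()]


if __name__ == "__main__":
    missing = missing_artifact_entries("aura-artifact")
    print("artifact layout OK" if not missing else f"missing: {missing}")
```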

6.3 Standard structure of README.md

# AURA Artifact

## Quick start (5 minutes)
1. `git clone <repo>`
2. `docker build -t aura-artifact docker/`
3. `./scripts/bootstrap.sh ./configs/single_node.json`
4. `./scripts/run_fig2.sh`
5. Result CSV in `./results/fig2.csv`

## Hardware requirements
- Minimum: 1 c6525-25g node (CloudLab Utah)
- Full: 5 c6525-25g nodes (1 MN + 3 CN + 1 Coordinator)

## Time required
- Quick: 30 min (single config)
- Full paper figures: ~6 hours

## Reproducing each figure
- Fig 2: `./scripts/run_fig2.sh` → `./scripts/plot.py fig2`
- Fig 3: `./scripts/run_fig3.sh` → ...

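6.4 Workload seeds: the key to reproducibility

# workload generator with seed (assumes the seed file holds one integer as text)
import random
from pathlib import Path
random.seed(int(Path('seeds/workload_a.seed').read_text()))
generate_workload()

⭐ Key: every "random" workload must have a seed, so that reviewers can regenerate the same workload behind the paper's figures.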
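
The property reviewers rely on is simple: the same seed must yield the same transaction stream. A minimal sketch follows, assuming a toy generator that emits warehouse ids; the signature of `generate_workload` here is hypothetical and only stands in for the real generator.

```python
import random


def generate_workload(seed, n_txns=5, n_warehouses=40):
    """Return a reproducible list of warehouse ids for n_txns transactions."""
    # A private Random instance avoids interference from other code that
    # touches the global random module state.
    rng = random.Random(seed)
    return [rng.randint(1, n_warehouses) for _ in range(n_txns)]


first = generate_workload(seed=7)
second = generate_workload(seed=7)
print(first == second)  # same seed, same workload
```

Using a private `random.Random(seed)` rather than `random.seed(seed)` is a design choice: any library call that also draws from the global generator would otherwise silently shift the sequence.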

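6.5 Workflow for getting data back into the paper

   Lab → run_fig.sh → fig.csv → plot.py → fig.pdf → paper.tex
   ─────────────────────────────────────────────────────────
   Every time a figure in the paper changes:
   1. Adjust plot.py first (no need to rerun experiments)
   2. Rerun run_fig.sh only when data is actually missing
   3. Always generate figures from the CSV, never from in-memory state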
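
The CSV-first rule means the plotting script starts by reading a file, never by reusing variables left over from a run. The sketch below shows only the loading and summarizing half of such a script. The column names `system`, `run`, and `ktps` are an assumed schema, not the actual format of `fig2.csv`.

```python
import csv
import io
import statistics
from collections import defaultdict


def load_runs(csv_file):
    """Group per-run throughput by system from an open CSV text file.

    Assumed columns: system, run, ktps (hypothetical schema).
    """
    runs = defaultdict(list)
    for row in csv.DictReader(csv_file):
        runs[row["system"]].append(float(row["ktps"]))
    return dict(runs)


def summarize(runs):
    """Mean throughput per system, ready to hand to the plotting code."""
    return {system: statistics.fmean(values) for system, values in runs.items()}


# In-memory stand-in for a results file; real use would open results/fig2.csv.
sample = io.StringIO("system,run,ktps\nAURA,1,248.5\nAURA,2,251.5\nLOTUS,1,200.0\n")
print(summarize(load_runs(sample)))
```

Keeping the loader separate from the plotting call makes step 1 of the workflow cheap: changing colors or labels never touches the data path.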

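7. Writing pattern: every figure serves one claim

7.1 A 1:1 mapping from figure to claim

Each figure supports one claim. Do not pack several claims into one figure.

Figure	Claim
Fig 1 motivation	The physical atomic IOPS wall caps scaling
Fig 2 stable	AURA does not lose to LOTUS on stable workloads
Fig 3 drifting	AURA leads LOTUS by 2–5× under drift
Fig 4 unknown	AURA still clusters when there is no critical field
Fig 5 ablation	How much is lost when each component is removed (5–25%)
Fig 6 negative	Under extremely uniform load AURA = MN-only (profiling overhead ~1.4%)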

7.2 How to write figure titles

Bad title	Good title
`Throughput comparison`	`AURA matches LOTUS on stable TPC-C, gains 3× on drifting workloads`
`Latency vs throughput`	`Tail latency stays bounded under AURA's 5ms reconfiguration window`

🌟 Core principle: the figure title is the claim itself, not a description of the data.

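7.3 Five things every figure needs

Element	Required
Title (phrased as a claim)	✓
Clearly labeled x and y axes	✓
Error bars (95% CI)	✓
Consistent baseline colors (throughout the paper)	✓
A one-paragraph explanatory caption	✓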

7.4 Caption template

   Figure X: <claim>. <workload description>. 
   <key observation>. <secondary observation if any>.
   
   Example:
   Figure 3: AURA leads LOTUS by 3.2× on the drifting workload.
            We compare throughput on TPC-C with critical-field 
            drift every 500ms. AURA tracks the drift within 5ms 
            via online cohort learning, while LOTUS suffers from 
            its 100ms reactive window. Error bars show 95% CI 
            over 5 runs.
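
If captions are generated next to the plotting code, the template stays consistent across figures. The function below is a hypothetical helper that fills in the template above; the name and arguments are assumptions for illustration.

```python
def make_caption(number, claim, workload, observation, n_runs, secondary=None):
    """Build a caption: claim first, then workload, observations, and CI note."""

    def sentence(text):
        # Ensure each part ends with exactly one period.
        text = text.strip()
        return text if text.endswith(".") else text + "."

    parts = [f"Figure {number}: {sentence(claim)}", sentence(workload), sentence(observation)]
    if secondary:
        parts.append(sentence(secondary))
    parts.append(f"Error bars show 95% CI over {n_runs} runs.")
    return " ".join(parts)


print(make_caption(
    3,
    "AURA leads LOTUS by 3.2× on the drifting workload",
    "We compare throughput on TPC-C with critical-field drift every 500ms",
    "AURA tracks the drift within 5ms via online cohort learning",
    n_runs=5,
))
```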

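7.5 Writing pattern for tables

Tables are for "dense comparison": many baselines × many workloads × many metrics. In the example below, values are throughput in KTPS with 95% CI half-widths:

Configuration	TPC-C stable	TPC-C drift	Cross-table
MN-only	180±10	95±5	150±8
routing	195±12	102±6	168±10
LOTUS	220±8	110±9 (degraded)	N/A
AURA	218±15	285±12	240±11

🍎 Intuition: a table is the "panorama" and a figure is the "close-up"; the two complement each other.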

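8. Eight questions reviewers will ask

Prepare the answers in advance. This is the key to getting through review:

8.1 Q1: Was your LOTUS properly tuned?

Preparation:

Use every optimization mentioned in the LOTUS paper
On the stable workload, reach throughput at the same level as reported in the LOTUS paper
In the §6 experimental setup, state explicitly that "our LOTUS reimplementation follows Section X of the paper"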

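8.2 Q2: How was AURA's 5 ms chosen?

Preparation: the spectrum analysis from Chapter 7 + a measured sensitivity sweep (4 ms / 5 ms / 8 ms / 16 ms).

8.3 Q3: Why is the cohort size 200 rather than 100 or 500?

Preparation: a sensitivity experiment sweeping 50 / 100 / 200 / 500 / 1000, with a figure showing the sweet spot.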
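
Both sweeps share the same shape: run one experiment per parameter value, keep every result, and report the best. A minimal sketch follows, assuming a caller-supplied `run_experiment` function. The `toy_throughput` curve is a synthetic placeholder that peaks at 200 purely for illustration, and is not measured AURA data.

```python
def sweep(run_experiment, values):
    """Run one experiment per parameter value; return (best_value, all_results)."""
    results = {value: run_experiment(value) for value in values}
    best = max(results, key=results.get)
    return best, results


def toy_throughput(cohort_size):
    # Synthetic placeholder curve, NOT measured data.
    return 250.0 - abs(cohort_size - 200) * 0.1


best, results = sweep(toy_throughput, [50, 100, 200, 500, 1000])
for size, ktps in results.items():
    marker = "  <- sweet spot" if size == best else ""
    print(f"cohort size {size:>4}: {ktps:.1f} KTPS{marker}")
```

Returning the full `results` dictionary, not just the winner, matters for the paper: the sweet-spot figure needs every point, including the ones that lose.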

8.4 Q4: How does AURA handle multi-tenancy?

Preparation: in §7 (discussion), write "AURA does not currently isolate cohorts across tenants. We see this as a natural extension…"

8.5 Q5: How large is the engineering complexity?

Preparation:

LOC statistics: AURA adds X k lines of C++ on top of CREST
12 modules at ~Y k LOC each
Show the commit history in the README

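References

USENIX Reproducibility Guidelines: usenix.org/conferences/author-resources
Bootstrap Methods (Efron & Tibshirani, 1993): the classic statistics textbook

Key papers

Statistically Rigorous Java Performance Evaluation (Georges et al., OOPSLA'07): a guide to statistical rigor in systems papers
A Note on Reproducibility (Henderson et al., 2018): a reflection on reproducibility from the ML side that applies equally to systems

Industry discussion

ACM Artifact Review and Badging (v1.1): the official description of artifact levels
AURA paper §6 Evaluation design: paper_lock_ownership_cn/sections/6_evaluation.tex in this repository (if present)

Framework documentation

matplotlib / pgfplots documentation: figure rendering
CREST repository benchmark scripts: reuse existing experiment scripts
PROGRESS.md in this repository: the real log of the measurement process (can be included in the artifact)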
