Dynamic Lock Ownership for Disaggregated Transactions

Chapter 9: Hands-On End to End: Reproducing AURA on CloudLab/APT

The full reproducible path from reservation request through bootstrap, build, first successful run, CSV parsing, and bootstrap-CI computation; includes ConnectX-3 compatibility traps, Clash routing, and a troubleshooting quick reference

CloudLab APT hands-on reservation bootstrap ConnectX-3 troubleshooting AURA

If you have read the first eight chapters, AURA's design and methodology should already be in your head. This is the hands-on chapter: starting from requesting a reservation on CloudLab / APT, we run the whole experiment suite. It is not a README but an engineering guide with field notes: which ssh configs Clash hijacks, which lines must change on ConnectX-3, why memcached keeps binding the wrong address, why tmux is indispensable, and how to extend bootstrap_apt_cluster.sh to a new platform. By the end you should be able to produce a set of paper-ready data points within one working day.

1. Platform Selection and Reservation Strategy

1.1 Candidate platform comparison

Platform                | NIC                      | Nodes         | Atomic IOPS   | Role
CloudLab Utah c6525-25g | ConnectX-6 Dx 100Gb RoCE | 8 (on demand) | ~5–10 Mpps    | primary (full CREST stack)
CloudLab Utah d6515     | ConnectX-5 (TBD)         | 32            | ~3–5 Mpps     | backup (when c6525 is taken)
CloudLab Clemson r650   | ConnectX-6 Dx 100Gb      | 16            | same as c6525 | backup
APT c6220               | ConnectX-3 56Gb IB       | 12            | ~2.6 Mpps     | cross-hardware portability + same-hardware LOTUS reproduction

1.2 Reservation strategy

Phase                        | Strategy
Early exploration            | APT (easy to reserve, hardware is sufficient)
Main experiments / long runs | CloudLab c6525-25g (must secure)
Before the paper deadline    | reserve 1–2 weeks ahead to avoid the last-minute rush

1.3 Reservation request template (must be ASCII-only)

CloudLab's reservation form rejects non-ASCII characters; replace ×, —, “, ” and friends with ASCII first:

We are running a control-plane research project for disaggregated-memory
RDMA transaction systems, targeting a USENIX ATC submission. Our system,
AURA, extends CREST (open-source DM transaction system) with a 5ms-window closed-loop control
plane and shows 1.91x throughput improvement under extreme hot-record
regimes on TPC-C/40W. We specifically need c6525-25g because of its
Mellanox ConnectX-6 Dx NIC.

Hardware requirements:
- 5 x c6525-25g nodes (1 MN + 3 CN + 1 Coordinator)
- Mellanox ConnectX-6 Dx 100Gb RoCEv2

Duration: 5 days (3 days primary + 2 days buffer for retries)

Please grant reservation between 2026-05-12 and 2026-05-17.

Key point: a rejected request usually means the form's character validation tripped on non-ASCII. Grep your editor for [^\x00-\x7F] and replace every hit before submitting.
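The scrub can also be scripted; a minimal sketch (the function name and the replacement table are mine) that swaps the usual typographic offenders and reports anything it could not fix automatically:

```python
# find_non_ascii.py -- sanity-check text before pasting it into the CloudLab form.
# Sketch; function name and replacement choices are mine.
REPLACEMENTS = {"\u00d7": "x", "\u2014": "-", "\u2013": "-",
                "\u201c": '"', "\u201d": '"', "\u2018": "'", "\u2019": "'"}

def scrub(text: str) -> tuple[str, list[str]]:
    """Replace common typographic characters; return (clean text, leftovers)."""
    for bad, good in REPLACEMENTS.items():
        text = text.replace(bad, good)
    # anything still above 0x7F needs manual attention
    leftovers = sorted({c for c in text if ord(c) > 127})
    return text, leftovers

if __name__ == "__main__":
    draft = "AURA shows 1.91\u00d7 throughput \u2014 see \u201cFigure 3\u201d"
    clean, left = scrub(draft)
    print(clean)
    print(left)
```

Run it over the draft and submit only when the leftover list is empty.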

1.4 Estimating reservation length

Phase                                          | Estimate
Cluster bootstrap (first OFED install)         | 2–4 hours
One experiment group (1 workload + 1 baseline) | 30 minutes
5 baselines × 3 workloads × 3 repeats          | ~22 hours
Failure-recovery buffer                        | +10 hours
Total                                          | ~36 hours over 5 days

🌟 Play it safe: reserve 1.5× your estimate; there are always more surprises than you expect.


2. SSH and Clash Routing Pitfalls

2.1 Clash TUN mode hijacks SSH

Symptoms

ping 128.110.219.14   # OK, 0% packet loss
ssh chaomei@128.110.219.14   # hangs forever, eventually times out

Root cause: Clash's TUN mode captures all TCP traffic and misroutes CloudLab's SSH to the proxy; the proxy rejects it and the connection hangs. ICMP does not go through the TUN device, so ping still succeeds.

2.2 Three fixes

Fix                                                 | Pros                     | Cons
Turn off Clash TUN                                  | takes effect immediately | lose proxied browsing
Add IP-CIDR DIRECT rules ahead of Clash's rule list | permanent                | requires a config change
Connect by raw IP to bypass fake-ip                 | simple                   | must bypass every hostname

Recommended: add these at the very top of the Clash config:

rules:
  - IP-CIDR,128.110.219.0/24,DIRECT      # CloudLab Utah
  - IP-CIDR,222.195.68.87/32,DIRECT      # skv jump host
  - DOMAIN-SUFFIX,cloudlab.us,DIRECT
  - DOMAIN-SUFFIX,emulab.net,DIRECT      # APT
  # ... remaining rules

2.3 SSH config (~/.ssh/config)

Host cloudlab-mn
    Hostname amd103.utah.cloudlab.us
    User chaomei
    IdentityFile ~/Desktop/Working/id_ed25519

Host cloudlab-cn0
    Hostname amd118.utah.cloudlab.us
    User chaomei
    IdentityFile ~/Desktop/Working/id_ed25519

Host apt-mn
    Hostname apt052.apt.emulab.net
    User chaomei
    IdentityFile ~/Desktop/Working/id_ed25519

# skv jump host
Host skv1
    Hostname 222.195.68.87
    Port 6666
    User kvgroup
    IdentityFile ~/Downloads/ssh/id_rsa

# hop to CloudLab through skv (fallback when Clash DIRECT fails)
Host cloudlab-via-skv
    Hostname amd103.utah.cloudlab.us
    User chaomei
    IdentityFile ~/.ssh/id_ed25519_cloudlab
    ProxyJump skv1

2.4 Raw-IP fallback

Sometimes even hostnames fail (DNS hijacked inside Clash); fall back to raw IPs:

# instead of the hostname
ssh -i $CL_KEY chaomei@128.110.219.14   # same as amd103.utah.cloudlab.us

CloudLab node IPs are listed on the experiment dashboard. Keep an IP-to-host table handy:

amd103 → 128.110.219.14   # MN
amd118 → 128.110.219.29   # CN0
amd107 → 128.110.219.18   # CN1
amd112 → 128.110.219.23   # CN2
amd119 → 128.110.219.30   # spare

2.5 SSH failure triage

# 1. Is the host reachable at all? (ICMP)
ping <ip>

# 2. Is the SSH port open? (TCP 22)
nc -vz <ip> 22

# 3. Verbose SSH handshake log
ssh -vvv -i <key> user@<ip>

# 4. Is local Clash intercepting traffic?
curl --max-time 5 ipinfo.io
# if the returned IP is a Clash node's IP, traffic is being intercepted

3. One-Command Bootstrap: Fan-Out + Per-Node Setup

3.1 Core idea

   Control host (your Mac)

        │ rsync the OFED and boost tarballs to every node
        │ then fan out setup_apt_node.sh

   ┌─────────┬─────────┬─────────┐
   │ node0   │ node1   │ node2   │   ...   on each node, locally:
   │ setup   │ setup   │ setup   │           1. apt-mark hold conflicting packages
   │ OFED    │ OFED    │ OFED    │           2. install OFED user-space-only
   │ boost   │ boost   │ boost   │           3. install boost / absl / tbb
   │ verify  │ verify  │ verify  │           4. verify with ibv_devinfo
   └─────────┴─────────┴─────────┘

3.2 Key excerpt: bootstrap_apt_cluster.sh

#!/bin/bash
# Usage: ./bootstrap_apt_cluster.sh node0 node1 ... node11

set -e
NODES=("$@")
KEY=~/Desktop/Working/id_ed25519
LOCAL_OFED_TARBALL=~/.apt_bootstrap_cache/MLNX_OFED_LINUX-4.9-7.1.0.0-ubuntu20.04-x86_64.tgz
LOCAL_BOOST_TARBALL=~/.apt_bootstrap_cache/boost_1_83.tar.gz

# Step 1: rsync to all nodes in parallel
for n in "${NODES[@]}"; do
    (
        echo "[$n] rsync OFED + boost..."
        rsync -avz --progress -e "ssh -i $KEY -o LogLevel=ERROR" \
            "$LOCAL_OFED_TARBALL" "$LOCAL_BOOST_TARBALL" \
            "chaomei@$n.apt.emulab.net:~/bootstrap/"
    ) &
done
wait

# Step 2: run setup_apt_node.sh on all nodes in parallel
for n in "${NODES[@]}"; do
    (
        ssh -i $KEY -o LogLevel=ERROR "chaomei@$n.apt.emulab.net" \
            "bash -lc 'cd ~/bootstrap && bash setup_apt_node.sh 2>&1 | tee setup.log'"
    ) &
done
wait

# Step 3: verify
for n in "${NODES[@]}"; do
    echo "=== [$n] ibv_devinfo ==="
    ssh -i $KEY "chaomei@$n.apt.emulab.net" "ibv_devinfo | grep -E 'hca_id|active_mtu|state'"
done

3.3 Key excerpt: setup_apt_node.sh

#!/bin/bash
set -e

# 1. Unpack OFED and install user-space-only
tar xf MLNX_OFED_LINUX-4.9-7.1.0.0-ubuntu20.04-x86_64.tgz
cd MLNX_OFED_LINUX-4.9-7.1.0.0-ubuntu20.04-x86_64
sudo ./mlnxofedinstall \
    --user-space-only \
    --without-fw-update \
    --skip-distro-check \
    --without-depcheck \
    --force

# 2. Fix the mstflint dependency (force-purge OFED's copy, let the distro's own mstflint take over)
sudo dpkg -P --force-all mstflint || true
sudo apt install -y mstflint

# 3. Mass-purge conflicting inbox packages
sudo apt-mark hold libibverbs1 librdmacm1 libfabric1 libucx0
for pkg in libfabric1 libucx0 libopenmpi3 libcaf-openmpi-3 \
           libboost-mpi-dev libboost-graph-parallel*-dev; do
    sudo dpkg -P --force-all "$pkg" || true
done

# 4. Build and install boost 1.83 (required by CREST)
cd ~
tar xf boost_1_83.tar.gz
cd boost_1_83
./bootstrap.sh
sudo ./b2 install --prefix=/usr/local   # sudo: installing under /usr/local needs root

# 5. Install absl + tbb
sudo apt-get download libabsl-dev
sudo dpkg --force-depends -i libabsl-dev*.deb || true
sudo apt install -y libtbb-dev

# 6. Verify
ibv_devinfo | grep -E "hca_id|active_mtu|state"
ldconfig -p | grep boost_system
echo "[setup_apt_node] OK"

3.4 Bootstrap time estimate

Step                                    | Time
rsync 372MB OFED to 12 nodes (parallel) | ~5 minutes
OFED install (per node)                 | ~10 minutes
boost build + install                   | ~15 minutes
verification                            | ~2 minutes
Total                                   | ~30–40 minutes

3.5 Common bootstrap failures

Failure                          | Cause                                         | Fix
OFED install hangs               | apt lock held                                 | sudo killall -9 apt apt-get; sudo dpkg --configure -a
boost build errors               | wrong g++ version                             | install g++-11 and build with --toolset=gcc-11
ibv_devinfo reports "no devices" | user-space-only OFED skipped the inbox driver | apt install -y rdma-core
memcached fails to start         | system-started instance holds the port        | sudo systemctl stop memcached; pkill memcached

4. ConnectX-3 Compatibility Patch List

4.1 Required changes

File                              | Change                                 | Reason
src/rdma/QueuePair.cc:519         | drop IBV_EXP_QP_INIT_ATTR_CREATE_FLAGS | unsupported on ConnectX-3 (errno=95)
src/rdma/QueuePair.cc:525         | max_atomic_arg = 32 → 8                | ConnectX-3 caps at 8
src/util/Logger.h:78              | std::cout → std::cerr                  | log buffering
thirdparty/rlib/*.hpp (FORD only) | #define DEFAULT_GID_INDEX 3 → 0        | IB vs RoCEv2

4.2 patch 1: QueuePair comp_mask

// CREST-Opensource-0007/src/rdma/QueuePair.cc

// original (fine on ConnectX-5 and newer)
ibv_exp_qp_init_attr attr = {};
attr.comp_mask = IBV_EXP_QP_INIT_ATTR_PD
               | IBV_EXP_QP_INIT_ATTR_CREATE_FLAGS
               | IBV_EXP_QP_INIT_ATTR_ATOMICS_ARG;
attr.max_atomic_arg = 32;

// ConnectX-3 patch
ibv_exp_qp_init_attr attr = {};
attr.comp_mask = IBV_EXP_QP_INIT_ATTR_PD;          // ← keep only PD
attr.max_atomic_arg = 8;                            // ← cap at 8

4.3 patch 2: Logger buffering

// CREST-Opensource-0007/src/util/Logger.h:78

// original
inline std::ostream& Logger::os() { return std::cout; }

// ConnectX-3 patch
inline std::ostream& Logger::os() { return std::cerr; }

Why: std::cout is line-buffered on a terminal but fully buffered when redirected to a pipe or file. With bench_runner's output redirected to a log file the buffer grows large, so a crash loses the last log lines. std::cerr is unbuffered by default, so the final message before a crash survives.

4.4 patch 3: GID_INDEX (FORD only)

// Comparisons/myford/thirdparty/rlib/rdma_ctrl_impl.hpp
#define DEFAULT_GID_INDEX 0   // original was 3

Key: APT is real InfiniBand, not RoCEv2, so gid_idx must be 0.

4.5 Hard limit: CREST/Motor cannot run

⚠️ Important: CREST/Motor relies on IBV_EXP_WR_EXT_MASKED_ATOMIC_CMP_AND_SWP, which ConnectX-3 simply does not support. No patch fixes this; the hardware capability is missing.

Practical split

Platform       | What to run
CloudLab c6525 | main experiments: full CREST + AURA stack
APT ConnectX-3 | portability claim: FORD baseline + custom microbench
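To avoid discovering this limit mid-reservation, the launcher can be gated on NIC generation: ConnectX-3 registers under the mlx4 driver while ConnectX-4/5/6 use mlx5, so the hca_id line of ibv_devinfo output is enough to decide. A sketch; the script name and the returned labels are mine:

```python
# preflight_nic.py -- pick the runnable experiment set from ibv_devinfo output.
# Sketch; names are mine. ConnectX-3 shows up as mlx4_*, newer NICs as mlx5_*.
import re
import subprocess

def experiment_set(devinfo_text: str) -> str:
    """Map the first hca_id in `ibv_devinfo` output to what this node can run."""
    m = re.search(r"hca_id:\s*(\S+)", devinfo_text)
    if not m:
        return "no-rdma-device"
    hca = m.group(1)
    if hca.startswith("mlx4"):
        # ConnectX-3: no masked atomics, so CREST/Motor cannot run here
        return "ford-and-microbench-only"
    if hca.startswith("mlx5"):
        return "full-crest-aura-stack"
    return f"unknown-nic:{hca}"
```

Wire it into the launcher with `experiment_set(subprocess.run(["ibv_devinfo"], capture_output=True, text=True).stdout)` and refuse to start CREST on a ford-and-microbench-only node.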

5. Running It: the Full MN → CN → Workload Flow

5.1 Startup order (strict)

   1. Start memcached on every node (used for CN sync)
   2. Start bench_runner --is_mn=1 on the MN
   3. Wait until the MN listens on 12347 (the sync port)
   4. Start bench_runner --is_mn=0 on every CN
   5. CNs connect automatically and the workload begins
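Step 3 can also be driven from the control host by probing the sync port over TCP rather than polling ss on the node. A minimal sketch (helper name is mine); note the probe briefly opens and closes a connection, which the MN's accept loop must tolerate:

```python
# wait_for_port.py -- block until a TCP port accepts connections, or time out.
# Sketch; helper name is mine.
import socket
import time

def wait_for_port(host: str, port: int, timeout_s: float = 60.0) -> bool:
    """Return True once host:port accepts a TCP connection, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            # open and immediately close a probe connection
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            time.sleep(0.5)   # not listening yet; retry
    return False
```

Usage: `wait_for_port("128.110.219.14", 12347)` before launching the CNs, assuming the sync port is reachable from where the script runs.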

5.2 Starting the MN

ssh cloudlab-mn

# 1. Stop the auto-started system memcached
sudo systemctl stop memcached
sudo pkill memcached

# 2. Start memcached with the right flags
memcached -d -p 11211 -u $(whoami) -m 256 -c 1024 -l 0.0.0.0
ss -tlnp | grep 11211   # confirm it binds 0.0.0.0:11211, not 127.0.0.1

# 3. Start the MN
cd ~/CREST-Opensource-0007/build
tmux new-session -d -s mn ' \
    LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH \
    ./bench_runner --is_mn=1 --config=cloudlab_tpcc_3cn.json 2>&1 \
    | tee /tmp/mn.log'

# 4. Wait for the MN to reach the listen state
# (ss prints the state column before the address, so match on the port alone)
until ss -tln | grep -q ':12347 '; do
    sleep 0.5
done
echo "[MN] listening on 12347"

5.3 Starting the CNs

ssh cloudlab-cn0

cd ~/CREST-Opensource-0007/build
tmux new-session -d -s cn0 ' \
    LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH \
    ./bench_runner --is_mn=0 --node_id=0 --config=cloudlab_tpcc_3cn.json 2>&1 \
    | tee /tmp/cn0.log'

# verify the CN connected to the MN
sleep 5
tmux capture-pane -t cn0 -p | tail -20 | grep "connected to mn"

5.4 Monitoring a running workload

# live throughput output
ssh cloudlab-mn "tmux capture-pane -t mn -p | tail -30"

# abort rate
ssh cloudlab-cn0 "tmux capture-pane -t cn0 -p | grep -E 'tput|abort' | tail -10"

5.5 Graceful shutdown (mandatory)

# SIGTERM, not SIGKILL, so the RDMA destructors get to run
ssh cloudlab-cn0 "pkill -TERM bench_runner"
sleep 2
ssh cloudlab-cn0 "tmux kill-session -t cn0"

ssh cloudlab-mn "pkill -TERM bench_runner"
sleep 2
ssh cloudlab-mn "tmux kill-session -t mn"
ssh cloudlab-mn "pkill memcached"

⚠️ Never use -9. SIGKILL skips the destructors, MRs and QPs are never released, and the next run fails to start or gets OOM-killed.


6. Reading the CSVs and Computing Bootstrap CIs

6.1 CREST output CSV format

# tail of /tmp/cn0.log
[bench_runner] tput=248723.5 KTPS, abort_rate=0.026, p99_lat=120us
[bench_runner] writing CSV to bench_results.csv

CSV layout (the exact format depends on your patch):

run_id, workload, baseline, tput, abort_rate, p99_lat, p999_lat
1, tpcc_stable, aura, 248.5, 0.026, 120, 250
2, tpcc_stable, aura, 251.2, 0.024, 118, 245
...

6.2 Collecting CSVs from all nodes

# pull every CN's CSV back for local aggregation
# (get_ip: a helper of yours mapping node name to IP; see the table in 2.4)
for n in cn0 cn1 cn2; do
    scp -i $CL_KEY chaomei@$(get_ip $n):~/bench_results.csv \
        ./results/raw_${n}.csv
done

# aggregate (every CN runs the same workload, so throughput sums)
python3 aggregate.py ./results/raw_cn*.csv > ./results/agg.csv
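aggregate.py is referenced above but not shown. A minimal sketch of what it has to do, using the CSV columns from 6.1; the aggregation choices (sum throughput across CNs, average the abort rate, keep worst-case latency) are my assumption:

```python
# aggregate.py (sketch) -- merge per-CN CSVs into one row per run.
# Assumption (mine): CNs run the same workload, so throughput adds across
# nodes while latency is worst-case and abort_rate is averaged.
import sys
import pandas as pd

def aggregate(paths: list) -> pd.DataFrame:
    # skipinitialspace: the CSVs use ", "-separated headers and fields
    df = pd.concat([pd.read_csv(p, skipinitialspace=True) for p in paths])
    return (df.groupby(["run_id", "workload", "baseline"], as_index=False)
              .agg(tput=("tput", "sum"),
                   abort_rate=("abort_rate", "mean"),
                   p99_lat=("p99_lat", "max"),
                   p999_lat=("p999_lat", "max")))

if __name__ == "__main__":
    if sys.argv[1:]:
        aggregate(sys.argv[1:]).to_csv(sys.stdout, index=False)
```

Whether to sum or average each metric depends on how your bench_runner reports it; adjust the agg spec to match.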

6.3 Bootstrap CI script

# scripts/compute_ci.py
import pandas as pd
import numpy as np

def bootstrap_ci(samples, n_resample=10000, ci=0.95):
    samples = np.array(samples)
    means = np.array([
        np.mean(np.random.choice(samples, len(samples), replace=True))
        for _ in range(n_resample)
    ])
    return float(np.mean(samples)), \
           float(np.percentile(means, (1-ci)/2*100)), \
           float(np.percentile(means, (1+ci)/2*100))

df = pd.read_csv('./results/agg.csv')

for (workload, baseline), group in df.groupby(['workload', 'baseline']):
    samples = group['tput'].tolist()
    if len(samples) < 3:
        print(f"WARNING: {workload} / {baseline} only {len(samples)} runs")
        continue
    mean, low, high = bootstrap_ci(samples)
    print(f"{workload:15} {baseline:12} = {mean:7.1f} [{low:7.1f}, {high:7.1f}]")

6.4 Example output

tpcc_stable    mn-only      = 180.3 [173.5, 186.8]
tpcc_stable    routing      = 195.2 [189.1, 201.5]
tpcc_stable    lotus        = 220.7 [216.2, 225.3]
tpcc_stable    aura         = 218.4 [212.5, 223.9]
tpcc_drift     mn-only      =  92.1 [ 88.4,  95.8]
tpcc_drift     routing      =  98.7 [ 94.2, 103.1]
tpcc_drift     lotus        = 105.3 [101.0, 109.7]    # ← degrades
tpcc_drift     aura         = 287.9 [281.5, 294.4]    # ← big win

🌟 Copy this format straight into the paper: mean [low, high] is the usual presentation in USENIX-style venues.

6.5 Producing the figure

# scripts/plot_throughput.py
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

df = pd.read_csv('./results/agg.csv')
ci_data = compute_all_ci(df)  # helper built on the 6.3 script's bootstrap_ci

fig, ax = plt.subplots(figsize=(8, 5))

baselines = ['mn-only', 'routing', 'lotus', 'aura']
workloads = ['tpcc_stable', 'tpcc_drift', 'tpcc_unknown']
colors = {'mn-only': '#888', 'routing': '#4a90d9', 'lotus': '#f29c20', 'aura': '#d62728'}

x = np.arange(len(workloads))
width = 0.2

for i, b in enumerate(baselines):
    means = [ci_data[(w, b)]['mean'] for w in workloads]
    err_low  = [ci_data[(w, b)]['mean'] - ci_data[(w, b)]['low'] for w in workloads]
    err_high = [ci_data[(w, b)]['high'] - ci_data[(w, b)]['mean'] for w in workloads]
    ax.bar(x + i*width, means, width, yerr=[err_low, err_high],
           label=b, color=colors[b], capsize=4)

ax.set_xticks(x + width*1.5)
ax.set_xticklabels(workloads)
ax.set_ylabel('Throughput (KTPS)')
ax.legend()
plt.savefig('fig3_throughput.pdf', bbox_inches='tight')
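The plot script assumes a compute_all_ci helper that is never defined. A sketch that wraps bootstrap_ci from the 6.3 script into the (workload, baseline) → {mean, low, high} dict the plot indexes:

```python
# compute_all_ci (sketch) -- dict keyed by (workload, baseline) for plotting.
# Reuses the bootstrap_ci() logic from the 6.3 script.
import numpy as np
import pandas as pd

def bootstrap_ci(samples, n_resample=10000, ci=0.95):
    samples = np.array(samples)
    means = np.array([
        np.mean(np.random.choice(samples, len(samples), replace=True))
        for _ in range(n_resample)
    ])
    return (float(np.mean(samples)),
            float(np.percentile(means, (1 - ci) / 2 * 100)),
            float(np.percentile(means, (1 + ci) / 2 * 100)))

def compute_all_ci(df: pd.DataFrame) -> dict:
    """One CI entry per (workload, baseline) group, in the shape the plot expects."""
    out = {}
    for (workload, baseline), group in df.groupby(["workload", "baseline"]):
        mean, low, high = bootstrap_ci(group["tput"].tolist())
        out[(workload, baseline)] = {"mean": mean, "low": low, "high": high}
    return out
```
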

7. Troubleshooting: High-Frequency Failures

7.1 Quick-reference table

Symptom                          | Root cause                                         | Fix
MN never listens on 12347        | memcached bound to 127.0.0.1                       | stop it manually, restart with -l 0.0.0.0
RDMA reads return all zeros      | IOMMU not in passthrough mode                      | add iommu=pt and reboot
ib_atomic_bw fails with errno=95 | NIC's max_atomic_arg too low                       | set it to 8
Disk full (/tmp explodes)        | MN spinning with output redirected to a log        | use the systemd journal or tmux capture
Node panic                       | oversized MR registration                          | shrink the hash tables (TPC-C constants)
FORD deadlock (hangs on connect) | CN connected before the MN listen/accept was ready | wait for 12347 LISTEN before starting CNs
port_xmit_wait always 0          | ConnectX-3 firmware bug                            | fall back to abort_rate as an indirect signal
CN sync timeout                  | memcached not bound to 0.0.0.0                     | restart memcached with -l 0.0.0.0
Boost link errors                | LD_LIBRARY_PATH unset                              | export LD_LIBRARY_PATH=/usr/local/lib:...
Coroutine stack overflow         | too many coroutines                                | reduce the coroutine count (default 3 → 1)

7.2 Enabling IOMMU passthrough

# /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet iommu=pt"

sudo update-grub
sudo reboot

# verify
dmesg | grep -i iommu
# expect: "iommu: Adding device ... to group X"

7.3 Guarding against MR overflow

Stock CREST TPC-C at 40 warehouses needs a 38GB MR for its hash tables (mr_size = 32GB crashes every time). Our patch:

// CREST-Opensource-0007/benchmark/TPCC/TpccConstant.h
constexpr int MAX_PAYMENT_CNT = 100;     // stock value: 1000
constexpr int MAX_ORDERID = 3000;        // stock value: 20000

// with both changes the total data is 5.2GB, so mr_size=16 suffices

7.4 Disk-full firefighting

# stop the spinning processes immediately
ssh cloudlab-mn "pkill -TERM bench_runner; pkill memcached"
sleep 2
# delete the big logs
ssh cloudlab-mn "rm -f /tmp/mn.log /tmp/cn*.log"
df -h /tmp

7.5 Inspecting a deadlocked stack

# find the deadlocked process
ssh cloudlab-cn0 "ps -ef | grep bench_runner"

# inspect its kernel stack
ssh cloudlab-cn0 "sudo cat /proc/<pid>/stack"
# typical deadlock:
#   do_sys_poll
#   inet_csk_accept    ← blocked in accept
#   ...

# attach gdb for a closer look
ssh cloudlab-cn0 "sudo gdb -p <pid> --batch -ex 'thread apply all bt'"

8. After the Runs: Getting the Data into Paper §6

8.1 The standard data-to-paper workflow

   raw CSV → bootstrap CI → table ─────────────→ paper.tex
                        └──→ plot.py → figure ──→ paper.tex

8.2 LaTeX table template

\begin{table}[t]
  \centering
  \caption{Throughput (KTPS) on TPC-C/40W with 3 CN. Bootstrap 95\% CI from 5 runs.}
  \label{tab:throughput}
  \begin{tabular}{lccc}
    \toprule
    Configuration & Stable & Drifting & Cross-table \\
    \midrule
    MN-only        & 180 \mypm{6}  &  92 \mypm{4} & 150 \mypm{8} \\
    Routing-only   & 195 \mypm{6}  &  99 \mypm{5} & 168 \mypm{7} \\
    LOTUS          & 221 \mypm{5}  & 105 \mypm{4} & N/A \\
    \textbf{AURA}  & \textbf{218 \mypm{6}} & \textbf{288 \mypm{6}} & \textbf{240 \mypm{6}} \\
    \bottomrule
  \end{tabular}
\end{table}

8.3 LaTeX figure template

\begin{figure}[t]
  \centering
  \includegraphics[width=0.94\linewidth]{figures/fig3_throughput.pdf}
  \caption{AURA matches LOTUS on stable workloads but achieves $2.7\times$ on
           drifting workloads. Error bars: bootstrap 95\% CI from 5 runs.}
  \label{fig:throughput}
\end{figure}

8.4 Prose template

   Section 6.X: Throughput Comparison
   ───────────────────────────────────
   Figure 3 reports throughput across our three workload regimes. 
   On stable TPC-C, AURA (218 KTPS) performs comparably to LOTUS 
   (221 KTPS), with the small gap (-1.4%) attributable to AURA's 
   profile overhead. On drifting workloads where the critical 
   field reverses every 500ms, LOTUS suffers from its 100ms 
   reactive window and degrades to 105 KTPS, while AURA tracks 
   the drift within its 5ms control window and maintains 288 
   KTPS — a 2.7× advantage. The cross-table workload, where no 
   single critical field exists, is outside LOTUS's design 
   space; AURA's online learning identifies the cross-table 
   wid affinity and delivers 240 KTPS.

🌟 Writing tips

  • Describe the figure first ("Figure 3 reports…")
  • Then give the numbers, scenario by scenario
  • Give a reason for every number
  • Use the exact values from the paper's table

Turn the data backfill into a script:

# scripts/render_paper_data.py
template = """
\\textbf{{AURA achieves {aura_drift_speedup:.1f}x speedup over LOTUS on drifting workloads}}
"""

ci = load_ci('./results/ci.json')
print(template.format(
    aura_drift_speedup=ci['drift']['aura']['mean'] / ci['drift']['lotus']['mean']
))

8.6 Experiment completion checklist

  • Every datapoint run at least 3 times
  • Every figure has 95% CI error bars
  • At least one negative-regime figure
  • At least one ablation figure
  • LaTeX compiles without warnings
  • Artifact includes README + scripts + raw data
  • git tag paper-submitted-v1

✅ Self-Check List

  • Platform selection: can pick the right CloudLab profile for an experiment goal
  • Reservation request: can write an ASCII-only request
  • SSH/Clash: can describe the Clash TUN hijack symptoms and the IP-CIDR fix
  • Bootstrap: can extend bootstrap_apt_cluster.sh to a new hardware platform
  • OFED user-space-only: can list at least 3 install pitfalls
  • ConnectX-3 patches: can recite at least 3 file paths that must change
  • MN/CN race: can explain why CNs must start only after the MN listens
  • memcached binding: can explain why -l 0.0.0.0 is required
  • Graceful shutdown: can explain why kill -9 is forbidden
  • Bootstrap CI: can compute a 95% CI from the CSVs and put it in the paper
  • Troubleshooting: can describe the root cause of at least 5 frequent failures
  • IOMMU passthrough: can explain why it is needed and how to enable it
  • MR overflow defense: can describe the TPC-C constant adjustments
  • Data into the paper: can fill measured data into the LaTeX templates
  • Experiment checklist: can list at least 5 completion criteria

📚 References

Key papers

  • CREST (open-source DM transaction system): the experimental vehicle of sections 5/8
  • AURA paper §6 Evaluation Setup: this repo's paper_lock_ownership_cn/sections/6_evaluation.tex (if present)
  • Reproducing Network Research at the Click of a Button: Mininet and systems-paper reproducibility lessons

Project notes

  • This repo's PROGRESS.md: the real war-story log
  • This repo's CLAUDE.md: APT cluster + CloudLab configuration
  • bootstrap_apt_cluster.sh / setup_apt_node.sh: the one-command scripts