分离式事务的动态锁所有权 2026年5月10日

第5章：在线学习与 cohort 划分 —— 从访问图到 cohort 边界

拆解 AURA 的 AccessGraphProfiler / LockCohortGenerator / OwnershipPlanner 三件套：访问图怎么建、cohort 怎么 merge/split、owner 怎么放、滞回如何防抖动

AURA AccessGraph cohort online learning owner placement 滞回

第 4 章把 AURA 骨架立住了，但留下了一个核心问题：cohort 边界是怎么决定的、owner 是怎么选出来的。本章拆解 AURA 控制面的三件套——AccessGraphProfiler 在线建图、LockCohortGenerator 决定 merge/split、OwnershipPlanner 决定 owner 位置。读完你应该能写一个简化版 cohort 学习器的伪代码，并能解释为什么”加滞回比 PID 更好”——这是真实工程经验，论文里只能写一两句。

1. AccessGraph：边权与节点权的物理含义

1.1 图模型定义

AccessGraph $G = (V, E, w)$ ：

$V$ ：节点 = key_group（不是单 key，避免图爆炸）
$E$ ：边 = 共访问关系
$w$ ：边权 = 共访问频率（衰减后）+ 节点权 = 单点访问频率

   节点（key_group）          边（共访问）
   ─────────────────         ─────────────────
   • kg₁: warehouse[wid=1]     kg₁ ─ kg₂ : weight=120
   • kg₂: district[wid=1,*]    kg₁ ─ kg₃ : weight=85
   • kg₃: customer[wid=1,*]    kg₂ ─ kg₃ : weight=200
   • kg₄: stock[wid=1,*]       kg₃ ─ kg₄ : weight=150
   • kg₅: warehouse[wid=2]     kg₅ ─ kg₆ : weight=98
   • kg₆: district[wid=2,*]    ...

1.2 节点权与边权的物理含义

量	物理含义	用途
节点权 $w(v)$	该 key_group 在窗口内被独立访问的次数	决定该 cohort 的”热度”，影响 owner CPU 负载
边权 $w(u,v)$	该对 key_group 在同一事务内同时被访问的次数	决定它们是否应该在同一 cohort

🌟 核心原则：应该聚簇的 = 边权高 / 节点权也高——既经常被一起访问，又访问量大。低热度即使强相关也不值得做 cohort。

1.3 为什么用普通图而不是超边

超边（hyperedge）能更精确表示”3 个 key_group 同时被访问”，但：

维度	普通图（pairwise）	超图（hyperedge）
表达力	两两关系	任意子集关系
计算复杂度	O(\|V\|²)	O(2^\|V\|)
在线维护	容易	难
聚类算法成熟度	Louvain / spectral 都 OK	限于研究阶段

🍎 直觉比喻：超图是”完美但慢”，普通图是”够用且快”——AURA 用普通图配合周期性精化（spectral re-clustering）。

1.4 AccessGraph 的物理存储

struct AccessGraphNode {
    key_group_id_t  id;
    double          weight;          // 衰减后的节点权
    uint64_t        last_update_ts;  // 最近一次更新时间戳
};

struct AccessGraphEdge {
    key_group_id_t  u, v;
    double          weight;
    uint64_t        last_update_ts;
};

class AccessGraph {
    std::unordered_map<key_group_id_t, AccessGraphNode>           nodes_;
    std::unordered_map<KGPair, AccessGraphEdge, KGPairHash>       edges_;
    DecayConfig                                                    decay_;
};

1.5 5ms 窗口够用、1ms 太抖

为什么是 5ms？这是工程权衡：

窗口	优点	缺点
1ms	跟得上漂移	噪声大，单 ms 内只有 ~1k–10k samples
5ms	平衡	是 AURA 默认
10ms	噪声小	漂移检测延迟
100ms	决策稳定	LOTUS 风格，错过快漂移

🧠 关键洞察：5ms 窗口让单窗口 sample 数 ≈ 1 万级——足够用统计意义但又不至于错过快漂移。第 7 章会用频谱分析进一步论证。

2. 采样与衰减：trace 不能无限堆

2.1 为什么不能记录所有 trace

假设每秒 1 M 事务，每个事务平均访问 10 个 key_group：

每秒边事件：10 × 10 = 100 M / s
每边记录 16B → 1.6 GB/s 写入
一小时 5.7 TB

直接写完全不可行。必须用采样 + 衰减。

2.2 Reservoir sampling

经典 reservoir sampling：以 $\frac{k}{n}$ 概率保留每个 sample，最终保持 size = $k$ 的均匀样本：

class ReservoirSampler:
    def __init__(self, k):
        self.k = k
        self.reservoir = []
        self.n = 0
    
    def add(self, item):
        self.n += 1
        if len(self.reservoir) < self.k:
            self.reservoir.append(item)
        else:
            j = random.randint(0, self.n - 1)
            if j < self.k:
                self.reservoir[j] = item

⭕ 互补：reservoir 保证均匀性但对最近事件不偏置——AURA 想”记住最近”，所以加衰减。

2.3 指数衰减：让旧 trace 自然消失

class DecayingCounter:
    def __init__(self, half_life_ms=5):
        self.value = 0.0
        self.last_update = now_ms()
        self.tau = half_life_ms / math.log(2)
    
    def add(self, x=1.0):
        elapsed = now_ms() - self.last_update
        self.value *= math.exp(-elapsed / self.tau)
        self.value += x
        self.last_update = now_ms()
    
    def read(self):
        elapsed = now_ms() - self.last_update
        return self.value * math.exp(-elapsed / self.tau)

🌟 关键性质：每个边权与节点权都用 DecayingCounter——5ms 半衰期意味着 25ms 前的 trace 权重已经降到 1/32。自然遗忘。

2.4 CN 侧 0.5–2% CPU 占用怎么来的

实测分解（c6525, TPC-C, 1M tps）：

项目	CPU 占用
TraceCollector record()	~0.3%
AccessGraph update（hash + 衰减）	~0.5%
周期性 graph snapshot（5ms 一次）	~0.4%
OwnerMap delta apply	~0.2%
合计	~1.4%

⭕ 互补：1.4% 不算什么，但热路径上每个事务只能花 ~10ns 在 trace——必须 hash + atomic add，不能上 lock。

2.5 Sketch 数据结构作为节省方案

更激进的做法：用 Count-Min Sketch 而不是精确边权：

方案	内存	精度
精确 hashmap	1000 cohort × ~10 邻居 = 10000 entries × 16B = 160KB	100%
Count-Min Sketch (4 hash × 4096 bucket)	64KB	概率性

AURA 默认精确（cohort 数不算多），但可选 Sketch 模式应对 cohort 爆炸。

3. cohort 边界决策：merge / split 触发条件

3.1 LockCohortGenerator 的工作流

   每 5ms 一次：
   ──────────────
   1. 取 AccessGraph snapshot
   2. 与上一窗口比 → diff
   3. 对每对相邻 cohort 计算"是否应该 merge"
   4. 对每个 cohort 计算"是否应该 split"
   5. 输出 merge/split deltas → OwnershipPlanner

3.2 merge 决策

触发条件：

  应当 merge(C1, C2) ⟺
      crossEdgeWeight(C1, C2) >= θ_merge × min(intraWeight(C1), intraWeight(C2))
      AND 持续 N=2 个窗口
      AND combinedSize(C1, C2) <= max_cohort_size

参数	经验值	解释
θ_merge	0.6	跨 cohort 边权达到内部 60% 才 merge
N	2	至少 2 个窗口稳定
max_cohort_size	~500 key_group	单 cohort 上限

🍎 直觉比喻：两个 cohort 之间”有 60% 的连线像他们内部一样紧密”才算”该合并”——避免噪声触发。

3.3 split 决策

触发条件：

  应当 split(C) ⟺
      存在 cut 使得 cutWeight / intraWeight <= θ_split
      AND 切后 |C1|, |C2| >= min_cohort_size
      AND 当前 owner 命中率 < θ_local OR 总热度 > 2 × peer_avg

参数	经验值	解释
θ_split	0.2	找到一个切口，让被切边权只占内部 20%
min_cohort_size	5	切完每个子 cohort 至少 5 key_group
θ_local	0.7	owner 命中率掉到 70% 以下时考虑切

3.4 找最优 cut：贪心而不是最大流

理论上找最优 cut 是 NP-hard（min-cut 在普通图上 P 但 weighted balanced cut 是 NPH）。AURA 用贪心：

def greedy_cut(cohort, graph):
    # 按节点权降序排
    sorted_kgs = sorted(cohort.members, key=lambda kg: -graph.node_weight(kg))
    # 用 weighted modularity 局部最优
    best_partition = None
    best_score = -inf
    for seed in sorted_kgs[:10]:  # 试 10 个种子
        part = grow_partition(seed, cohort, graph)
        score = modularity(part)
        if score > best_score:
            best_score = score
            best_partition = part
    return best_partition

⭐ 关键事实：贪心精度 ~85%（vs 最优），但耗时 ~50µs（vs 最优 ~10ms）。5ms 窗口下贪心是唯一选择。

3.5 滞回参数（hysteresis）

注意上面的 θ_merge / θ_split 都是单值，但实际工程中需要滞回带：

   θ_merge_low = 0.55, θ_merge_high = 0.65
   θ_split_low = 0.15, θ_split_high = 0.25

逻辑：

不在合并状态 → 边权升到 0.65 才触发 merge
已在合并状态 → 边权掉到 0.55 才回退

⭕ 避免：θ_merge=0.6 单值时，边权在 0.59–0.61 之间抖动会反复 merge/unmerge——OwnerMap 抖动 → AffinityRouter 命中率掉。

4. OwnershipPlanner：owner 放哪里

4.1 成本函数

  cost(cohort, owner_cn) =
      α × expected_atomic_offload      (越多越好，负号)
    - β × migration_cost                (一次性迁移 RDMA + drain 代价)
    - γ × cross_cohort_rpc_cost         (该 cohort 与其他 cohort 的预期 RPC 数)
    - δ × cn_load_imbalance             (该 CN 已经持有的 cohort 数)
    + ε × stickiness                    (不变 cohort 给 ε 奖励，避免抖动)

参数	经验值	物理含义
α	1.0	主目标：减 atomic
β	0.3	迁移成本权重
γ	0.5	跨 RPC 惩罚
δ	0.1	fairness
ε	0.2	sticky bias

4.2 用贪心 + local search 而不是 ILP

方案	复杂度	决策时间	5ms 内能算几次
ILP（最优）	NP-hard	100ms+	0
贪心 + local search	O(\|cohort\| × \|CN\|)	~500µs	10
周期性 spectral re-clustering	O(\|graph\|³)	数 s（异步运行）	0（异步）

AURA 默认贪心 + 每 1 秒做一次 spectral 矫正。

4.3 约束：CN 容量上限 + fairness

struct CnCapacity {
    cn_id_t  cn;
    uint32_t max_cohorts;          // 硬上限
    uint32_t max_lock_qps;          // 锁请求 QPS 上限
    double   current_cpu_pct;       // 实时 CPU 利用率
};

bool can_assign(cohort, cn) {
    return cn.current_cohorts < cn.max_cohorts
        && cn.expected_qps + cohort.qps < cn.max_lock_qps
        && cn.current_cpu_pct < 0.8;
}

🌟 fairness 重要性：owner CN 自己也会成为热点。如果一个 CN 持有 80% cohort，它就变成新瓶颈——退化到”集中式 owner”，相当于把 atomic 瓶颈搬到 CN 自己。

4.4 placement 决策伪代码

def plan_ownership(cohorts, cns, graph, prev_map):
    new_map = OwnerMap()
    pending = sorted(cohorts, key=lambda c: -c.weight)
    for c in pending:
        # 候选 = 该 cohort 内 key_group 当前最常被访问的 CN
        candidates = top_k_cns_by_access(c, k=3)
        best_cn, best_cost = None, inf
        for cn in candidates:
            if not can_assign(c, cn):
                continue
            cost = compute_cost(c, cn, graph, prev_map)
            if cost < best_cost:
                best_cost, best_cn = cost, cn
        if best_cn:
            new_map[c] = best_cn
        else:
            new_map[c] = FALLBACK
    return new_map

5. 防抖动：滞回 + 冷却窗口

5.1 抖动来源

来源	表现
短时 burst	1ms 内访问突增 → cohort 误升温
GC pause	CN 短暂不响应 → 误认为故障
NIC counter 周期对齐	多 CN 计数同时刷新 → 假漂移
sample 噪声	5ms 内 sample 数小，方差大

5.2 滞回带宽度估算

经验公式：

$\text{hysteresis\_width} = 2 \times \sigma_{\text{noise}}$

其中 $\sigma_{\text{noise}}$ 是单窗口 sample 的标准差。实测：

TPC-C 1M tps，5ms 窗口 → σ ≈ 0.05（边权相对值）
推荐滞回带宽 ≈ 0.10

5.3 冷却窗口（cooldown）

在迁移完成后强制最小 100ms 不再迁移同一 cohort：

class MigrationCooldown {
    std::unordered_map<cohort_id_t, uint64_t> last_migrate_ts_;
    
    bool can_migrate(cohort_id_t c) {
        auto it = last_migrate_ts_.find(c);
        if (it == last_migrate_ts_.end()) return true;
        return now_ms() - it->second > 100;
    }
};

⭐ 设计原则：迁移成本 ~150–350µs，太频繁迁移成本累积起来吃光收益。100ms 冷却窗口 = 至多 10 次/秒 → 1.5–3.5ms/s 在迁移——可接受。

5.4 抖动检测：当滞回不够时

如果滞回 + 冷却仍然抖动（实测中可能发生），AURA 有两道补救：

频率检测：cohort 在 1 秒内 merge/split > 3 次 → 强制锁定 cohort 状态 5 秒
进入”研究模式”：输出 detailed log，下个 epoch 起调高滞回带宽

🧠 关键洞察：任何在线学习系统都需要”知道自己学不动”的 fallback——AURA 的研究模式就是这个角色。

5.5 PID 控制 vs 滞回的工程对比

方案	优点	缺点
PID 控制	理论上有最优响应	调参困难，对噪声敏感
滞回 + 冷却	简单、鲁棒	调整迟缓

⭕ 互补：AURA 选滞回不是因为它更”先进”，而是因为易调试 + 可解释。生产系统更重要的是工程师能看懂控制逻辑，而不是数学最优。

6. 一段简化版伪代码

6.1 主循环

class AuraControlLoop:
    def __init__(self):
        self.profiler = AccessGraphProfiler(window_ms=5)
        self.generator = LockCohortGenerator(merge_low=0.55, merge_high=0.65)
        self.planner = OwnershipPlanner(alpha=1.0, beta=0.3, gamma=0.5)
        self.cooldown = MigrationCooldown(min_ms=100)
        self.transfer = TransferController()
        self.publisher = OwnerMapPublisher()
        
        self.current_map = OwnerMap()
    
    def tick(self):  # 每 5ms 调用一次
        # ❶ 取 trace snapshot
        graph = self.profiler.snapshot()
        
        # ❷ 决定 cohort 边界变化
        cohort_deltas = self.generator.diff(graph, self.current_map.cohorts)
        
        # ❸ 决定 owner placement
        new_map = self.planner.solve(graph, cohort_deltas, self.current_map)
        
        # ❹ 过滤掉冷却窗口内的迁移
        filtered_deltas = []
        for delta in new_map.diff(self.current_map):
            if self.cooldown.can_migrate(delta.cohort_id):
                filtered_deltas.append(delta)
        
        if not filtered_deltas:
            return
        
        # ❺ 执行 4 阶段迁移协议
        self.transfer.apply(filtered_deltas)
        
        # ❻ 发布新 OwnerMap
        new_epoch = self.publisher.publish(new_map)
        
        # ❼ 记录冷却时间戳
        for d in filtered_deltas:
            self.cooldown.mark(d.cohort_id)
        
        self.current_map = new_map

6.2 一个完整迭代示例

   t=0ms   profiler.snapshot() → 1024 nodes, 8742 edges
           generator.diff()    → 3 merges, 2 splits
           planner.solve()     → 5 owner changes
           cooldown.filter()   → 4 changes (1 在冷却中)
           transfer.apply()    → 4 个 4-phase migration 并行
   t=0.3ms 4 个 transition → TRANSFERRING
   t=0.5ms drain done
   t=0.6ms handoff done
   t=0.7ms publish epoch=42
   t=5ms   下一轮 tick

🌟 关键事实：单轮决策 ~700µs（含迁移），剩 ~4.3ms 给数据面跑事务——控制面不抢数据面 CPU。

6.3 这段代码的工程注意点

点	注意
profiler.snapshot()	必须无锁（用 RCU 或 epoch-based GC）
generator.diff()	比较成本高时只对 hot cohort 做
planner.solve()	最坏 ~500µs，超时则 reuse 上轮决策
transfer.apply()	必须串行化 cohort（同一 cohort 不能并发迁两次）
publisher.publish()	atomic CAS 推 epoch，失败重试

7. 与 LOTUS critical field 的对比

7.1 一个具体 TPC-C cross-table 场景

考虑 NewOrder 同时访问 7 张表的场景：

方案	cohort 怎么定义	atomic offload 比例
LOTUS（critical field=wid）	各表按 wid hash → 不同表的 wid=1 在不同 owner	0%（cross-table 无法 offload）
LOTUS（critical field=wid+表名复合）	复杂，需要应用知道每张表的”wid 字段名”	复杂度爆炸
AURA	自动识别 wid=1 的所有相关行 → 同 cohort	~95%

🌟 结论：AURA 在 cross-table 场景下相比 LOTUS 是数量级优势——LOTUS 的 critical field 模型本质上是单表的，跨表必须应用层手工拼。

7.2 在线学的好处不止于此

场景	LOTUS	AURA
schema 变更	应用必须重新声明 critical field	自动重新学
临时 hot key（如黑五）	反应式 100ms	5ms 适应
多 tenant 共享集群	必须按 tenant 切	自动按 tenant 聚簇（如果访问图分明）
索引选择影响访问 pattern	LOTUS 假设失效	AURA 自适应

7.3 AURA 的弱点：profile 开销 + 不完美 cohort

⭕ 互补：AURA 不完美：

在 critical field 已知 + 稳定的工作负载下，AURA 比 LOTUS 慢 0–3%（profile 开销 + 不完美 cohort）
在 cohort 数量爆炸（>10k）时，OwnerMap 可能撑爆 AffinityRouter 缓存
在跨 cohort 比例高（>50%）时退化到 routing-only

🌟 诚实的话：AURA 适用面大，但不是无敌。第 8 章会展开如何把这些边界做成 negative regime 实验。

Reservoir Sampling：Wikipedia
Count-Min Sketch (Cormode & Muthukrishnan, 2005) —— 替代精确 hashmap 的方案

关键论文

MorphoSys (et al., VLDB’21) —— 物理设计在线学习的代价模型设计参考
Lion (et al., 2024) —— 访问图驱动 partition 重排的 prior art
Louvain Community Detection (Blondel et al., 2008) —— 经典图聚类算法
Spectral Clustering (Ng, Jordan, Weiss, NeurIPS’01) —— AURA 周期性矫正用的算法

行业讨论

Hysteresis in Control Systems —— 经典控制论教科书章节
AURA 论文 §3.3 / §4 —— 在线学习与 owner placement 的详细描述

框架文档

CREST 仓库 txn/scheduler/ —— 路由实现参考
本仓库 caesar/ 项目早期 AccessGraph 原型

搜索