第11章：W14 Cohort merge 的稳定性陷阱 —— union-find、Jaccard 与倒排索引

第 10 章把 typed access graph 这块地基打好了：C_ww / C_wr 在控制线程手里，等待下游消费。本章是消费它的第一站 —— W14 CohortGenerator。论文 §3.4 的描述是”边权驱动贪心 merge + split 阈值滞回”，落到代码会撞到一个论文里没讲的工程陷阱：cohort_id 是按”成员里最小的 KG”算出来的，一个边缘成员的进出会让整个 cohort 的 id 跳变；id 跳变下游所有以 id 为键的状态（OwnerLockTable / TransferController 的 cohort 集 / 监控统计）跟着抖动。本章把这条工程线拆给你看：从 union-find 贪心、到 Jaccard 继承、到朴素 O(|known|×|proposed|) 把控制线程冲飞的踩坑、再到 W14.h 倒排索引的修复；最后用 4-CN W=4 的实测数据交代一个反直觉发现：cohort 越稳定，LOCAL hit 反而越低，这条线的工程目标和性能目标不重叠。

📑 目录

1. cohort_id 漂移 —— union-find 之后没说完的故事
2. Jaccard 继承 —— 把”会变的 cohort”变成”语义稳定的 cohort”
3. 朴素 Jaccard 的灾难 —— 控制线程跟不上 sample drain
4. W14.h 倒排索引 —— 把候选集从 ~1500 砍到 ~50
5. MERGE_CONFIRM_WINDOWS=2 —— streak 迟滞窗口
6. 实测反直觉 —— 1229 个 cohort，LOCAL hit 反而崩到 0.30%
7. 反直觉的根因 —— 稳定性目标和 LOCAL hit 目标不重叠
自我检验清单
参考资料

1. cohort_id 漂移 —— union-find 之后没说完的故事

W14 的输入是第 10 章产出的三张 typed edge map（C_ww / C_wr / C_rr）+ 每个 KG 的 EWMA 顶点权重。输出是一组 CohortPlan，每个包含一个 cohort_id 和一组 members。

合并算法就是教科书 union-find + 贪心：

// CohortGenerator.cc:75-124（节选）
// Step 1: seed cohorts from hot vertices.
UnionFind uf;
for (const auto& [kg, w] : vertex_weight) {
    if (w >= vertex_threshold) {
        uf.MakeSet(kg);
    }
}

// Step 2: build candidate edges, score = C_ww + α × C_wr.
absl::flat_hash_map<EdgeKey, double, EdgeKeyHash> combined;
for (const auto& [k, w] : edge_ww)  combined[k] += w;
for (const auto& [k, w] : edge_wr)  combined[k] += merge_alpha_wr * w;

// Step 3: walk edges by descending score, greedy merge.
std::sort(edges.begin(), edges.end(),
          [](const ScoredEdge& a, const ScoredEdge& b) {
              return a.score > b.score;
          });
for (const auto& e : edges) {
    uf.Union(e.a, e.b, max_cohort_size);
}

// Step 4: collect connected components, sort members,
// cohort_id = CohortIdFor(members.front()).

到 Step 4 这里——cohort_id 是怎么决定的？

// CohortGenerator.{h,cc}
struct CohortPlan {
    cohort_id_t id;
    std::vector<key_group_id_t> members;   // sorted ascending
    // ...
};
// id = CohortIdFor(members.front())   // 即 hash(成员里最小的 KG)

🧠 问题就在这里。同一组 KG 在两个相邻 tick 里，只要”最小的成员”换了一个——比如 KG=37 在 tick N 时刚刚跨过 PROMOTE_THRESHOLD、tick N+1 跌回阈值之下——整个 cohort 的 id 就从 hash(37) 跳到 hash(38)。

🍎 直觉比喻：把 cohort 想象成微信群。这套 id 方案相当于”群名就用群里资历最浅的人的名字”。资历最浅的人一退群，整个群就改了名字——别人（OwnerLockTable / TransferController）以为这是一个新群，把旧群的状态全丢了。

下游消费 cohort_id 的地方至少有：

OwnerLockTable —— 按 cohort_id 索引锁表条目；id 变了 = 表里多一条无主条目 + 老条目变孤儿
TransferController —— 4 阶段迁移协议按 cohort_id 编排；id 变了 = 当前在飞的 Freeze/Drain 突然找不到目标
OwnerMap —— 全局 cohort 集快照按 id 比对；id 变了 = 该 cohort 被视为”删一条 + 加一条”，触发不必要的 Publish

⭕ 不要小看这个抖动：实测中如果不管 id 漂移，4-CN W=4 跑 4000 txn 一轮，会产生 8000+ 次 OwnerMap.Publish（远远超过真实 cohort 数）。每次 Publish 走 §5 第 6 章讲的快照重建路径，几十微秒就这么白白吃掉。

🌟 结论：cohort_id = hash(min_member) 这条朴素方案有正确性但没有稳定性。要么换 id 算法、要么在这一层之上加”id 稳定性继承”逻辑。AURA 选了后者。

2. Jaccard 继承 —— 把”会变的 cohort”变成”语义稳定的 cohort”

稳定性的判据应该是”成员集合大致没变”，而不是”最小成员没变”。集合相似度的经典度量就是 Jaccard：

Jaccard(A, B) = |A ∩ B| / |A ∪ B|

W14 的修复就是引入一层 id 继承：每个 tick 产出 raw proposed cohorts 后，对每个 proposed 在历史 cohort 集合（all_known_cohorts_）里找 Jaccard 最高的那个，如果 overlap ≥ JACCARD_INHERIT_THRESHOLD 就继承它的 id，否则才 mint 新 id。

代码核心：

// AuraControlLoop.cc:636-650（节选）
auto jaccard_sorted =
    [](const std::vector<key_group_id_t>& a,
       const std::vector<key_group_id_t>& b) -> double {
    std::size_t inter = 0, uni = 0;
    auto ia = a.begin(), ib = b.begin();
    while (ia != a.end() && ib != b.end()) {
        if (*ia == *ib) { ++inter; ++uni; ++ia; ++ib; }
        else if (*ia < *ib) { ++uni; ++ia; }
        else { ++uni; ++ib; }
    }
    uni += static_cast<std::size_t>(a.end() - ia);
    uni += static_cast<std::size_t>(b.end() - ib);
    return uni == 0 ? 0.0
                    : static_cast<double>(inter) / static_cast<double>(uni);
};

两个细节值得注意：

two-pointer 而不是 hash set：因为 CohortGenerator 保证了 members ascending；all_known_cohorts_[cid] 也是 ascending（每次 inherit 后用新 members 覆盖）；两者直接走 two-pointer merge，单次 Jaccard 计算 O(|a| + |b|)，避免 hash set 的 allocate / hash 开销。
claim 锁互斥：

absl::flat_hash_set<cohort_id_t> claimed_this_tick;
// 每个 proposed cohort 抢一个最 best 的历史 id；
// 已被同一 tick 内别的 proposed 抢走的 id 不再候选 ——
// 避免两个 proposed cohort 因相似度都高而 collision 到同一 parent。

如果不加 claimed_this_tick，会出现两个 proposed cohort 都继承同一个 parent id 的语义崩坏。

🌟 结论一句话：Jaccard 继承把”id = hash(成员)” 这条 brittle 的方案，升级成”id 由成员重叠度决定”——只要 overlap 够高（默认 0.5），id 跨 tick 稳定。

3. 朴素 Jaccard 的灾难 —— 控制线程跟不上 sample drain

理论上 Jaccard 继承很美。落地后第一版（W14.g）是朴素 O(|known| × |proposed|) 搜索 —— 每个 proposed cohort 都和 all_known_cohorts_ 里所有 id 算一次 Jaccard，取 max。

跑 4-CN W=4 时崩了一个意料之外的故障：control thread 的第一个 confirmed cohort 出现在 tick 1901，而 W14g flag 关掉的对照组在 tick 465 就出现。

排查路径：

观察 1：control thread tick 数 / 时间偏低（理论 5ms/tick → 200 tick/s；
        实测只有 ~20-30 tick/s）
观察 2：edge_pending_ 队列长度持续涨（worker 推得快、control thread
        drain 不及，drain 后 edge_pending_ 永远不空）
观察 3：profile Tick() 函数 wall time —— 单 tick 一直跑 30-50ms，
        其中 99% 在 jaccard_sorted 调用栈里

🧠 根因：稳态下 all_known_cohorts_ 在 4-CN W=4 工作负载里能涨到 ~1500（cohort 来来去去都被记下来，retire 后才被 GC）。proposed.size() 通常 200-300。朴素双重循环：

1500 × 250 = 375K 次 Jaccard 调用
每次 Jaccard 平均 2-5µs（成员 vector ~20-50 长）
单 tick Jaccard 总时间 = 375K × 3µs ≈ 1.1 秒

5ms tick 间隔被打成 1.1 秒，control thread 慢 200×。worker 每秒产 ~80K 事务样本，control thread 每秒只能 drain 1 个 tick 的样本 ≈ 80K 样本，完全跟不上 sample 生产速率，最终内存被 sample 队列吃爆。

⭕ 这条故障的可怕之处：编译跑都正常，热路径完全没报错，只是 control thread 暗中变慢了。如果不主动 print tick 节奏 / 测 Tick wall time，根本不会发现这个 200× 退化。

这是模块十五第 5 章 §4 提过的 “PID 控制 vs 滞回的工程对比” 没覆盖的暗角：控制环节自身的计算开销，会反过来摧毁控制环的实时性。

4. W14.h 倒排索引 —— 把候选集从 ~1500 砍到 ~50

修复（W14.h）的核心 insight：绝大多数 proposed cohort 只会和很少几个历史 cohort 有重叠——具体来说，是和”成员 KG 出现过的历史 cohort”。维护一份反向映射就能把朴素双重循环变成”按需查表”。

新的数据结构：

// AuraControlLoop.cc:55-58（节选）
// W14.h: forward index
absl::flat_hash_map<cohort_id_t, std::vector<key_group_id_t>>
    all_known_cohorts_;
// W14.h: inverse index  kg → set of cohort_ids that contain it
absl::flat_hash_map<key_group_id_t, absl::flat_hash_set<cohort_id_t>>
    kg_to_known_cohorts_;

查找逻辑：

// AuraControlLoop.cc:666-690（节选）
absl::flat_hash_set<cohort_id_t> candidates;
for (auto& raw : proposed) {
    candidates.clear();
    candidates.reserve(raw.members.size() * 4);
    for (auto kg : raw.members) {
        auto it = kg_to_known_cohorts_.find(kg);
        if (it == kg_to_known_cohorts_.end()) continue;
        for (cohort_id_t cid : it->second) {
            if (!claimed_this_tick.count(cid)) {
                candidates.insert(cid);
            }
        }
    }
    // ↑ candidates.size() 通常 5-50，远小于 all_known_cohorts_.size() ~1500
    cohort_id_t best_id = INVALID_COHORT;
    double      best_j  = JACCARD_INHERIT_THRESHOLD;
    for (cohort_id_t cid : candidates) {
        // 真正的 Jaccard 计算只跑这 5-50 次
        const double j = jaccard_sorted(raw.members,
                                        all_known_cohorts_[cid]);
        if (j > best_j) { best_j = j; best_id = cid; }
    }
    // ...
}

复杂度对照表：

方案	per proposed cohort 复杂度	单 tick 总时间（4-CN W=4 稳态）
W14.g 朴素双重循环	`O(	all_known_cohorts_
W14.h 倒排索引	`O(	members

🌟 数量级修复：300× 加速，从”control thread 跟不上”变成”control thread 富余”。

⭐ 副作用：维护成本。倒排索引每次 inherit 都要做”先按旧成员擦反向条目，再按新成员添反向条目”：

// AuraControlLoop.cc:709-722
auto& slot = all_known_cohorts_[best_id];
for (auto kg : slot) {                       // 1. 删旧条目
    auto kit = kg_to_known_cohorts_.find(kg);
    if (kit != kg_to_known_cohorts_.end()) {
        kit->second.erase(best_id);
        if (kit->second.empty()) {
            kg_to_known_cohorts_.erase(kit);
        }
    }
}
slot = raw.members;
for (auto kg : slot) {                       // 2. 加新条目
    kg_to_known_cohorts_[kg].insert(best_id);
}

这部分加起来 ~O(|members| × 2) 写操作，对 max_cohort_size ≤ 500 来说每 tick 几十微秒，cost 几乎可忽略。

🧠 可推广的工程教训：任何”在历史集合里找最相似”的搜索，都该问一句”有没有反向索引能把候选集砍到亚线性”。这套套路在 ANN（vector index）、文档去重（SimHash + LSH）、字符串 fuzzy match（trigram inverted index）里反复出现，本质都是”用一份额外的索引换 200× 的查找速度”。

5. MERGE_CONFIRM_WINDOWS=2 —— streak 迟滞窗口

Jaccard 让 cohort_id 在跨 tick 间稳定，但还有一种漂移没解决：单 tick 产出的 proposed plan 本身有噪声——比如某个 worker 的 thread_local cache 还没 flush 干净、某个 5ms 窗口里事务样本恰好少。如果 control thread 每 tick 都把 proposed plan 直接 PublishCohortPlan，OwnerMap 会被频繁重发。

W14.e 加了一层 streak hysteresis：

// AuraControlLoop.cc:736-739（节选）
// proposed cohort_id present this tick → streak++ (capped)
// cohort_id absent this tick → streak-- (capped)
// commit when streak ≥ MERGE_CONFIRM_WINDOWS;
// retire when streak ≤ −SPLIT_CONFIRM_WINDOWS.

实际代码：

// AuraControlLoop.cc:755-765
for (auto id : all_ids) {
    if (proposed_by_id.count(id)) {
        int& s = cohort_streak_[id];
        s = (s < 0) ? 1 : std::min(s + 1, kStreakCapPos);
    } else {
        int& s = cohort_streak_[id];
        s = (s > 0) ? -1 : std::max(s - 1, kStreakCapNeg);
    }
}

参数：

// AuraConfig.h:43
static constexpr std::uint32_t MERGE_CONFIRM_WINDOWS = 2;

含义：一个 proposed cohort 必须连续出现 2 个 tick 才会被 publish。一个 publish 过的 cohort 必须连续消失 2 个 tick 才会被 retire。

📑 三种参数选型的 trade-off：

MERGE_CONFIRM_WINDOWS	行为	弊端
1（无迟滞）	立即 publish	OwnerMap 抖动严重，TransferController 跟着抖
2（当前）	一次 sample 噪声不会引起 publish	引入 5ms × 2 = 10ms 的稳定性延迟
5（保守）	真实漂移要 25ms 才反应	drift workload 跟不上、迁移决策滞后

⭕ 2 是工程实验定下来的甜点：模块十五第 7 章给的 drift workload 切换频率是 100-500ms（key 热点漂移）；10ms 的迟滞延迟比这低一个量级，不会让 cohort 决策跟不上。

🌟 结论：Jaccard 解决”id 跨 tick 稳定”；streak hysteresis 解决”proposed plan 抗噪”。两层防抖叠加，cohort 输出对下游就是真正稳定的语义对象。

6. 实测反直觉 —— 1229 个 cohort，LOCAL hit 反而崩到 0.30%

把 W13 + W14（含 .h 倒排索引 + .e streak hysteresis + Jaccard 继承）一并打开，跑 4-CN W=4 的 ablation 矩阵：

配置	4-CN W=4 聚合 commit (KOPS, 中位数)	vs baseline
baseline (CREST 原版)	185.24	—
W7.4-B 本地 takeover 独立 owner	189.28 (n=1，2 个 partial CN hang)	+2.18%
W14 access-graph cohort plan (本章这条)	184.86 (n=2)	−0.21%

控制环跑出来的 cohort 健康度指标：

指标	实测值	解读
总 cohort 数	1229	比 W8.5 hash_det 的 ~250 单 KG 多了 5×，确实把热点聚类了
multi-KG cohort 数	29	大部分仍是单 KG，但有 29 个真正的”簇”
最大 cohort size	42	单簇最多吸纳 42 个 KG
Jaccard inheritance ratio	100%	稳态后所有 cohort id 都从历史继承（W14.h 倒排索引生效）
LOCAL hit rate	0.30%	灾难 —— 跨 CN 写锁路径几乎全部走 MN CAS

🧠 第一眼读这张表：cohort 数翻 5×、最大 size=42（明显在合并）、id 100% 继承（W14.h 显然在工作）……所有”工程指标”都说 W14 在干活。但 LOCAL hit 从 W7.4-B 的 ~40% 崩到 0.30%——为什么？

这就是 cohort merge 这条线最反直觉的发现。

7. 反直觉的根因 —— 稳定性目标和 LOCAL hit 目标不重叠

要解释 0.30% LOCAL hit，需要回到 W7.4-B 这条 baseline 的工作方式。

W7.4-B 是怎么挣 +2.18% 的？

W7.4-B 没有 cohort merge，每个 CN 各自跑控制环、各自把”自己看到的 hot KG”提升成 OWNED@self_cn —— 用第 4 章的语言说，每个 CN 独立构建自己的小宇宙。两个 CN 可以同时把 KG=37 提升成 OWNED@cn0 和 OWNED@cn1，互不知情。这在 §3.4 不变式 I1 下是违法的（同 epoch 一个 cohort 只能一个 authority），但是只要事务 fallback 路径还在（OWNED 的 acquire 也会先 try local，失败 fallback to MN CAS），不变式被打破不会导致正确性问题。

W7.4-B 的 LOCAL hit ≈ 40% 就是因为”每个 CN 都觉得 hot KG 是自己的”——任意 CN 自己的事务，acquire 自己 OwnerLockTable 里的 KG 都是 LOCAL hit。

W14 是怎么把 LOCAL hit 干到 0.30% 的？

W14 + W18 plan consensus 一开，每个 cohort 的 owner 就由全局 argmax 决定（取决于哪个 CN 在 cross-CN access summary 矩阵里对这个 cohort 的访问最多）。也就是说，每个 cohort 只有一个真实 owner，其它 3 个 CN 看到这个 cohort 都得走 REMOTE_OWNER（要么 RPC takeover、要么 fallback to MN CAS）。

4-CN 平均切分：每个 CN 只 “拥有” 1/4 的 cohort —— 也就是说单 CN 自己跑事务，acquire 这个 cohort 的概率从 ~40%（W7.4-B “全是我的”）变成 ~25%（每 4 个 CN 抢一个）。再叠上”events 这个 cohort 的事务被 W15 router 改派到 home_cn”（home_cn ≠ self_cn 的话事务整个被 ExecuteTxn 派走），剩下能在本地命中的 cohort 比例进一步缩小。

🍎 直觉比喻：W7.4-B 让 4 个 CN 各自当 4 家小银行的”全能柜员”——每家都能服务所有客户，重复但快。W14 + W18 把它整成”4 家专科银行”——每家只服务自己专长的 1/4 客户，理论上更专业，但客户跑错家就要被指引到对的家，整个流程比”每家都接”反而慢。

🌟 关键洞察：W14 的”cohort 稳定性”是为正确实现 §3.4 的不变式 I1 服务的；它不是为吞吐 LOCAL hit 服务的。把 W14 当 LOCAL hit 优化点考核，就是在用斧头量水深——这是工程目标和性能目标的错配。

📑 W14 真正的价值在哪？

想法	是否成立	理由
W14 → 直接涨 LOCAL hit	❌	见上文，反而拉低
W14 → 直接涨吞吐	❌	−0.21%，噪声内甚至负
W14 → 让 §3.4 不变式 I1 真正可执行（同一 epoch 单一 authority）	✅	没有 cohort 稳定 id，I1 根本无法在工程上落地
W14 → 让 TransferController 4 阶段迁移协议有意义的对象	✅	freeze/drain/handoff/publish 的”目标”必须是稳定 cohort
W14 → 让跨 CN 共识有可比较的对象	✅	W18 的 plan consensus 需要”所有 CN 对同一个 cohort 集合达成共识”

⭕ 写论文 §6 的时候：W14 这条 ablation 行不能用”+−delta” 表征，应该用”enables I1 + TransferController + W18” 这种语言。可以挂的 metric 是：cohort_id 稳定率（100% inheritance）、multi-KG cohort 占比、迁移协议正确执行的次数。不能挂吞吐 delta。

🌟 本章 takeaway 一句话：W14 是为”分布式正确性”服务的层，不是为”单点性能”服务的层。把它当性能项打分，得到 −0.21% 是符合预期的；它的真正考核指标在第 12 章的 W15/W18 那条线上。

⭐ W15 事务级路由 + W16 跨 CN access summary + W18 plan consensus 让 W14 这条”稳定的 cohort 抽象”开始有意义——以及 W18 上线后再次反直觉的实测发现，见下一章。

✅ 自我检验清单

cohort_id 漂移：能说出”id = hash(min_member)” 这条朴素方案为什么不稳定，以及边缘成员进出会导致下游 OwnerLockTable / TransferController / OwnerMap 哪几处状态抖动。
Jaccard 继承：能写出 Jaccard overlap 的公式，知道 W14 用 JACCARD_INHERIT_THRESHOLD（默认 0.5）作为继承门槛；解释 claimed_this_tick 互斥的必要性（避免两个 proposed 抢同一 parent id）。
朴素 Jaccard 灾难：能复述”control thread 在 1500×250 双重循环下被打到 ~30 tick/s”的根因，以及为什么这是个静默故障（无报错、只是 control thread 暗中变慢）。
W14.h 倒排索引：能解释 kg_to_known_cohorts_ 的功能，候选集为什么能从 ~1500 砍到 ~50，以及维护成本（每次 inherit 要做 O(|members| × 2) 写操作）。
MERGE_CONFIRM_WINDOWS=2：能解释 streak hysteresis 为什么是 2 而不是 1 或 5；这个 5ms × 2 = 10ms 的迟滞延迟为什么比 drift workload 切换周期低一个量级是合理的。
W14 单 ablation 不动 / 反而 −0.21%：能解释这条数据不是 bug，根因是”W14 把 LOCAL hit 从 W7.4-B 的『每个 CN 都觉得 hot KG 是自己的』改成『每 cohort 只有一个真实 owner』”。
W14 真正的 ablation 价值：能说出 W14 不应该用吞吐 delta 考核，正确的 metric 是 cohort_id 稳定率 / multi-KG cohort 占比 / I1 可执行性——它是分布式正确性层、不是性能层。

📚 参考资料

概念入门

“Union-Find Data Structure” —— CLRS 第 21 章：union-find 经典实现 + 时间复杂度证明。W14 的 union-find 用 path compression + size 限制，单步 amortized 近 O(1)。
“Jaccard Index” —— Wikipedia：en.wikipedia.org/wiki/Jaccard_index，集合相似度定义 + 在文档去重 / 推荐系统里的经典应用。
AURA paper § 3.4 (本仓库 paper_lock_ownership_atc_en/sections/3_design.tex)：cohort merge / split / transfer / evict 的形式化描述；本章的工程实操对照点。

关键论文

“Stable Marriage and Stable Roommates” —— Gusfield & Irving, 1989：稳定匹配理论。Jaccard 继承本质上是”每 tick 做一次 raw cohort ↔ historical cohort 的稳定匹配”——更复杂的版本可以用 Gale-Shapley，但 AURA 选了 greedy claim + threshold 因为代价更低。
“Approximate Nearest Neighbor Search via Inverted Index” —— Bochkovskii et al., NeurIPS 2022：倒排索引在亚线性搜索里的现代工程实践，与 W14.h 的 kg→cohort_ids 反向映射本质同构。

行业讨论

“Why my real-time scheduler can’t catch up” —— Bryan Cantrill (DTrace blog, 2016)：实时控制环节自身计算开销摧毁实时性的经典案例，对应本章 §3 的故障模式。
“Tunable Hysteresis in Control Loops” —— LWN.net 2020-08 系列：迟滞窗口在 Linux thermal / cpufreq governor 里的工程实践；MERGE_CONFIRM_WINDOWS=2 的”甜点参数”思路同源。

框架文档

CohortGenerator 源码：本仓库 CREST-aura-impl/CREST-Opensource-0007/src/transaction/aura/CohortGenerator.{h,cc} —— union-find + greedy merge 的完整实现。
AuraControlLoop W14 段：src/transaction/aura/AuraControlLoop.cc:600-770 —— Jaccard / W14.h 倒排索引 / streak hysteresis 的完整代码。
absl::flat_hash_set / flat_hash_map：abseil.io/docs/cpp/guides/container —— 倒排索引选型为 flat_hash 而不是 std 容器的原因（小 value、insert/erase 频繁场景下快 1.5-2×）。