AI Systems Performance Engineering Methodology, May 8, 2026

Chapter 8: Ultra-Large-Scale Distributed Training

3D Parallelism + ZeRO + MoE + Sequence Parallelism + Fault Tolerance: 10,000-GPU training from a performance engineer's perspective (to be written)

Tags: 3D Parallelism, ZeRO, FSDP, MoE, Sequence Parallelism, Fault Tolerance, Placeholder

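⚠️ The body of this chapter has not been written yet. In the Early Release edition of the source book, AI Systems Performance Engineering (Chris Fregly, O’Reilly 2025), Ch8 is marked unavailable. The full text will be added once the final edition is released or equivalent material becomes available.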

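Where this chapter fits

This chapter ties together Goodput from Ch1, NVL72 from Ch2, communication from Ch4, and the GPT-4.5 / DeepSeek-V3 case studies from Ch10 into one complete operations guide for training at 10,000-GPU scale.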

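Planned coverage

3D parallelism (TP / PP / DP): how to partition, and when to extend beyond a single rack
ZeRO Stage 1/2/3 + FSDP: how to choose
Sequence Parallelism and Ring Attention
MoE training: expert parallelism + All-to-All overlap
Fault tolerance for training jobs: async checkpoint, in-memory state, preemption recovery
A troubleshooting SOP for "training suddenly slowed down"
Large-cluster observability: DCGM + Prometheus + Grafana templates
An engineering path for raising Goodput from 25-30% to 75%+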
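
Until the chapter body is available, the first item above (partitioning across TP / PP / DP and deciding when a group leaves the rack) can be illustrated with a small sanity check. This is only a sketch under stated assumptions, not the book's method: `plan_layout` and `ParallelLayout` are hypothetical names, the default NVLink domain of 72 GPUs is taken from the NVL72 configuration mentioned earlier, and ranks are assumed to be placed contiguously so that a TP group stays inside one rack only when the TP degree evenly divides the domain size.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParallelLayout:
    """Hypothetical container for a 3D-parallel layout (illustration only)."""
    tp: int           # tensor-parallel degree
    pp: int           # pipeline-parallel degree
    dp: int           # data-parallel degree, derived from the other two
    tp_in_rack: bool  # True if every TP group fits inside one NVLink domain


def plan_layout(world_size: int, tp: int, pp: int,
                nvlink_domain: int = 72) -> ParallelLayout:
    """Derive DP from world_size = TP * PP * DP and check TP placement.

    Assumption: ranks are laid out contiguously, so TP groups tile an
    NVLink domain without straddling it only when tp divides nvlink_domain.
    """
    if min(world_size, tp, pp, nvlink_domain) <= 0:
        raise ValueError("all sizes must be positive")
    if world_size % (tp * pp) != 0:
        raise ValueError("world_size must be divisible by tp * pp")

    dp = world_size // (tp * pp)
    tp_in_rack = tp <= nvlink_domain and nvlink_domain % tp == 0
    return ParallelLayout(tp=tp, pp=pp, dp=dp, tp_in_rack=tp_in_rack)


if __name__ == "__main__":
    # 128 racks of 72 GPUs each, as a made-up example cluster size.
    print(plan_layout(world_size=9216, tp=8, pp=16))
    print(plan_layout(world_size=9216, tp=16, pp=8))
```

In this sketch, a TP degree of 8 tiles a 72-GPU domain cleanly, whereas a TP degree of 16 does not, so some TP groups would straddle two racks and send their latency-sensitive traffic over the slower inter-rack fabric. A real planner would also account for memory per GPU, pipeline bubble size, and the actual topology reported by the cluster.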

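Until this chapter is complete, see

Module 3, Distributed Training (the whole module): the various parallelism strategies, Megatron / DeepSpeed
Module 0, Chapter 4, Distributed Communication and I/O Optimization: NCCL / SHARP / Magnum IO
Module 0, Chapter 10, case studies: GPT-4.5 / DeepSeek-V3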

📚 References

AI Systems Performance Engineering (Chris Fregly, O’Reilly 2025): learning.oreilly.com (pending release of Ch8 of the source book)
Hugging Face Ultra-Scale Playbook: a practical guide to large-cluster training
Megatron-LM: github.com/NVIDIA/Megatron-LM
DeepSpeed: github.com/microsoft/DeepSpeed