AI Systems Performance Engineering Methodology · May 8, 2026

Chapter 9: Multi-Node Inference Optimization

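vLLM / TensorRT-LLM / Dynamo selection decisions, speculative decoding, PD disaggregation, and KV-Cache optimization: inference serving from a performance engineer's perspective (to be completed)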

vLLM TensorRT-LLM Dynamo speculative decoding PD disaggregation KV-Cache placeholder

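⚠️ The body of this chapter is still to be written. In the Early Release version of the source book, AI Systems Performance Engineering (Chris Fregly, O’Reilly 2025), Ch9 is marked as unavailable. The full text will be added once the final edition is released or equivalent material becomes available.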

Chapter positioning

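This chapter applies the Module Zero methodology to inference serving. Goodput on the inference side was already defined in Ch1 (the fraction of requests served effectively, that is, within the SLO); this chapter lays out the complete decision path.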
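As a minimal sketch of the goodput definition recalled above, the snippet below counts the share of requests that both complete and meet their latency SLO. The `RequestStats` record, the choice of TTFT and TPOT as the SLO dimensions, and the threshold values are illustrative assumptions, not definitions taken from the book.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class RequestStats:
    """Hypothetical per-request record; field names are assumptions."""
    ttft_s: float    # time to first token, in seconds
    tpot_s: float    # mean time per output token, in seconds
    ok: bool = True  # True if the request completed without error


def goodput(requests: Sequence[RequestStats],
            ttft_slo_s: float,
            tpot_slo_s: float) -> float:
    """Fraction of all requests that completed and met both latency SLOs."""
    if not requests:
        return 0.0
    good = sum(
        1 for r in requests
        if r.ok and r.ttft_s <= ttft_slo_s and r.tpot_s <= tpot_slo_s
    )
    return good / len(requests)


if __name__ == "__main__":
    sample = [
        RequestStats(ttft_s=0.2, tpot_s=0.03),            # meets both SLOs
        RequestStats(ttft_s=0.9, tpot_s=0.03),            # TTFT too slow
        RequestStats(ttft_s=0.3, tpot_s=0.08),            # TPOT too slow
        RequestStats(ttft_s=0.1, tpot_s=0.02, ok=False),  # failed request
    ]
    print(goodput(sample, ttft_slo_s=0.5, tpot_slo_s=0.05))
```

Under this definition a system can raise raw throughput while lowering goodput, for example when larger batches push TTFT past the SLO, which is why the two are tracked separately.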

Planned coverage

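Inference engine comparison: a selection decision tree for vLLM / TensorRT-LLM / SGLang / Dynamo
Speculative decoding (a draft model proposes, the large model verifies): throughput / latency trade-offs
PD-disaggregated (prefill/decode) inference based on NIXL: which scenarios justify it and which do not
KV-Cache optimization: PagedAttention, prefix cache, cross-node KV sharing
Special challenges of long-context inference (>1M tokens)
Dynamic batching / continuous batching / chunked prefill
Serving multiple models on one GPU (MIG vs MPS vs time-slicing)
Inference serving observability: TTFT, TPOT, goodput, queue depth
Commercial vs open-source model inference: cost/perf comparison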
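Until the full text is available, the observability metrics named in the list above can be illustrated with a small sketch. It derives TTFT and TPOT for a single request from token arrival timestamps. The function name and the convention used for TPOT (mean gap between output tokens after the first) are assumptions; serving frameworks may define TPOT slightly differently.

```python
from typing import Sequence, Tuple


def ttft_and_tpot(arrival_s: float,
                  token_times_s: Sequence[float]) -> Tuple[float, float]:
    """Return (TTFT, TPOT) in seconds for one request.

    TTFT: time from request arrival to the first output token.
    TPOT: mean interval between consecutive output tokens after the first
          (assumed convention; 0.0 when only one token was produced).
    """
    if not token_times_s:
        raise ValueError("request produced no output tokens")
    ttft = token_times_s[0] - arrival_s
    n = len(token_times_s)
    if n == 1:
        return ttft, 0.0
    tpot = (token_times_s[-1] - token_times_s[0]) / (n - 1)
    return ttft, tpot


if __name__ == "__main__":
    # Request arrives at t=10.0 s; four tokens are emitted afterwards.
    ttft, tpot = ttft_and_tpot(10.0, [10.5, 10.75, 11.0, 11.25])
    print(f"TTFT={ttft:.3f}s TPOT={tpot:.3f}s")
```

Per-request values like these are usually aggregated into percentiles (for example p50 and p99) before being compared against an SLO.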

Until this chapter is completed, please refer to

📚 References

AI Systems Performance Engineering (Chris Fregly, O’Reilly 2025): learning.oreilly.com (pending release of the book's Ch9)
vLLM: github.com/vllm-project/vllm
NVIDIA TensorRT-LLM: github.com/NVIDIA/TensorRT-LLM
NVIDIA Dynamo: developer.nvidia.com/blog/introducing-nvidia-dynamo