Chapter 7: Inference Performance Analysis and Benchmarking

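Performance optimization starts with precise measurement: without comparable metrics, every "optimization" is guesswork. This chapter builds a complete performance-evaluation framework for inference: six core metrics, three load-testing tools, practices for reproducible benchmarks, and how to wire a performance gate into CI so that regressing commits are blocked automatically.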

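📑 Table of Contents

1. The Full Set of Inference Performance Metrics
2. Trade-offs Between Metrics
3. Load-Testing Tools: GenAI-Perf and Its Peers
4. Designing a Custom Load-Testing Script
5. Profiling Toolchain
6. The MLPerf Inference Benchmark
7. Performance Regression Gates
Self-Check Checklist
References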

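1. The Full Set of Inference Performance Metrics

Every performance report should include at least these six metrics:

Metric	Unit	Description
QPS	req/s	Service throughput in requests
TTFT P50/P95	ms	Time to first token; dominates perceived responsiveness
TPOT P50/P95	ms	Time per output token; determines how smooth generation feels
End-to-end Throughput	token/s	Total output rate
Peak GPU memory usage	GB	Capacity boundary
GPU utilization	%	How much of the compute is in use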

Leaving any one of them out can hide a bottleneck:

QPS only: misses tail-latency problems
Throughput only: misses latency regressions
GPU utilization only: misses wasted memory
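The P50/P95 entries in the table are percentiles over per-request samples. A minimal sketch of the nearest-rank method follows; the helper name `percentile` and the sample values are made up for illustration, and tools that interpolate may report slightly different numbers.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples <= it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))  # 1-based rank
    return ordered[rank - 1]

# Hypothetical TTFT samples in milliseconds, one per request.
ttft_ms = [80, 75, 90, 82, 400, 78, 85, 88, 79, 81]
print("P50:", percentile(ttft_ms, 50))
print("P95:", percentile(ttft_ms, 95))
print("mean:", sum(ttft_ms) / len(ttft_ms))
```

In this sample a single slow request puts the P95 at 400 ms while the median stays at 81 ms, which is the kind of tail effect a QPS-only report hides.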

2. Trade-offs Between Metrics

2.1 Throughput vs Latency

Raising concurrency increases throughput, but latency rises with it:

Latency
  │      ___---
  │  ___-
  │ -
  │
  └─────────────────── Throughput
   low            high

The goal is to maximize throughput while staying within the SLO.
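One way to act on this is to sweep concurrency and keep the highest-throughput point that still satisfies the SLO. The sketch below assumes a sweep has already produced one row per concurrency level; the function name, the field names, and all numbers are hypothetical.

```python
def best_operating_point(sweep, ttft_p95_slo_ms, tpot_p95_slo_ms):
    """Return the sweep row with the highest throughput that still meets both SLOs."""
    ok = [row for row in sweep
          if row["ttft_p95_ms"] <= ttft_p95_slo_ms
          and row["tpot_p95_ms"] <= tpot_p95_slo_ms]
    return max(ok, key=lambda row: row["tok_per_s"]) if ok else None

# Hypothetical sweep results: one row per concurrency level.
sweep = [
    {"concurrency": 1,   "tok_per_s": 60,   "ttft_p95_ms": 40,  "tpot_p95_ms": 16},
    {"concurrency": 16,  "tok_per_s": 850,  "ttft_p95_ms": 95,  "tpot_p95_ms": 19},
    {"concurrency": 64,  "tok_per_s": 2900, "ttft_p95_ms": 180, "tpot_p95_ms": 38},
    {"concurrency": 256, "tok_per_s": 4100, "ttft_p95_ms": 900, "tpot_p95_ms": 95},
]
best = best_operating_point(sweep, ttft_p95_slo_ms=200, tpot_p95_slo_ms=50)
print(best["concurrency"], best["tok_per_s"])
```

Concurrency 256 has the highest raw throughput in this made-up table, but it violates both limits, so the function settles on concurrency 64.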

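2.2 Memory Utilization vs Safety Margin

gpu_memory_utilization = 0.95  → high throughput, may OOM
gpu_memory_utilization = 0.80  → safe, slightly lower throughput

For production systems, 0.85-0.90 is strongly recommended, leaving 10-15% headroom for peak fluctuations.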
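The arithmetic behind this trade-off can be sketched as follows. It assumes the utilization cap covers weights, activations, and KV cache together (my understanding of vLLM's `gpu_memory_utilization`); the 80 GB card, 16 GB of weights, and 4 GB of activations are illustrative assumptions, not measurements.

```python
def kv_cache_budget_gb(total_gb, utilization, weights_gb, activation_gb):
    """Memory left for KV cache inside the utilization cap, plus the untouched headroom."""
    budget = total_gb * utilization
    kv_cache = budget - weights_gb - activation_gb
    headroom = total_gb - budget
    return kv_cache, headroom

for util in (0.80, 0.90, 0.95):
    kv, spare = kv_cache_budget_gb(80, util, weights_gb=16, activation_gb=4)
    print(f"util={util:.2f}  KV cache ~{kv:.0f} GB  headroom ~{spare:.0f} GB")
```

Under these assumptions, moving from 0.80 to 0.95 buys about 12 GB of extra KV cache (more concurrent sequences) at the cost of shrinking the headroom from 16 GB to 4 GB.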

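2.3 Batch Size vs P99

A larger batch size raises throughput, but the longest request in a batch holds up every other request in it, so P99 spikes. The effect is most pronounced with static batching.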
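A toy model makes the mechanism concrete. The sketch assumes static batching (a batch finishes only when its longest request does) and a fixed cost per decode step; the lengths and the 25 ms step time are invented for illustration.

```python
def static_batch_latencies(output_lengths, batch_size, ms_per_step=25):
    """Per-request latency when each batch runs until its longest request finishes."""
    latencies = []
    for i in range(0, len(output_lengths), batch_size):
        batch = output_lengths[i:i + batch_size]
        batch_time = max(batch) * ms_per_step  # everyone waits for the longest request
        latencies.extend([batch_time] * len(batch))
    return latencies

def p99(values):
    ordered = sorted(values)
    return ordered[min(len(ordered) - 1, int(len(ordered) * 0.99))]

# Hypothetical workload: 1024 requests of 100 output tokens, with 4 long ones (1500 tokens).
lengths = [1500 if i % 256 == 0 else 100 for i in range(1024)]
for bs in (1, 8, 32):
    lat = static_batch_latencies(lengths, bs)
    print(f"batch={bs:>2}  mean={sum(lat) / len(lat):8.0f} ms  p99={p99(lat):8.0f} ms")
```

With these inputs the four long requests are under 0.4% of traffic, so at batch size 1 the P99 stays at the short-request latency. At batch size 8 they hold up 32 requests (about 3%), which is enough to pull the P99 up to the long-request latency. Continuous batching lets finished requests leave the batch early, so real serving engines that use it should see a milder effect than this model suggests.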

3. Load-Testing Tools: GenAI-Perf and Its Peers

3.1 GenAI-Perf (NVIDIA)

Purpose-built for LLMs; reports TTFT, TPOT, and throughput in a single run:

# --streaming is needed for TTFT/TPOT to be measured;
# --endpoint-type chat selects the chat-completions request format
genai-perf profile \
    -m meta-llama/Llama-3-8B-Instruct \
    --service-kind openai \
    --endpoint-type chat \
    --endpoint v1/chat/completions \
    --streaming \
    --url http://localhost:8000 \
    --num-prompts 100 \
    --concurrency 64 \
    --random-seed 42 \
    --synthetic-input-tokens-mean 500 \
    --synthetic-input-tokens-stddev 50 \
    --output-tokens-mean 200 \
    --output-tokens-stddev 20 \
    --measurement-interval 30000

Example output:

Statistics:
  TTFT (ms):    avg: 87.3, p50: 80.1, p95: 156.7, p99: 220.4
  TPOT (ms):    avg: 24.5, p50: 23.8, p95: 38.2, p99: 52.6
  Throughput:   3127 token/s
  Request/s:    15.6

3.2 Triton Perf Analyzer

The official load-testing tool for Triton Inference Server. It works with any model type, not only LLMs.

perf_analyzer -m my_model -u localhost:8000 \
    --concurrency-range 1:64:8 \
    --measurement-interval 5000
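The `--concurrency-range 1:64:8` argument is a `start:end:step` triple. A small sketch of how such a range expands, assuming the sweep starts at `start` and advances by `step` while staying at or below `end` (this is how I read the Perf Analyzer documentation; `concurrency_levels` is a hypothetical helper, not part of the tool):

```python
def concurrency_levels(spec):
    """Expand a 'start:end:step' range into the list of levels it covers."""
    parts = [int(x) for x in spec.split(":")]
    start = parts[0]
    end = parts[1] if len(parts) > 1 else start
    step = parts[2] if len(parts) > 2 else 1
    return list(range(start, end + 1, step))

print(concurrency_levels("1:64:8"))
```

Under that reading the sweep visits 1, 9, 17, and so on up to 57, and never reaches 64 itself; a range such as `8:64:8` would be needed to land on round powers of two.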

3.3 vLLM's Built-in Benchmark

python benchmarks/benchmark_serving.py \
    --backend vllm \
    --model meta-llama/Llama-3-8B \
    --dataset-name sharegpt \
    --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
    --num-prompts 1000 \
    --request-rate 10
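`--request-rate` describes an open-loop load: requests are sent on a schedule regardless of how many are still in flight, unlike the closed-loop `--concurrency` setting used by the two tools above. A minimal sketch of such a schedule, assuming Poisson arrivals (my understanding is that the vLLM script draws inter-arrival gaps in a similar way, but treat that as an assumption; `poisson_arrival_times` is a hypothetical helper):

```python
import random

def poisson_arrival_times(request_rate, num_requests, seed=42):
    """Send times (seconds from start) with exponentially distributed gaps."""
    rng = random.Random(seed)
    t, times = 0.0, []
    for _ in range(num_requests):
        times.append(t)
        t += rng.expovariate(request_rate)  # mean gap = 1 / request_rate seconds
    return times

times = poisson_arrival_times(request_rate=10, num_requests=1000)
print(f"last request scheduled at ~{times[-1]:.1f} s")
```

Open-loop load is the better model of independent users: when the server slows down, requests keep arriving and queueing shows up in TTFT, whereas a closed-loop client silently lowers its own send rate.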

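4. Designing a Custom Load-Testing Script

4.1 Key Design Points

# Design points
1. Async concurrency: asyncio + AsyncOpenAI, to simulate real concurrent load
2. Streaming responses: required for an accurate TTFT (the time the first token arrives)
3. Real data: use the prompt distribution of real traffic, not a uniform one
4. Thorough warmup: run 50-100 warmup requests first to avoid cold-start effects
5. Repetition: run 3 times and take the median to smooth out single-run jitter
6. Reproducible configuration: put every parameter in config.yaml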
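Points 4 and 5 can be sketched as a small wrapper around whatever function performs one measurement pass. The names `measure_with_warmup` and `run_once` are hypothetical, and the fake results stand in for real benchmark runs.

```python
import statistics

def measure_with_warmup(run_once, warmup_runs=1, repeats=3):
    """Discard warmup passes, then return the median of the repeated measurements."""
    for _ in range(warmup_runs):
        run_once()  # result ignored: lets caches and lazy initialization settle
    samples = [run_once() for _ in range(repeats)]
    return statistics.median(samples), samples

# Stand-in for a real benchmark pass: each call returns a throughput figure (tok/s).
fake_results = iter([1200.0, 3050.0, 3400.0, 3100.0])  # the first value is the cold run
median, samples = measure_with_warmup(lambda: next(fake_results))
print(median, samples)
```

The median is preferred over the mean here because one noisy run (a background job, a thermal throttle) shifts the mean but usually leaves the median untouched.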

4.2 Complete Template

import asyncio, json, time
from openai import AsyncOpenAI

async def measure_request(client, prompt, max_tokens):
    """Send one streaming request and record TTFT, TPOT, and end-to-end latency."""
    t0 = time.perf_counter()
    first_token_time = None
    output_tokens = 0
    stream = await client.chat.completions.create(
        model="...", messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens, stream=True,
    )
    async for chunk in stream:
        # Some chunks (e.g. a trailing usage chunk) carry no choices or no text; skip them.
        if not chunk.choices or not chunk.choices[0].delta.content:
            continue
        if first_token_time is None:
            first_token_time = time.perf_counter()
        output_tokens += 1  # counts content chunks, which approximates tokens
    t_end = time.perf_counter()
    if first_token_time is None:
        return None  # no content came back; exclude this request from the statistics

    return {
        "ttft": (first_token_time - t0) * 1000,
        "e2e": (t_end - t0) * 1000,
        "tpot": (t_end - first_token_time) * 1000 / max(output_tokens - 1, 1),
        "tokens": output_tokens,
    }

async def benchmark(prompts, concurrency, url):
    client = AsyncOpenAI(base_url=url, api_key="dummy")
    sem = asyncio.Semaphore(concurrency)

    async def task(prompt):
        async with sem:
            return await measure_request(client, prompt, 256)

    # Time the whole run: throughput needs the wall-clock duration,
    # not the latency of the slowest single request.
    t_start = time.perf_counter()
    results = await asyncio.gather(*[task(p) for p in prompts])
    duration = time.perf_counter() - t_start
    results = [r for r in results if r is not None]

    # Statistics
    ttfts = sorted(r["ttft"] for r in results)
    tpots = sorted(r["tpot"] for r in results)
    total_tokens = sum(r["tokens"] for r in results)

    print(f"Requests: {len(results)}")
    print(f"TTFT P50: {ttfts[len(ttfts)//2]:.1f} ms")
    print(f"TTFT P95: {ttfts[int(len(ttfts)*0.95)]:.1f} ms")
    print(f"TPOT P50: {tpots[len(tpots)//2]:.1f} ms")
    print(f"TPOT P95: {tpots[int(len(tpots)*0.95)]:.1f} ms")
    print(f"Throughput: {total_tokens / duration:.0f} tok/s")

if __name__ == '__main__':
    with open('benchmark.jsonl') as f:
        prompts = [json.loads(line)['prompt'] for line in f]
    asyncio.run(benchmark(prompts, concurrency=64, url='http://localhost:8000/v1'))

4.3 A Reproducible Configuration File

# benchmark_config.yaml
model: "meta-llama/Llama-3-8B"
dataset: "ShareGPT_v3"
num_requests: 1000
concurrency: [1, 4, 16, 64, 256]
input_tokens: {mean: 1000, std: 200, distribution: "normal"}
output_tokens: {mean: 200, std: 50, distribution: "normal"}
warmup_requests: 100
repeat: 3
seed: 42

hardware:
  gpu: "H100-80GB SXM"
  num_gpus: 8
  driver: "535.129.03"
  cuda: "12.3"

framework:
  vllm_version: "0.6.4"
  config:
    tensor_parallel_size: 8
    enable_prefix_caching: true
    max_num_batched_tokens: 8192

Put this config, the script, and the dataset under a single git tag, so that anyone who checks it out can reproduce your results.
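Two small helpers make the config easier to use in practice: a fingerprint to stamp on every result file, and an expansion of the concurrency sweep into individual runs. This is a sketch with hypothetical function names, and it uses a dict holding a subset of the YAML fields so that it needs only the standard library.

```python
import hashlib
import json

def config_fingerprint(config):
    """Stable short hash of a config dict, independent of key order."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]

def expand_runs(config):
    """One run descriptor per (concurrency, repeat) pair."""
    return [{"concurrency": c, "repeat": r, "seed": config["seed"]}
            for c in config["concurrency"]
            for r in range(config["repeat"])]

# A subset of the config file, written as a dict for illustration.
config = {"model": "meta-llama/Llama-3-8B", "num_requests": 1000,
          "concurrency": [1, 4, 16, 64, 256], "repeat": 3, "seed": 42}
runs = expand_runs(config)
print(len(runs), "runs, config id", config_fingerprint(config))
```

Storing the fingerprint next to each result makes it obvious when two numbers being compared came from different configurations.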

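5. Profiling Toolchain

5.1 Four Levels of View

Tool	Level	Purpose
GenAI-Perf	Service	Inspect TTFT / TPOT / throughput
torch.profiler	Operator	Which op is slowest?
Nsight Systems	Whole pipeline	What is the GPU waiting on?
Nsight Compute	Kernel	What is this kernel's bottleneck?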

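5.2 A Profiling Routine for Inference

1. Start with GenAI-Perf to see which metric regressed (TTFT? TPOT? throughput?)
   ↓
2. Use torch.profiler to find the op that takes the most time
   ↓
3. Use Nsight Systems to check for GPU idle gaps (CPU bound? communication?)
   ↓
4. Use Nsight Compute to read the SOL (Speed of Light) figures of the most expensive kernels
   ├─ Memory bound? → fuse operators / reduce memory traffic
   └─ Compute bound? → use Tensor Cores / quantize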
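The memory-bound versus compute-bound split in step 4 can be estimated on paper before opening a profiler, using a roofline-style comparison. The sketch below uses round-number peaks for an H100-class GPU and counts only weight traffic for a linear layer; both are simplifying assumptions, and the function names are hypothetical.

```python
def classify_kernel(flops, bytes_moved, peak_flops, peak_bandwidth):
    """Roofline-style check: compare arithmetic intensity with the machine balance."""
    intensity = flops / bytes_moved        # FLOP per byte the kernel performs
    balance = peak_flops / peak_bandwidth  # FLOP per byte the hardware can sustain
    kind = "memory bound" if intensity < balance else "compute bound"
    return kind, intensity, balance

# Assumed round-number peaks (dense BF16 compute, HBM bandwidth); not vendor-exact.
PEAK_FLOPS = 1.0e15  # ~1 PFLOP/s
PEAK_BW = 3.0e12     # ~3 TB/s

def linear_layer(batch_tokens, d_in=4096, d_out=4096, bytes_per_weight=2):
    """FLOPs and bytes for y = xW, counting weight reads only (activations ignored)."""
    flops = 2 * batch_tokens * d_in * d_out
    bytes_moved = d_in * d_out * bytes_per_weight
    return flops, bytes_moved

for tokens in (1, 64, 1024):
    kind, ai, bal = classify_kernel(*linear_layer(tokens), PEAK_FLOPS, PEAK_BW)
    print(f"tokens={tokens:>4}  intensity={ai:7.1f} FLOP/B  balance={bal:.0f}  -> {kind}")
```

Under these assumptions a decode step with few tokens in flight is far below the machine balance, so it is memory bound, while a large prefill crosses over to compute bound. That matches the two branches of step 4.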

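6. The MLPerf Inference Benchmark

6.1 Positioning

MLPerf Inference is the industry-standard benchmark maintained by MLCommons. Rules, datasets, and metrics are unified, so results submitted by any vendor can be compared.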

6.2 LLM Tasks

MLPerf Inference v4.0 added the Llama 2 70B benchmark:

Task	Model	Dataset	SLO
LLM Server	LLaMA-2-70B	OpenORCA	TTFT < 2s, TPOT < 200ms
LLM Offline	LLaMA-2-70B	OpenORCA	No SLO, pure throughput
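To check your own measurements against the server limits in the table, a sketch along these lines can help. The percentile at which the limits apply is left as a parameter, because the p99 default here is my assumption and the MLCommons rules are the authority; the function name and sample data are hypothetical.

```python
def meets_llm_server_slo(ttft_s, tpot_ms, ttft_limit_s=2.0, tpot_limit_ms=200.0, pct=99):
    """True if the chosen percentile of both TTFT and TPOT is under its limit."""
    def pick(values):
        ordered = sorted(values)
        idx = min(len(ordered) - 1, int(len(ordered) * pct / 100))
        return ordered[idx]
    return pick(ttft_s) < ttft_limit_s and pick(tpot_ms) < tpot_limit_ms

# Hypothetical per-request measurements.
ttft_s = [0.4, 0.6, 0.5, 1.9, 0.7]
tpot_ms = [45, 50, 48, 60, 52]
print(meets_llm_server_slo(ttft_s, tpot_ms))
```

A run that passes on averages can still fail here, since one slow request in the tail is enough to push the chosen percentile over the limit.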

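6.3 Reading MLPerf Submissions

NVIDIA, Intel, Google, and others submit results for their own hardware and software stacks. When reading them, check:

Hardware configuration (how many GPUs, which model)
Software stack (which TRT-LLM version, which vLLM version)
Quantization (BF16 / FP8 / INT8)
Offline vs server mode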

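7. Performance Regression Gates

7.1 Setting the Rules

Metric	Regression threshold	Action
TPOT P95	> 5% increase	Block merge
TTFT P95	> 10% increase	Warning + explanation required
Throughput	> 5% drop	Block merge
Peak memory	> 10% increase	Performance analysis report required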

7.2 CI Integration

# .github/workflows/perf.yml
name: Performance Gate
on: [pull_request]
jobs:
  benchmark:
    runs-on: gpu-runner   # label of a self-hosted runner with GPUs
    steps:
      - uses: actions/checkout@v4
      - name: Start vllm
        run: docker run -d ...
      - name: Run benchmark
        run: python bench.py --output result.json
      - name: Compare with baseline
        run: python compare.py --baseline main_baseline.json --new result.json --threshold 0.05
      - name: Upload report
        if: always()   # still upload the report when the comparison step fails
        uses: actions/upload-artifact@v4
        with:
          name: perf-report
          path: report.html

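compare.py returns a non-zero exit code when a regression exceeds the threshold, which fails the job. With this check marked as required in branch protection, the merge is blocked automatically.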
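The source does not show compare.py, so the following is only a sketch of how such a script might encode the rules from section 7.1. The metric key names, the rule table layout, and the two-argument command line are assumptions for illustration (the workflow above passes named flags instead).

```python
import json
import sys

# (metric, direction, threshold, blocks_merge), following the table in 7.1.
RULES = [
    ("tpot_p95_ms",    "higher_is_worse", 0.05, True),
    ("ttft_p95_ms",    "higher_is_worse", 0.10, False),  # warning only
    ("throughput",     "lower_is_worse",  0.05, True),
    ("peak_memory_gb", "higher_is_worse", 0.10, False),  # needs a report
]

def regression(baseline, new, direction):
    """Relative change, signed so that a positive value means 'got worse'."""
    change = (new - baseline) / baseline
    return change if direction == "higher_is_worse" else -change

def evaluate(baseline, new):
    """Return (blocking, warnings) lists of human-readable findings."""
    blocking, warnings = [], []
    for metric, direction, threshold, blocks in RULES:
        worse_by = regression(baseline[metric], new[metric], direction)
        if worse_by > threshold:
            msg = f"{metric}: {worse_by:+.1%} worse (limit {threshold:.0%})"
            (blocking if blocks else warnings).append(msg)
    return blocking, warnings

def main(baseline_path, new_path):
    with open(baseline_path) as f:
        baseline = json.load(f)
    with open(new_path) as f:
        new = json.load(f)
    blocking, warnings = evaluate(baseline, new)
    for msg in warnings:
        print("WARN ", msg)
    for msg in blocking:
        print("BLOCK", msg)
    return 1 if blocking else 0  # non-zero exit fails the CI job

if __name__ == "__main__" and len(sys.argv) == 3:
    sys.exit(main(sys.argv[1], sys.argv[2]))
```

Keeping the rules in a table rather than in code branches makes the thresholds easy to review and change in a pull request of their own.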

7.3 Locating a Regression: git bisect + Performance Data

When performance regresses:

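git bisect start
git bisect bad HEAD              # current commit is bad
git bisect good v0.6.0           # last known good version

# bisect checks out a midpoint commit, runs the benchmark on it,
# and marks it good or bad from the exit code (0 = good, 1 = bad)
git bisect run sh -c "python bench.py --check-perf || exit 1"

# bisect stops at the commit that introduced the regression
git bisect reset                 # return to the original HEAD when done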

Then compare Nsight Systems traces of the good and bad versions to find exactly which op got slower.
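`git bisect run` interprets the script's exit code: 0 marks the commit good, 125 tells bisect to skip a commit that cannot be tested, and other codes from 1 to 127 mark it bad. The source does not show what `--check-perf` does, so the helper below is a hypothetical sketch of the decision it would need to make, using TPOT as the gated metric.

```python
def bisect_exit_code(measured_tpot_ms, good_tpot_ms, tolerance=0.05):
    """Map a benchmark result to the exit code `git bisect run` expects.

    0   -> commit is good
    1   -> commit is bad
    125 -> commit cannot be tested (skip it)
    """
    if measured_tpot_ms is None:  # build or server start failed on this commit
        return 125
    limit = good_tpot_ms * (1 + tolerance)
    return 1 if measured_tpot_ms > limit else 0

print(bisect_exit_code(24.0, 24.5), bisect_exit_code(27.0, 24.5), bisect_exit_code(None, 24.5))
```

The tolerance should be larger than the run-to-run noise of the benchmark; otherwise a commit close to the limit can be misclassified and bisect will converge on the wrong commit.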

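✅ Self-Check Checklist

Metric set: can name the 6 core inference metrics and what each one means
Trade-off intuition: can explain why raising concurrency raises latency
GenAI-Perf hands-on: can write a GenAI-Perf command that benchmarks a vLLM service
Custom load testing: can write an async load-testing script that measures TTFT correctly (time to the first streamed token)
Reproducible config: can list the fields a benchmark config must contain
Profiling routine: can describe the 4-step path from a regressed metric to its root cause
MLPerf reading: can interpret the key fields of an MLPerf LLM benchmark submission
Performance gate rules: can draw up a sensible set of CI performance-regression rules for a team
Regression hunting: can walk through the full git bisect workflow for finding the regressing commit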

📚 References

Tools

GenAI-Perf: https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/client/src/c%2B%2B/perf_analyzer/genai-perf/
Triton Perf Analyzer: https://github.com/triton-inference-server/perf_analyzer
vLLM benchmark scripts: https://github.com/vllm-project/vllm/tree/main/benchmarks

Benchmarks

MLPerf Inference: https://mlcommons.org/benchmarks/inference-datacenter/
MLPerf Inference v5.0 LLM task walkthrough (MLCommons report)

Profiling

PyTorch Profiler Tutorial: https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html
Nsight Systems User Guide: https://docs.nvidia.com/nsight-systems/UserGuide/
