第7章：AI 编译器 - AIInfraGuide

手写 CUDA Kernel 性能极致但开发成本高,AI 编译器通过更高层的抽象在开发效率和性能之间找到平衡。本章覆盖当下最重要的三类:Triton(自定义算子的瑞士军刀)、torch.compile(PyTorch 2.x 的零代码加速)、TVM/XLA(跨硬件部署)。理解它们各自的定位,你才能在工程中选对工具。

📑 目录

1. AI 编译器的全景
2. Triton:Block-level 编程
3. torch.compile:PyTorch 2.x 编译模式
4. TorchInductor 内部
5. TVM 与 XLA 概览
6. 选型决策
自我检验清单
参考资料

1. AI 编译器的全景

工具	抽象层级	性能	开发成本	定位
CUDA / PTX	Thread / Warp	极致	极高	关键算子
Triton	Block(tile)	~95% CUDA	中	自定义融合算子
CUTLASS	Warp / Block	~95% CUDA	中高	GEMM 类算子
torch.compile	Module / Function	1.3-3× eager	极低	全模型加速
TVM / XLA	计算图	灵活	中高	跨硬件部署

🌟 核心趋势:从 CUDA 的”线程级”思维向”块级 / 图级”思维迁移——开发者描述高层意图,编译器自动做底层优化。

2. Triton:Block-level 编程

2.1 Triton vs CUDA 的核心差异

维度	CUDA	Triton
编程粒度	线程	Block(tile)
内存层次	显式管理 Shared / Register	编译器自动
索引	一维(threadIdx)	多维(tl.arange)
Vectorize	手动写 float4	自动
Bank Conflict	手动 padding	自动
Warp 调度	手动	自动

Triton 的承诺:你描述”这一个 block 干什么”,我帮你处理线程级的所有细节。

2.2 第一个 Triton Kernel:Vector Add

import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, output_ptr, n_elements,
               BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    block_start = pid * BLOCK_SIZE
    offsets = block_start + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements

    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    output = x + y
    tl.store(output_ptr + offsets, output, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor):
    output = torch.empty_like(x)
    n_elements = output.numel()
    grid = lambda meta: (triton.cdiv(n_elements, meta['BLOCK_SIZE']),)
    add_kernel[grid](x, y, output, n_elements, BLOCK_SIZE=1024)
    return output

10 行代码,性能与手写 CUDA 几乎一致。

2.3 Fused Softmax(Triton 官方教程)

@triton.jit
def softmax_kernel(input_ptr, output_ptr, input_row_stride, output_row_stride,
                   n_cols, BLOCK_SIZE: tl.constexpr):
    row_idx = tl.program_id(0)
    row_start_ptr = input_ptr + row_idx * input_row_stride
    col_offsets = tl.arange(0, BLOCK_SIZE)
    input_ptrs = row_start_ptr + col_offsets

    mask = col_offsets < n_cols
    row = tl.load(input_ptrs, mask=mask, other=-float('inf'))
    row_max = tl.max(row, axis=0)
    numerator = tl.exp(row - row_max)
    denominator = tl.sum(numerator, axis=0)
    softmax_output = numerator / denominator

    output_row_start = output_ptr + row_idx * output_row_stride
    tl.store(output_row_start + col_offsets, softmax_output, mask=mask)

Triton 自动:

把 row 装入寄存器或 Shared(根据大小)
用 Warp Shuffle 做 tl.max 和 tl.sum
向量化 load/store
选最优 Block Size

性能约为 PyTorch 原生 softmax 的 1.5-2 倍。

2.4 Autotune

Triton 内置自动调参:

@triton.autotune(
    configs=[
        triton.Config({'BLOCK_M': 64, 'BLOCK_N': 64}, num_stages=3, num_warps=4),
        triton.Config({'BLOCK_M': 128, 'BLOCK_N': 64}, num_stages=4, num_warps=4),
        triton.Config({'BLOCK_M': 128, 'BLOCK_N': 128}, num_stages=3, num_warps=8),
    ],
    key=['M', 'N'],
)
@triton.jit
def matmul_kernel(...): ...

第一次运行时尝试所有配置,记录最优参数,后续直接用。

3. torch.compile:PyTorch 2.x 编译模式

3.1 一行代码加速

import torch

@torch.compile
def my_step(x):
    return torch.relu(torch.matmul(x, w) + b)

# 或者整个模型
model = torch.compile(MyModel())

通常 1.3-3× 加速,代码不变。

3.2 内部架构

PyTorch eager code
       ↓
TorchDynamo (Python bytecode → FX Graph)
       ↓
AOTAutograd (前后向图分离)
       ↓
TorchInductor (FX → Triton kernel)
       ↓
GPU

3.3 Graph Break:编译失败的元凶

torch.compile 不是万能的——遇到不支持的操作就”打断”编译,fall back 到 eager 模式,损失大部分收益。

常见 Graph Break:

# ❌ 调用未注册的 Python 函数
def my_func(x):
    return some_external_lib(x)   # graph break

# ❌ 张量与 Python 数据结构互操作
def slow(x):
    if x.sum() > 0:               # graph break(数据相关分支)
        return x * 2
    return x

# ❌ Print / breakpoint
def debug(x):
    print(x.shape)                # graph break
    return x

# ✅ 用 torch.cond 表达数据相关分支
def fast(x):
    return torch.cond(x.sum() > 0, lambda x: x*2, lambda x: x, [x])

3.4 编译模式

torch.compile(model, mode="default")           # 平衡
torch.compile(model, mode="reduce-overhead")   # CUDA Graph 减少 launch
torch.compile(model, mode="max-autotune")      # 充分自动调参,编译慢但运行快

3.5 调试工具

import torch._dynamo

# 看 graph break
torch._dynamo.config.verbose = True

# 看生成的 Triton 代码
TORCH_LOGS="output_code" python my_script.py

# 详细的编译统计
TORCH_LOGS="dynamo,inductor" python my_script.py

4. TorchInductor 内部

TorchInductor 是 torch.compile 的后端代码生成器,把 FX Graph 翻译成 Triton kernel。

4.1 核心优化

优化	描述
Operator Fusion	把 element-wise / reduction 合并
Layout Optimization	选择最优内存布局(NHWC vs NCHW)
Constant Folding	编译期算常量表达式
Pointwise Fusion	多个 element-wise op 合一
Reduction Fusion	reduce + 后续 op 合并
Pattern Matching	识别 LayerNorm/Softmax 等高层模式,用预优化 kernel

4.2 看 Inductor 生成的代码

TORCH_COMPILE_DEBUG=1 python my_script.py
# 在 /tmp/torchinductor_*/ 中找到生成的 .py 文件

会看到一个个 Triton kernel,以及它们的 wrapper Python 代码——非常好的学习材料。

5. TVM 与 XLA 概览

5.1 TVM:跨硬件部署

TVM 的目标:一份模型,跑到任何硬件(NVIDIA GPU / AMD GPU / CPU / 手机 NPU / 嵌入式)。

模型 (PyTorch / ONNX / TF)
    ↓
Relay IR (高层 graph IR)
    ↓ schedule + auto-tuning
TIR (低层张量 IR)
    ↓ codegen
LLVM / CUDA / OpenCL / WebGPU / ...

适用场景:边缘部署、跨硬件移植、需要极致 fine-tuning 时。

5.2 XLA:计算图整体优化

XLA(Accelerated Linear Algebra)是 TensorFlow / JAX 的编译器,核心思想:把整个计算图作为一个单元做优化(算子融合、内存布局、Layout planning)。

import jax
@jax.jit                       # 一行触发 XLA 编译
def my_func(x, w):
    return jax.nn.relu(x @ w)

XLA 在 TPU 上是默认编译器,GPU 上也支持(但 PyTorch 生态用 torch.compile 更多)。

6. 选型决策

我要写一个新的算子 / 融合,目标是性能极致
  ↓
关键算子(Attention / GEMM 类)? → CUDA / CUTLASS
新颖融合 / 自定义量化? → Triton(开发快、性能近 CUDA)

我有现成模型,想自动加速
  ↓
PyTorch 模型? → torch.compile(零改动 1.3-3×)
JAX / TF? → XLA(默认开启)

我要跨硬件部署
  ↓
需要 NVIDIA / AMD / 手机统一? → TVM / Apache MLC
只在 NVIDIA 上跑? → TensorRT / TensorRT-LLM

✅ 自我检验清单

Triton vs CUDA:能解释 Block-level vs Thread-level 的差异,以及为什么 Triton 开发更快
Triton 实战:能写一个 fused softmax 的 Triton kernel,与 PyTorch 对比性能
Autotune:能用 @triton.autotune 调一个 GEMM kernel,理解 Config 和 key 的作用
torch.compile 实战:能用 @torch.compile 包装一个简单模型,对比 eager 性能
Graph Break 排查:训练脚本启用 compile 后变慢,能用 TORCH_LOGS=dynamo 找到 graph break 位置
生成代码:能用 TORCH_COMPILE_DEBUG=1 看到 Inductor 生成的 Triton 代码
mode 选择:能解释 default / reduce-overhead / max-autotune 三种模式的差异
TVM/XLA 定位:能向同事讲清 TVM、XLA、Triton、torch.compile 各自适用的场景

📚 参考资料

Triton 官方教程:https://triton-lang.org/main/getting-started/tutorials/
Triton 论文:https://www.eecs.harvard.edu/~htk/publication/2019-mapl-tillet-kung-cox.pdf
PyTorch 2.0 Compiler Tutorial:https://pytorch.org/get-started/pytorch-2.0/
TorchDynamo Deep Dive:https://pytorch.org/docs/stable/torch.compiler_deepdive.html
TorchInductor Design:https://dev-discuss.pytorch.org/t/torchinductor-a-pytorch-native-compiler-with-define-by-run-ir-and-symbolic-shapes/747
TVM 官网:https://tvm.apache.org/
JAX + XLA 文档:https://jax.readthedocs.io/

搜索