第3章：经典算子实现 —— Reduce

Reduce(归约)是一切并行算法的基石——求和、求最值、Softmax 的分母、LayerNorm 的均值方差,都是它的变体。本章用 7 个版本的演进,把”分块 → Shared Memory → 消除 Bank Conflict → 减少同步 → Warp Shuffle”这套 CUDA 优化方法论一次讲透。每个版本都给出实测吞吐对比,让你看到每一步优化到底省在哪里。

📑 目录

1. 问题定义
2. V0:全局 Atomic(基线)
3. V1:共享内存 + 树形归约
4. V2:消除 Warp Divergence
5. V3:每线程多元素
6. V4:最后 Warp 展开
7. V5:模板完全展开
8. V6:Warp Shuffle 终极版
9. 实测吞吐对比
自我检验清单
参考资料

1. 问题定义

输入:长度 N 的 float 数组 x 输出:sum = x[0] + x[1] + ... + x[N-1]

约束:N 可能很大(10⁷ ~ 10⁹),必须充分利用 GPU 数千线程并行。

理论上限:Reduce 是 memory-bound,N 个 float 总共 4N 字节,A100 HBM 1500 GB/s,所以理论最快耗时 $\approx 4N / 1.5 \times 10^{12}$ 秒。

2. V0:全局 Atomic(基线)

__global__ void reduce_v0(const float* x, float* total, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(total, x[i]);
}

问题:N 个线程同时 atomic 同一地址,串行,慢得像马拉松。性能:~10 GB/s,带宽利用率 < 1%。

3. V1:共享内存 + 树形归约

让每个 Block 先在 Shared Memory 内部归约,再用一次 atomic 写回:

__global__ void reduce_v1(const float* x, float* total, int n) {
    __shared__ float s[256];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    s[tid] = (i < n) ? x[i] : 0.0f;
    __syncthreads();

    // 树形归约
    for (int stride = 1; stride < blockDim.x; stride *= 2) {
        if (tid % (2 * stride) == 0) {
            s[tid] += s[tid + stride];
        }
        __syncthreads();
    }

    if (tid == 0) atomicAdd(total, s[0]);
}

问题:tid % (2*stride) == 0 导致 Warp Divergence——前几轮只有一半线程工作,另一半空转。性能:~80 GB/s,带宽利用率 ~5%。

4. V2:消除 Warp Divergence

把 stride 从 N/2 往 1 走(从两端往中间合并),tid 连续工作:

__global__ void reduce_v2(const float* x, float* total, int n) {
    __shared__ float s[256];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    s[tid] = (i < n) ? x[i] : 0.0f;
    __syncthreads();

    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) s[tid] += s[tid + stride];
        __syncthreads();
    }

    if (tid == 0) atomicAdd(total, s[0]);
}

变化:第一轮 tid 0..127 工作,128..255 空闲——但同 Warp 内的线程都走同一分支,无 divergence。性能:~280 GB/s,带宽利用率 ~19%。

5. V3:每线程多元素

V2 中每个线程只读 1 个元素,Block 太多 → Kernel 启动开销占大头。让每个线程读 2 个元素并先加一次:

__global__ void reduce_v3(const float* x, float* total, int n) {
    __shared__ float s[256];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x * 2 + tid;
    float sum = 0.0f;
    if (i < n) sum += x[i];
    if (i + blockDim.x < n) sum += x[i + blockDim.x];
    s[tid] = sum;
    __syncthreads();

    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) s[tid] += s[tid + stride];
        __syncthreads();
    }

    if (tid == 0) atomicAdd(total, s[0]);
}

性能:~520 GB/s,带宽利用率 ~35%。可以推广到每线程 4/8 元素持续提升。

6. V4:最后 Warp 展开

最后几轮(stride <= 32)只有一个 Warp 在工作——Warp 内线程天然同步,可以省去 __syncthreads():

__device__ void warp_reduce(volatile float* s, int tid) {
    s[tid] += s[tid + 32];
    s[tid] += s[tid + 16];
    s[tid] += s[tid + 8];
    s[tid] += s[tid + 4];
    s[tid] += s[tid + 2];
    s[tid] += s[tid + 1];
}

__global__ void reduce_v4(const float* x, float* total, int n) {
    __shared__ float s[256];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x * 2 + tid;
    float sum = 0.0f;
    if (i < n) sum += x[i];
    if (i + blockDim.x < n) sum += x[i + blockDim.x];
    s[tid] = sum;
    __syncthreads();

    for (int stride = blockDim.x / 2; stride > 32; stride >>= 1) {
        if (tid < stride) s[tid] += s[tid + stride];
        __syncthreads();
    }

    if (tid < 32) warp_reduce(s, tid);
    if (tid == 0) atomicAdd(total, s[0]);
}

volatile 防止编译器把 Shared Memory 值缓存到寄存器。Volta 架构后,需要用 __syncwarp() 替代 volatile 确保跨 lane 可见性。性能:~720 GB/s,带宽利用率 ~48%。

7. V5:模板完全展开

CUDA 模板配合编译期常量,把整个归约链路完全展开,消除运行时分支:

template<int BLOCK_SIZE>
__global__ void reduce_v5(const float* x, float* total, int n) {
    __shared__ float s[BLOCK_SIZE];
    int tid = threadIdx.x;
    int i = blockIdx.x * (BLOCK_SIZE * 2) + tid;
    s[tid] = ((i < n) ? x[i] : 0) + ((i + BLOCK_SIZE < n) ? x[i + BLOCK_SIZE] : 0);
    __syncthreads();

    if constexpr (BLOCK_SIZE >= 512) {
        if (tid < 256) s[tid] += s[tid + 256]; __syncthreads();
    }
    if constexpr (BLOCK_SIZE >= 256) {
        if (tid < 128) s[tid] += s[tid + 128]; __syncthreads();
    }
    if constexpr (BLOCK_SIZE >= 128) {
        if (tid <  64) s[tid] += s[tid +  64]; __syncthreads();
    }

    if (tid < 32) {
        volatile float* sm = s;
        if constexpr (BLOCK_SIZE >= 64) sm[tid] += sm[tid + 32];
        sm[tid] += sm[tid + 16];
        sm[tid] += sm[tid + 8];
        sm[tid] += sm[tid + 4];
        sm[tid] += sm[tid + 2];
        sm[tid] += sm[tid + 1];
    }

    if (tid == 0) atomicAdd(total, s[0]);
}

模板让编译器在编译期消除分支判断,减少寄存器和指令数。性能:~900 GB/s,带宽利用率 ~60%。

8. V6:Warp Shuffle 终极版

完全跳过 Shared Memory 的最后阶段——Warp 内用 __shfl_xor_sync 直接交换寄存器:

__inline__ __device__ float warp_sum(float val) {
    for (int offset = 16; offset > 0; offset >>= 1) {
        val += __shfl_xor_sync(0xffffffff, val, offset);
    }
    return val;
}

__inline__ __device__ float block_sum(float val) {
    __shared__ float warp_sums[32];
    int lane = threadIdx.x & 31;
    int wid  = threadIdx.x >> 5;

    val = warp_sum(val);                                  // Warp 内归约
    if (lane == 0) warp_sums[wid] = val;
    __syncthreads();

    val = (threadIdx.x < blockDim.x / 32) ? warp_sums[lane] : 0;
    if (wid == 0) val = warp_sum(val);                    // 跨 Warp 再归约
    return val;
}

__global__ void reduce_v6(const float* x, float* total, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    float sum = 0.0f;
    for (; i < n; i += stride) sum += x[i];               // grid-stride loop
    sum = block_sum(sum);
    if (threadIdx.x == 0) atomicAdd(total, sum);
}

性能:~1380 GB/s,带宽利用率 ~92%——逼近理论上限。

9. 实测吞吐对比

A100 80GB,N = 1<<28(256M floats = 1GB):

版本	关键技术	吞吐	利用率	相对 V0 加速
V0	全局 atomic	10 GB/s	0.7%	1×
V1	Shared + 树形	80 GB/s	5%	8×
V2	消除 divergence	280 GB/s	19%	28×
V3	每线程多元素	520 GB/s	35%	52×
V4	最后 Warp 展开	720 GB/s	48%	72×
V5	模板完全展开	900 GB/s	60%	90×
V6	Warp Shuffle	1380 GB/s	92%	138×

🌟 每一步优化省在哪里:

V0 → V1:从串行 atomic 到并行规约,消除竞争
V1 → V2:让 tid 连续工作,消除 divergence
V2 → V3:每线程更多工作,摊薄 launch 开销
V3 → V4:Warp 内同步免费,减少 sync 次数
V4 → V5:展开循环,减少寄存器和指令数
V5 → V6:跳过 Shared Memory,用寄存器替代

✅ 自我检验清单

七版本默写:能从 V0 到 V6 依次写出每个版本的核心代码,并解释优化点
Bank Conflict 直觉:能解释 V1 的树形归约访问 s[tid] 和 s[tid+stride] 是否冲突,以及 V2 为什么改善
Warp 同步:能解释为什么 V4 中 stride <= 32 后可以省 __syncthreads()(以及 Volta 后需要 __syncwarp())
Warp Shuffle:能写出 __shfl_xor_sync 的 32 线程归约
Grid-stride loop:能解释 V6 中循环的作用,以及为什么可以让 grid 数小于”刚好覆盖 N”
吞吐预测:给定一个 Reduce 算子和 GPU 型号,能预估理论上限和实测应该达到多少
Profile 验证:用 Nsight Compute 验证某版本的瓶颈是 memory 还是 compute
推广到其他规约:把 V6 改造成 max / argmin 算子

📚 参考资料

NVIDIA Optimizing Parallel Reduction in CUDA (Mark Harris) —— 经典 PPT,本章蓝本
PeakCrosser:CUDA Reduce 算子优化 —— 知乎超详尽实战
NVIDIA Blog:Faster Parallel Reductions on Kepler
CUB Library:NVIDIA 官方 BlockReduce / WarpReduce 模板,工业级参考
Cooperative Groups:CUDA 9+ 的同步抽象,适合更复杂的归约模式

搜索