There is only one core thesis: the essence of every optimization is to trade one resource dimension for another. Only once you can see the full list of resources and their physical carriers can you judge what a trick is really swapping, and when it should not be used at all.

I. Three mental models

Model 1: resources have only three dimensions

Any hardware resource has, in essence, only three consumable or contended dimensions:

Dimension Meaning Bottleneck symptom
Capacity (byte) how much fits at once OOM, KV cache does not fit
Bandwidth (byte/s, FLOP/s) how much is moved/computed per unit time memory-bound, bandwidth saturated
Latency (s) how long one operation waits serial dependency, bubbles, launch stalling the CPU

Latency usually cannot be eliminated directly. It can only be hidden with concurrency, and that is the whole of Model 3.

Model 2: Roofline, where compute meets bandwidth

\[\text{AI} = \frac{\text{FLOPs}}{\text{bytes moved}}, \qquad \text{achievable} = \min\big(\text{Peak FLOPS}, \ \text{AI} \times \text{Peak BW}\big)\]

The ridge point is $\text{AI}^{} = \text{Peak FLOPS} / \text{Peak BW}$. Below it ($\text{AI} < \text{AI}^{}$) you are memory-bound; above it, compute-bound. The H100’s ridge is about $295$ FLOP/byte and the B200’s about $281$, barely moving over the years. Because compute grows faster than bandwidth, the ridge rises rather than falls, pushing more and more kernels into the memory-bound region. This is the quantitative statement of the “memory wall.”

Large square matrices have high AI and are compute-bound; thin-long matrices and small-batch decode have low AI and are memory-bound. LLM decode is inherently memory-bound: each token requires reading all weights once, so $\text{AI} \approx \text{batch size}$, and only a large enough batch enters the compute-bound regime. That is exactly the physical motivation behind continuous batching and speculative decoding.

Model 3: hide latency with concurrency (Little’s Law)

\[\text{concurrency needed} = \text{latency} \times \text{throughput}\]

Multi-warp hides memory latency, multi-stream hides kernel gaps, async copy hides H2D/D2H, overlap hides collective latency, double buffering hides load latency, and pipelining hides cross-stage dependencies. They are all structurally identical, all a matter of “do B while waiting for A.” The key corollary is that hiding latency only pays off when the hidden thing is actually the bottleneck. If communication is only 5% of the time, hiding it into compute saves at most 5%, and it may even net-lose by contending for SMs and registers or by splitting kernels. Always profile the bottleneck’s share first, then decide whether to hide it.

II. Compute and storage units

CUDA Core vs Tensor Core. The former is a scalar ALU; the latter is a matrix MMA unit that computes a whole small tile per instruction, with one-to-two orders of magnitude higher throughput, born only for GEMM/convolution.

Dimension CUDA Core Tensor Core
Primitive scalar FMA matrix MMA (one tile per instruction)
Use LayerNorm / Softmax / RoPE / sampling all large GEMM, attention
Throughput (H100) FP32 ~67 TF BF16 ~990 TF, FP8 ~1979 TF
Interface ordinary CUDA / SIMT wmma / wgmma, cuBLAS / CUTLASS

An SM also contains independent units that can run in parallel with the main compute, and these form the hardware basis for overlap:

Unit Function
SFU transcendentals (exp/sin/rsqrt), without occupying CUDA Cores
Copy Engine / DMA H2D/D2H copy, independent of the SM, so it can copy while computing
TMA (Hopper+) async bulk global↔shared copy, with addresses computed in hardware
TMEM (Blackwell+) dedicated Tensor Core accumulator memory, relieving register pressure
Transformer Engine dynamically manages per-tensor scale for FP8/FP4

The memory hierarchy is the optimization-space pyramid (higher means faster and smaller). Keeping data one level higher for one more reuse is the essence of most kernel optimization.

Layer Location Magnitude Latency
Register file per-thread KB ~0
Shared Memory / L1 per-SM hundreds of KB 20–30 cycles
L2 whole-card shared tens of MB varies
HBM / GDDR whole card tens–hundreds of GB hundreds of cycles, 3–8 TB/s
Host DRAM via PCIe / NVLink-C2C large 64 / 900 GB/s
Remote GPU / NVMe NVLink / IB / storage huge offload last resort

III. The full list of schedulable resources

The left column is what you schedule at the framework / training-stack level; the right is the physical carrier and its bottleneck signal.

Abstract resource Hardware carrier Bottleneck signal
GEMM compute Tensor Core high Tensor-pipeline utilization
scalar/vector compute CUDA Core + SFU elementwise kernels dominate time
VRAM bandwidth HBM↔L2↔L1↔registers DRAM near peak, Tensor idle
VRAM capacity HBM / GDDR OOM, batch limited
on-chip memory SMEM / registers / L2 occupancy suppressed
concurrency / occupancy warp slots, scheduler active warps ≪ max
kernel launch CPU main thread + driver large gaps in the GPU timeline
host compute CPU cores CPU saturated, GPU waits on CPU
H2D / D2H Copy Engine + PCIe / C2C copy and compute not overlapped
intra-node interconnect NVLink / NVSwitch / IF NVLink bandwidth saturated
inter-node interconnect NIC (IB/RoCE) + RDMA cross-node collectives dominate
collective communication NCCL / RCCL / HCCL AllReduce / All2All tail latency
stream parallel / async CUDA Stream + Event single-stream serialization
L2 residency L2 (settable persistence) low L2 hit rate
power / thermal TDP, clock frequency throttled
card partitioning MIG, SR-IOV multi-tenant contention

A few easily overlooked points stand out. The CPU launch thread is a real bottleneck during decode, where each step issues tens to hundreds of tiny kernels, and that is the entire reason CUDA Graph exists. The Copy Engine and the SMs are separate hardware, so copying does not consume compute, which is the physical prerequisite for “copy while computing.” L2 can be actively pinned with cudaAccessPolicyWindow, so it is not a fully automatic black box. NVLink-C2C (900 GB/s coherent memory) is far faster than PCIe (64 GB/s), so offload strategy differs accordingly. SHARP lets the switch do in-network reduction, completing half of an AllReduce’s additions inside the network. Power is a whole-card shared budget, so a fully loaded GEMM triggers downclocking. The register file is a hard constraint on occupancy, since using too much per thread leaves fewer warps resident.

IV. The trick panorama: what each one trades

Arranged by acting layer, from low-level kernel to macro-level serving. The interesting columns are the last two: which bottleneck it removes, and when it is instead harmful.

Trick Layer Target resource When useless / harmful
Warp shuffle Kernel on-chip latency cross-warp still needs SMEM
SMEM tiling Kernel VRAM bandwidth (↑AI) suppresses occupancy
Vectorized memory access Kernel LD/ST issue needs alignment, tail handling
cp.async / TMA Kernel hide memory latency cannot hide if compute is too little
wgmma / tcgen05 Kernel Tensor utilization needs TMA to feed it
Kernel fusion Kernel bandwidth + launch fusing too large causes register overflow
FlashAttention Kernel bandwidth + capacity little benefit for very short sequences
CUDA Graph Runtime launch latency dynamic shapes hard to capture
Persistent kernel Runtime launch + scheduling load imbalance idles SMs
Multi-stream Runtime hide gaps adds complexity without real parallelism
async copy + pinned Runtime hide H2D/D2H no benefit off the critical path
Double buffering Runtime hide load latency consumes extra capacity
Sync elimination Framework sync bubbles wrongly removing breaks correctness
Mixed precision Algorithm compute + bandwidth + capacity norm/gating must keep precision
FP8 / FP4 quantization Algorithm compute + bandwidth + capacity useless if hardware unsupported
INT8 / weight-only Algorithm bandwidth + capacity outliers, precision regression
KV-cache quantization Serving capacity + bandwidth sensitive for long context
Tensor Parallel Parallel single-card compute + capacity AllReduce per layer, needs fast interconnect
Pipeline Parallel Parallel capacity + cross-node bandwidth pipeline bubbles
DP + ZeRO Parallel capacity (sharding) insufficient bandwidth drags training
Expert Parallel Parallel capacity + compute (MoE) if All2All is not the bottleneck, overlap is negative
Sequence / Context Parallel Parallel capacity (ultra-long sequence) introduces extra communication
Compute–comm overlap Parallel hide collectives net loss if communication is not the bottleneck
Continuous batching Serving compute utilization no batch to assemble at very low concurrency
PagedAttention Serving capacity (defragmentation) small indexing overhead
Chunked prefill Serving prefill/decode balance too-fine chunks add scheduling overhead
Prefix / Radix cache Serving compute + bandwidth (reuse) no benefit if prefix is not shared
PD disaggregation Architecture decouple compute vs bandwidth KV transferred across nodes, interconnect must be fast
Speculative decoding Algorithm decode (memory-bound) low acceptance or already compute-bound gives small benefit
MTP / Medusa / EAGLE Algorithm same (built-in draft head) training cost, acceptance depends on task
2:4 structured sparsity Algorithm Tensor compute needs sparsification training, precision may drop

Mapping back to the three models: bandwidth-savers (fusion / Flash / tiling / quantization) raise AI, pushing memory-bound kernels toward compute-bound; latency-hiders (stream / async / overlap / PP) are Little’s Law, valid only if the hidden thing is the bottleneck; launch/sync reducers (CUDA Graph, persistent kernels, sync elimination) remove CPU-side and synchronization serialization; compute-changers (quantization / Tensor Core / sparsity) raise peak FLOPS or lower the byte cost per FLOP; and distributed tricks (TP/PP/DP/EP/SP) trade communication for single-card capacity and compute.

V. Real vendor hardware (mid-2026)

A note on reading specs: compute figures here default to dense values. Vendor spec sheets often default to 2:4 sparsity, roughly double the dense number. NVIDIA’s “FP4 20 PFLOPS” for B200 is the sparse value, and the dense figure is about 9 to 10. Always align conventions when comparing across vendors.

NVIDIA data center

Model Arch BF16 FP8 Memory Bandwidth
H100 SXM Hopper 989 TF 1,979 TF 80 GB HBM3 3.35 TB/s (NVLink4 900)
H200 Hopper 989 1,979 141 GB HBM3e ~4 TB/s
H20 (China SKU) Hopper ~148 ~296 96 GB HBM3 ~4 TB/s
B200 Blackwell 2,250 4,500 192 GB HBM3e ~8 TB/s
B300 Blackwell n/a ~7,000 288 GB HBM3e ~8 TB/s
GB200 Blackwell 2×B200 n/a 384 GB + Grace ~16 TB/s class

The H20 is counterintuitive: its compute is cut to about 15% of the H100, yet its 4 TB/s bandwidth is actually higher, so for memory-bound decode its value is not bad at all, a direct projection of the Roofline. Blackwell’s 5th-gen Tensor Core adds FP4/FP6 microscaling, and the 2nd-gen Transformer Engine manages low-precision scales automatically. The roadmap continues with Vera Rubin NVL144 (2026) at 3.6 EFLOPS FP4 inference and HBM4 at 13 TB/s, then Rubin Ultra NVL576 (2027) at 15 EFLOPS FP4.

NVIDIA workstation / consumer

Model CUDA Cores Memory Bandwidth NVLink TDP
RTX PRO 6000 Blackwell 24,064 96 GB GDDR7 ECC 1.79 TB/s none 600 W
RTX 5090 21,760 32 GB GDDR7 1.79 TB/s none 575 W

Neither supports NVLink, so multi-card setups fall back to PCIe 5.0 (~64 GB/s one-way). For local TP/EP experiments, communication easily becomes the bottleneck, which is exactly what the resource map predicts.


Note: these notes are compiled from sources on the internet and are not my original work. I plan to rewrite them in my own words later.