AI Infra Resource Map

There is only one core thesis: the essence of every optimization is to trade one resource dimension for another. Only once you can see the full list of resources and their physical carriers can you judge what a trick is really swapping, and when it should not be used at all.

I. Three mental models

Model 1: resources have only three dimensions

Any hardware resource has, in essence, only three consumable or contended dimensions:

Dimension	Meaning	Bottleneck symptom
Capacity (byte)	how much fits at once	OOM, KV cache does not fit
Bandwidth (byte/s, FLOP/s)	how much is moved/computed per unit time	memory-bound, bandwidth saturated
Latency (s)	how long one operation waits	serial dependency, bubbles, launch stalling the CPU

Latency usually cannot be eliminated directly. It can only be hidden with concurrency, and that is the whole of Model 3.

Model 2: Roofline, where compute meets bandwidth

\[\text{AI} = \frac{\text{FLOPs}}{\text{bytes moved}}, \qquad \text{achievable} = \min\big(\text{Peak FLOPS}, \ \text{AI} \times \text{Peak BW}\big)\]

The ridge point is $\text{AI}^{} = \text{Peak FLOPS} / \text{Peak BW}$. Below it ($\text{AI} < \text{AI}^{}$) you are memory-bound; above it, compute-bound. The H100’s ridge is about $295$ FLOP/byte and the B200’s about $281$, barely moving over the years. Because compute grows faster than bandwidth, the ridge rises rather than falls, pushing more and more kernels into the memory-bound region. This is the quantitative statement of the “memory wall.”

Large square matrices have high AI and are compute-bound; thin-long matrices and small-batch decode have low AI and are memory-bound. LLM decode is inherently memory-bound: each token requires reading all weights once, so $\text{AI} \approx \text{batch size}$, and only a large enough batch enters the compute-bound regime. That is exactly the physical motivation behind continuous batching and speculative decoding.

Model 3: hide latency with concurrency (Little’s Law)

\[\text{concurrency needed} = \text{latency} \times \text{throughput}\]

Multi-warp hides memory latency, multi-stream hides kernel gaps, async copy hides H2D/D2H, overlap hides collective latency, double buffering hides load latency, and pipelining hides cross-stage dependencies. They are all structurally identical, all a matter of “do B while waiting for A.” The key corollary is that hiding latency only pays off when the hidden thing is actually the bottleneck. If communication is only 5% of the time, hiding it into compute saves at most 5%, and it may even net-lose by contending for SMs and registers or by splitting kernels. Always profile the bottleneck’s share first, then decide whether to hide it.

II. Compute and storage units

CUDA Core vs Tensor Core. The former is a scalar ALU; the latter is a matrix MMA unit that computes a whole small tile per instruction, with one-to-two orders of magnitude higher throughput, born only for GEMM/convolution.

Dimension	CUDA Core	Tensor Core
Primitive	scalar FMA	matrix MMA (one tile per instruction)
Use	LayerNorm / Softmax / RoPE / sampling	all large GEMM, attention
Throughput (H100)	FP32 ~67 TF	BF16 ~990 TF, FP8 ~1979 TF
Interface	ordinary CUDA / SIMT	wmma / wgmma, cuBLAS / CUTLASS

An SM also contains independent units that can run in parallel with the main compute, and these form the hardware basis for overlap:

Unit	Function
SFU	transcendentals (exp/sin/rsqrt), without occupying CUDA Cores
Copy Engine / DMA	H2D/D2H copy, independent of the SM, so it can copy while computing
TMA (Hopper+)	async bulk global↔shared copy, with addresses computed in hardware
TMEM (Blackwell+)	dedicated Tensor Core accumulator memory, relieving register pressure
Transformer Engine	dynamically manages per-tensor scale for FP8/FP4

The memory hierarchy is the optimization-space pyramid (higher means faster and smaller). Keeping data one level higher for one more reuse is the essence of most kernel optimization.

Layer	Location	Magnitude	Latency
Register file	per-thread	KB	~0
Shared Memory / L1	per-SM	hundreds of KB	20–30 cycles
L2	whole-card shared	tens of MB	varies
HBM / GDDR	whole card	tens–hundreds of GB	hundreds of cycles, 3–8 TB/s
Host DRAM	via PCIe / NVLink-C2C	large	64 / 900 GB/s
Remote GPU / NVMe	NVLink / IB / storage	huge	offload last resort

III. The full list of schedulable resources

The left column is what you schedule at the framework / training-stack level; the right is the physical carrier and its bottleneck signal.

Abstract resource	Hardware carrier	Bottleneck signal
GEMM compute	Tensor Core	high Tensor-pipeline utilization
scalar/vector compute	CUDA Core + SFU	elementwise kernels dominate time
VRAM bandwidth	HBM↔L2↔L1↔registers	DRAM near peak, Tensor idle
VRAM capacity	HBM / GDDR	OOM, batch limited
on-chip memory	SMEM / registers / L2	occupancy suppressed
concurrency / occupancy	warp slots, scheduler	active warps ≪ max
kernel launch	CPU main thread + driver	large gaps in the GPU timeline
host compute	CPU cores	CPU saturated, GPU waits on CPU
H2D / D2H	Copy Engine + PCIe / C2C	copy and compute not overlapped
intra-node interconnect	NVLink / NVSwitch / IF	NVLink bandwidth saturated
inter-node interconnect	NIC (IB/RoCE) + RDMA	cross-node collectives dominate
collective communication	NCCL / RCCL / HCCL	AllReduce / All2All tail latency
stream parallel / async	CUDA Stream + Event	single-stream serialization
L2 residency	L2 (settable persistence)	low L2 hit rate
power / thermal	TDP, clock	frequency throttled
card partitioning	MIG, SR-IOV	multi-tenant contention

A few easily overlooked points stand out. The CPU launch thread is a real bottleneck during decode, where each step issues tens to hundreds of tiny kernels, and that is the entire reason CUDA Graph exists. The Copy Engine and the SMs are separate hardware, so copying does not consume compute, which is the physical prerequisite for “copy while computing.” L2 can be actively pinned with cudaAccessPolicyWindow, so it is not a fully automatic black box. NVLink-C2C (900 GB/s coherent memory) is far faster than PCIe (64 GB/s), so offload strategy differs accordingly. SHARP lets the switch do in-network reduction, completing half of an AllReduce’s additions inside the network. Power is a whole-card shared budget, so a fully loaded GEMM triggers downclocking. The register file is a hard constraint on occupancy, since using too much per thread leaves fewer warps resident.

IV. The trick panorama: what each one trades

Arranged by acting layer, from low-level kernel to macro-level serving. The interesting columns are the last two: which bottleneck it removes, and when it is instead harmful.

Trick	Layer	Target resource	When useless / harmful
Warp shuffle	Kernel	on-chip latency	cross-warp still needs SMEM
SMEM tiling	Kernel	VRAM bandwidth (↑AI)	suppresses occupancy
Vectorized memory access	Kernel	LD/ST issue	needs alignment, tail handling
cp.async / TMA	Kernel	hide memory latency	cannot hide if compute is too little
wgmma / tcgen05	Kernel	Tensor utilization	needs TMA to feed it
Kernel fusion	Kernel	bandwidth + launch	fusing too large causes register overflow
FlashAttention	Kernel	bandwidth + capacity	little benefit for very short sequences
CUDA Graph	Runtime	launch latency	dynamic shapes hard to capture
Persistent kernel	Runtime	launch + scheduling	load imbalance idles SMs
Multi-stream	Runtime	hide gaps	adds complexity without real parallelism
async copy + pinned	Runtime	hide H2D/D2H	no benefit off the critical path
Double buffering	Runtime	hide load latency	consumes extra capacity
Sync elimination	Framework	sync bubbles	wrongly removing breaks correctness
Mixed precision	Algorithm	compute + bandwidth + capacity	norm/gating must keep precision
FP8 / FP4 quantization	Algorithm	compute + bandwidth + capacity	useless if hardware unsupported
INT8 / weight-only	Algorithm	bandwidth + capacity	outliers, precision regression
KV-cache quantization	Serving	capacity + bandwidth	sensitive for long context
Tensor Parallel	Parallel	single-card compute + capacity	AllReduce per layer, needs fast interconnect
Pipeline Parallel	Parallel	capacity + cross-node bandwidth	pipeline bubbles
DP + ZeRO	Parallel	capacity (sharding)	insufficient bandwidth drags training
Expert Parallel	Parallel	capacity + compute (MoE)	if All2All is not the bottleneck, overlap is negative
Sequence / Context Parallel	Parallel	capacity (ultra-long sequence)	introduces extra communication
Compute–comm overlap	Parallel	hide collectives	net loss if communication is not the bottleneck
Continuous batching	Serving	compute utilization	no batch to assemble at very low concurrency
PagedAttention	Serving	capacity (defragmentation)	small indexing overhead
Chunked prefill	Serving	prefill/decode balance	too-fine chunks add scheduling overhead
Prefix / Radix cache	Serving	compute + bandwidth (reuse)	no benefit if prefix is not shared
PD disaggregation	Architecture	decouple compute vs bandwidth	KV transferred across nodes, interconnect must be fast
Speculative decoding	Algorithm	decode (memory-bound)	low acceptance or already compute-bound gives small benefit
MTP / Medusa / EAGLE	Algorithm	same (built-in draft head)	training cost, acceptance depends on task
2:4 structured sparsity	Algorithm	Tensor compute	needs sparsification training, precision may drop

Mapping back to the three models: bandwidth-savers (fusion / Flash / tiling / quantization) raise AI, pushing memory-bound kernels toward compute-bound; latency-hiders (stream / async / overlap / PP) are Little’s Law, valid only if the hidden thing is the bottleneck; launch/sync reducers (CUDA Graph, persistent kernels, sync elimination) remove CPU-side and synchronization serialization; compute-changers (quantization / Tensor Core / sparsity) raise peak FLOPS or lower the byte cost per FLOP; and distributed tricks (TP/PP/DP/EP/SP) trade communication for single-card capacity and compute.

V. Real vendor hardware (mid-2026)

A note on reading specs: compute figures here default to dense values. Vendor spec sheets often default to 2:4 sparsity, roughly double the dense number. NVIDIA’s “FP4 20 PFLOPS” for B200 is the sparse value, and the dense figure is about 9 to 10. Always align conventions when comparing across vendors.

NVIDIA data center

Model	Arch	BF16	FP8	Memory	Bandwidth
H100 SXM	Hopper	989 TF	1,979 TF	80 GB HBM3	3.35 TB/s (NVLink4 900)
H200	Hopper	989	1,979	141 GB HBM3e	~4 TB/s
H20 (China SKU)	Hopper	~148	~296	96 GB HBM3	~4 TB/s
B200	Blackwell	2,250	4,500	192 GB HBM3e	~8 TB/s
B300	Blackwell	n/a	~7,000	288 GB HBM3e	~8 TB/s
GB200	Blackwell	2×B200	n/a	384 GB + Grace	~16 TB/s class

The H20 is counterintuitive: its compute is cut to about 15% of the H100, yet its 4 TB/s bandwidth is actually higher, so for memory-bound decode its value is not bad at all, a direct projection of the Roofline. Blackwell’s 5th-gen Tensor Core adds FP4/FP6 microscaling, and the 2nd-gen Transformer Engine manages low-precision scales automatically. The roadmap continues with Vera Rubin NVL144 (2026) at 3.6 EFLOPS FP4 inference and HBM4 at 13 TB/s, then Rubin Ultra NVL576 (2027) at 15 EFLOPS FP4.

NVIDIA workstation / consumer

Model	CUDA Cores	Memory	Bandwidth	NVLink	TDP
RTX PRO 6000 Blackwell	24,064	96 GB GDDR7 ECC	1.79 TB/s	none	600 W
RTX 5090	21,760	32 GB GDDR7	1.79 TB/s	none	575 W

Neither supports NVLink, so multi-card setups fall back to PCIe 5.0 (~64 GB/s one-way). For local TP/EP experiments, communication easily becomes the bottleneck, which is exactly what the resource map predicts.

Note: these notes are compiled from sources on the internet and are not my original work. I plan to rewrite them in my own words later.

AI Infra Resource Map

Every trick is a trade

AI Infra Resource Map

Every trick is a trade

I. Three mental models

Model 1: resources have only three dimensions

Model 2: Roofline, where compute meets bandwidth

Model 3: hide latency with concurrency (Little’s Law)

II. Compute and storage units

III. The full list of schedulable resources

IV. The trick panorama: what each one trades

V. Real vendor hardware (mid-2026)