Victoria Zhang | Tutorials

Vision and Language

A double helix

Posted on June 28, 2026

Why redraw the map [Read More]

Tags: tutorials vision multimodal deep-learning notes

Gated Attention

One switch after softmax

Posted on June 27, 2026

A simple modification ignored for eight years [Read More]

Tags: tutorials llm attention architecture notes

There is only one core thesis: the essence of every optimization is to trade one resource dimension for another. Only once you can see the full list of resources and their physical carriers can you judge what a trick is really swapping, and when it should not be used at... [Read More]

Tags: tutorials llm gpu infra performance notes

Why Decode Is Slow

Shape, not the KV cache

Posted on June 25, 2026

A widely misunderstood bottleneck [Read More]

Tags: tutorials llm inference roofline performance notes

When Is SFT Done?

Reading the signals

Posted on June 24, 2026

In LLM post-training, SFT (supervised fine-tuning) usually handles cold-start, behavior formatting, and task-protocol learning. But longer SFT is not always better, and lower loss is not always better. Judging whether SFT is “done” hinges not on how well it fits the training set, but on whether continued supervised imitation still... [Read More]

Tags: tutorials llm post-training sft rl notes

Attention Is Not Matmul Bound

Where the time really goes

Posted on June 23, 2026

If someone asks which part of attention is slowest, the obvious guess is the matrix multiplication. It has by far the most floating-point operations, so surely it must dominate the runtime. On a modern GPU, that intuition is wrong. The two matmuls in attention are simply not where the time... [Read More]

Tags: tutorials llm gpu attention performance notes

Mixture of Experts

Experts are computation paths

Posted on June 22, 2026

The first time people meet Mixture of Experts, they instinctively assume that, because the components are called “experts,” each one must specialize in a different task. The reading is natural, but in a real MoE system it does not hold. [Read More]

Tags: tutorials llm moe architecture notes

Speculative Decoding

Same output, much faster

Posted on June 21, 2026

Speculative decoding (also called speculative sampling) is one of the most elegant tricks in the LLM inference-acceleration toolbox. In one line, it lets a large model generate text substantially faster without changing its output distribution. [Read More]

Tags: tutorials llm inference decoding notes

The Temperature Knob

How LLM sampling really works

Posted on June 20, 2026

The last layer of a language model does not emit a probability. It emits logits, a vector of real numbers. A probability distribution only appears after we push those logits through a softmax. The entire secret of temperature lives in the step just before the softmax: we divide every logit... [Read More]

Tags: tutorials llm decoding notes
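
The teaser for The Temperature Knob describes a pipeline of logits, then a division, then a softmax. As a rough illustration, here is a minimal sketch of that pipeline, assuming the elided divisor is the usual temperature parameter T. The function name `softmax_with_temperature` and the sample logits are hypothetical and are not taken from the post.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw logits into probabilities, dividing by a temperature first.

    Assumption: the divisor is a positive scalar temperature T. The name and
    signature are illustrative, not the post's own code.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # The step just before the softmax: divide every logit by T.
    scaled = [x / temperature for x in logits]
    # Subtract the maximum for numerical stability; the result is unchanged.
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for a three-token vocabulary.
logits = [2.0, 1.0, 0.0]
cold = softmax_with_temperature(logits, temperature=0.5)
neutral = softmax_with_temperature(logits, temperature=1.0)
hot = softmax_with_temperature(logits, temperature=2.0)

for name, probs in [("T=0.5", cold), ("T=1.0", neutral), ("T=2.0", hot)]:
    print(name, [round(p, 3) for p in probs])
```

With these sample logits, a temperature below 1 concentrates probability on the largest logit and a temperature above 1 flattens the distribution, while the ranking of the tokens stays the same.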