Note: there are many kernel experts in the world. I am not one of them. This post probably contains many errors, but the premise is directionally correct. If you are one of those kernel experts, I would love to learn what I missed!
I joined Meta’s MTIA PM team a few months ago to work on 1P accelerators (think Meta’s version of the Google TPU program) knowing essentially nothing about silicon beyond the snippets I gleaned from Semianalysis gossip. Within days of joining, I was inundated with complaints about kernels. What the hell was a kernel? I honestly didn’t really know.
For those not in the know, a kernel is how an operator is implemented on a piece of hardware. For example, PyTorch has 841 ATenOps. These are mathematical operations like max_pool2d_with_indices and _flash_attention_backward, which you will recognize as the core components of a neural network. These functions need to be mapped to actual operations on the hardware like loading two numbers into registers, adding them lane-wise, and reading the result.
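As a cartoon of what that mapping means, here is an element-wise add written as an explicit load/add/store loop. This is illustrative only, not how ATen is actually implemented:

```python
def add_kernel(a, b):
    # conceptual "kernel" for element-wise add: each iteration is
    # roughly load a[i], load b[i], add, store out[i]
    out = [0.0] * len(a)
    for i in range(len(a)):
        out[i] = a[i] + b[i]  # load, load, add, store
    return out
```

A real kernel does the same thing, but with hardware-specific instructions that process many lanes at once.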
CPUs (and, to some extent, mature GPU stacks) are flexible and can generally handle arbitrary mathematical operations without effort from the user. Hardware idiosyncrasies are abstracted away, which makes programming easy but can be slow. Because training and running neural networks is compute- and memory-intensive, it often makes sense to write hardware- and problem-specific kernels for accelerators, trading generalization for speed.
I wanted to understand what it took to write a new kernel from first principles, so I picked two functions to play with.
- [memory bound] Rotary Positional Embeddings (RoPE) help transformers understand sequences by rotating the input according to its position. This function involves 4 atomic PyTorch ops (2 mul, 1 add, 1 rotate), so a naive implementation would be memory-bound as values shuttle in and out of memory for each operation. A kernel team focused on speeding this up would want to fuse the operations into a single set of hardware instructions so intermediate values never leave registers, and, to a lesser extent, optimize the math for the hardware.
- [compute bound] GeLU + linear, gelu(W @ x + b), is the function used to compute a single FFN layer following the attention block in a transformer. A naive implementation requires 3 atomic PyTorch ops (1 matmul, 1 element-wise add, 1 element-wise GeLU), but unlike RoPE, the matmul is very compute-intensive, so rather than focus on fusion to avoid memory bottlenecks, an optimized kernel would first optimize the math for the hardware.
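The naive RoPE baseline is just the 4 ops written out in PyTorch; something like this sketch (variable names are mine):

```python
import torch

def rotate_half(x):
    # the "rotate" op: [x1, x2] -> [-x2, x1] along the last dimension
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def rope_naive(x, cos, sin):
    # 4 separate ops (2 mul, 1 add, 1 rotate); each one reads and
    # writes a full tensor, so the bottleneck is memory traffic
    return x * cos + rotate_half(x) * sin
```

Every intermediate (`x * cos`, `rotate_half(x)`, the second product) is materialized as its own tensor, which is exactly the memory shuffling fusion eliminates.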
Note: I selected these functions because they are commonly used in machine learning and PyTorch does not already offer these fused kernels. However, in practice RoPE is already fused into FlashAttention and GeLU + linear into GEMM epilogue, so these are just toy examples and not representative of some crazy production speedup.
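The GeLU + linear baseline is even shorter, just the 3 ops (a sketch under the same caveats):

```python
import torch
import torch.nn.functional as F

def ffn_naive(x, W, b):
    # 3 separate ops: matmul (compute-heavy), element-wise add, element-wise GeLU
    return F.gelu(W @ x + b)
```

Here the matmul dominates the runtime, so shaving the add and GeLU via fusion barely moves the needle.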
I asked Claude Code to write me several implementations (kernels, in our new vocabulary) of both of these functions to run on my Apple M1 (I know I mentioned that CPUs are flexible and don’t usually need custom kernels, but the type of silicon is irrelevant to this learning exercise).
- Pure Python. This represents the worst-case scenario involving looping through all the vector operations. This is a very bad way to write a kernel but I wanted to understand the floor.
- Vectorized NumPy. Hand-coded vectorized operations.
- Naive PyTorch. This involves only a few lines of code and calling 3 or 4 separate PyTorch ops before returning the result.
- Unfused C++. Each naive PyTorch operation is rewritten in optimized C++ to maximally leverage my Mac M1 CPU (giving up some of the generalization PyTorch provides).
- Fused C++. All 3 or 4 operations are fused to eliminate memory shuffling in addition to maximally leveraging CPU.
- Fused Metal. All 3 or 4 operations are fused and sent to the M1 onboard GPU.
- MPS/ANE. Unfused naive PyTorch operations are sent to ultra-optimized Apple kernels.
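To make the floor concrete, the pure-Python version of RoPE looks roughly like this (my sketch, not Claude’s exact code):

```python
def rope_pure_python(x, cos, sin):
    # worst case: every multiply and add is an interpreted Python
    # bytecode, with no vectorization and no fusion
    seq, dim = len(x), len(x[0])
    half = dim // 2
    out = [[0.0] * dim for _ in range(seq)]
    for i in range(seq):
        for j in range(dim):
            # rotate-half: [x1, x2] -> [-x2, x1]
            rot = -x[i][j + half] if j < half else x[i][j - half]
            out[i][j] = x[i][j] * cos[i][j] + rot * sin[i][j]
    return out
```

The same math in NumPy collapses the inner loops into a handful of vectorized array operations, which is where the first big jump in performance comes from.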
Somewhat surprisingly, Claude Code one-shotted this entire task once I realized what I was looking for. My C++ and Metal knowledge is too weak to independently review the code, but I walked through each function with Claude’s help and am 90% confident that they are correct.
Beyond seeing how the kernels were structured and how accessible writing a “good enough” kernel actually is, the results were quite interesting.
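All the timings below come from simple wall-clock measurement; something along these lines (a sketch, not the exact harness):

```python
import time

def bench_ms(fn, *args, iters=100, warmup=5):
    # warm up first (caches, allocator, lazy init), then
    # report mean latency per call in milliseconds
    for _ in range(warmup):
        fn(*args)
    t0 = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    return (time.perf_counter() - t0) / iters * 1e3
```

One caveat for GPU backends like MPS: dispatch is asynchronous, so a real harness also has to synchronize before stopping the clock.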
RoPE
We can see that RoPE is obviously memory bound.
When we move from stupidly slow compute in pure Python to less silly single-threaded NumPy to reasonable PyTorch eager (our 4 unfused kernels), we see big improvements. But we actually lose performance when we shift the backend from CPU to Apple’s highly optimized MPS GPU, because the cost of data transfer outweighs the MPS speedup for RoPE’s operations.
We really cook when we start fusing kernels, eliminating memory shuffling entirely. The unfused C++ kernel performs similarly to PyTorch eager (slightly faster, thanks to skipping framework overhead and safety checks), but the fused kernel runs 5x faster because it attacks the root cause of our slowdown: memory bandwidth.
If we really want to eke out every last bit of performance, we can fuse and move to the GPU by writing our own native Metal kernel and get the best of both worlds.

GeLU + linear
GeLU + linear is a different story: fusing the kernels yields only a 2% speedup. For a matrix this large, the vast majority of the time is spent on matmul, so all that matters is making that one operation faster.
However, this is a tale of two cities. Our custom Metal GPU implementation is actually far slower than the optimized CPU kernels, a good example of how the right kernel done the wrong way can hurt you. But if we push optimized unfused PyTorch kernels to the GPU using MPS (Apple’s kernels) or the ANE (Apple Neural Engine, a separate chip), we improve our performance nearly 2x, and then 2x again.
But why is ANE so much faster than MPS…

Pushing the limits
What we’ve seen so far is vanilla kernel optimization. Fusing fusible kernels and pushing compute to the GPU are the obvious moves. Can we do more if we give up the generalization required by PyTorch? The answer is definitely yes.
What if we really wanted to start cooking? I had a few ideas.
- FP16 only. Drop support for other quantization schemes. This is actually what the ANE does, and why it is 2x faster than MPS running at fp32.
- Inference only. Drop support for gradients.
- Fixed shapes. Drop support for arbitrary model architectures.
- Moar?
It turns out if we implement all of these, we can squeeze out another doubling or so in performance, getting us from our best RoPE score of 1.03 ms to 0.59 ms and our best GeLU + linear of 14.5 ms (already FP16 in ANE; apples-to-apples quant is 32.2 ms) to 12.3 ms.
Can we scale this limit pushing across tasks?
Yes??? Generalization comes at the expense of speed. In a world where each kernel must be hand-crafted, performance and effort must be carefully balanced. In a world where AI can build and verify kernels at scale, it should be possible to automate the creation of a library of kernels customized for every hardware and model type.
And in fact this is happening right now. Kernel creation and optimization is a strongly verifiable task (you know exactly what you are looking for from the slower version and can easily time the faster version), so models and model-enabled kernel authors are making fast progress on this.
- Standard Kernel just raised $20M to do a slightly fancier version of this blog post.
- A guy named Jaber adapted Andrej Karpathy’s autoresearch to optimize kernels and pushed the gains shown in this blog post even further.
- One of the fastest kernels for nvfp4_gemm on an open leaderboard was “written” by someone with zero GPU programming experience.
I have no doubt that in the near future, extremely efficient and specific kernels will be generated on the fly enabling a new set of efficiency gains in model training and inference. You could even think about pushing every operation into a single kernel like this guy who 10x’ed throughput for Qwen3 0.6B on a 5090 through extreme kernel fusion.
Epilogue: Is this actually necessary?
You may be reading this and thinking that basic kernel fusion is the most obvious thing in the world. Wouldn’t the PyTorch team have built something like this into the platform?
The answer is yes. PyTorch eager mode runs operations in the order they are received. torch.compile looks at the whole graph, identifies when kernels can be fused or optimized, and compiles the whole thing together. For my toy examples, torch.compile nearly matched the performance of my vibe-coded kernels, but I suspect a purpose-built agentic AI tool would outperform it over the long term. ByteDance published early results indicating their CUDA Agent already crushes torch.compile.
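Using it is a one-liner on top of the naive code; for the RoPE example:

```python
import torch

def rope_naive(x, cos, sin):
    # 4 eager ops, each a separate kernel dispatch
    x1, x2 = x.chunk(2, dim=-1)
    rotated = torch.cat((-x2, x1), dim=-1)
    return x * cos + rotated * sin

# torch.compile traces the graph and fuses the element-wise ops
rope_compiled = torch.compile(rope_naive)
```

The first call pays a compilation cost; subsequent calls with the same shapes run the fused code.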
