Naïve browser GPU pipelines waste 92%+ of their time on dispatch overhead — sending tasks one by one instead of all at once. Kernel fusion eliminates that. I shipped it for WebGPU and measured the result across 92 unique devices and 7 GPU vendors.
No jargon. Here's the intuition.
Eager browser GPU pipelines (TF.js, hand-written WebGPU loops) send one small task to the GPU, wait for it to finish, send the next one. For a 64-token generation with 4 layers, that's 1,024 separate round-trips. Each round-trip takes longer than the actual math. Compilers like TVM, XLA, and Burn fuse some of this — but rarely the whole graph, and the WebGPU backend is the least-tuned target across the board.
Pack the entire computation — all tokens, all layers, all operations — into a single GPU instruction. The GPU loops internally. No round-trips. No waiting. Same math, same result.
Two preprints, then 92 unique devices ran it across 7 GPU vendors. Median speedups (the typical experience): Apple Silicon 71×, NVIDIA 56×, ARM Mali 55×, Intel 43×, AMD 40×, Qualcomm Adreno 20×. Tested across Chrome, Firefox, Safari on macOS, Windows, Linux, Android, and iOS.
213,000 tokens per second peak on a phone. 15,000 average across all mobile devices. No Python, no CUDA, no cloud. A browser tab outperforms PyTorch on the same hardware.
Not theoretical. Here's what's different tomorrow.
ChatGPT in your browser types 5 words per second. You assume your laptop isn't powerful enough.
Your GPU was idle 92%+ of the time. The waiting is eliminated. Same GPU, 20-71× faster on typical devices, peaking 226-402× on best.
Running AI locally means installing Python, CUDA, PyTorch, downloading model weights, debugging driver conflicts.
Open a browser tab. That's it. The AI runs on the GPU you already have, at near-native speed.
Every AI feature costs $2-4/hour in cloud GPU. 100K users = $50K/month in servers.
The user's GPU does the work. Server cost: $0. The browser IS the infrastructure.
A student in rural India can't afford a GPU cluster or cloud API credits to learn AI.
A $300 phone with Chrome can run transformer inference locally. No internet needed after model download.
Browser-based AI assistants could respond 20-71× faster on typical devices (peaks 226× on Apple Silicon, 402× on NVIDIA). Not by buying better hardware — by fixing how the software talks to your GPU.
Run AI models live in the classroom. Every student's laptop becomes an AI workstation. No lab, no cloud account, no IT department.
Ship a live demo of your model as a URL. Reviewers run it in their browser instead of fighting with your Docker container.
Add AI features to your web app without GPU servers. Your users' devices do the compute. Scale to millions at zero marginal cost.
Healthcare, legal, finance — the AI runs on the device. Data never leaves the laptop. Compliance by architecture.
3 billion people have a WebGPU-capable device. Browser-native AI makes intelligence a capability your device already has, not a service you rent.
The papers measured on 2 machines. The real world has hundreds of different GPUs. Here's why that matters.
An Apple M2 Pro laptop and a Tesla T4 server. Both are fast desktop/server GPUs with efficient command dispatching. The paper measured 159–720× speedup on those machines.
Phones, tablets, Chromebooks, gaming rigs, office laptops — across 7 GPU vendors and 4 operating systems. Devices with GPUs that were never designed for compute workloads. These GPUs have much worse dispatch overhead than the ones in the paper.
Kernel fusion eliminates dispatch overhead. The mechanism holds across every vendor we've tested. Median speedup (the typical experience): NVIDIA 56×, Apple Silicon 71×, ARM Mali 55×, Intel 43×, AMD 40×, Qualcomm Adreno 20×. Peak observed: 402× on NVIDIA, 226× on Apple Silicon, 103× on Qualcomm. Smaller in relative terms on phones because absolute throughput is lower — but still an order of magnitude faster than unfused.
Kernel fusion isn't new. Here's the lineage and where this work fits.
TVM and MLC-LLM compile Python models through graph-level fusion down to WebGPU kernels. WebLLM uses this stack to ship LLMs in the browser. This work fuses the entire autoregressive decoder by hand in WGSL — no Python toolchain — and ablates the limit of single-dispatch fusion.
Rust-side tensor-op stream fusion targeting CUDA, Metal, ROCm, Vulkan, and WGPU. Reported up to 78× on the WGPU backend for elementwise operators. This work targets WebGPU directly without the Rust compilation step.
WebGPU execution provider with graph-level node fusion (Conv+Add and similar patterns). Production-grade, used by Transformers.js. Fuses operator patterns; this work fuses the full transformer block.
Cross-vendor WebGPU dispatch overhead study across 4 GPU vendors, 3 backends, 3 browsers. Found 53% throughput improvement from fusion on Vulkan and no benefit on CUDA. Per-dispatch API overhead 24–71µs. This work covers 92 devices and 7 vendors, and fuses the entire decoding loop into one dispatch.
First paper to explicitly target WebGPU for browser DL inference, via JIT kernel generation. Optimizes individual kernels but does not fuse them. Establishes the WebGPU compute path; this work uses fusion on top.
Run the benchmarks on your hardware, right now.
Every result from every device is public. No cherry-picking. Verify any claim yourself.