GPU frameworks waste 92% or more of their time on overhead — sending tasks one by one instead of all at once. I proved it, fixed it, and 592 people confirmed it on their own devices.
No jargon. Here's the intuition.
GPU frameworks (PyTorch, JAX, WebLLM) send one small task to the GPU, wait for it to finish, then send the next one. For a 64-token generation with 4 layers and a few GPU operations per layer, that's 1,024 separate round-trips. Each round-trip takes longer than the actual math.
Pack the entire computation (all tokens, all layers, all operations) into a single GPU dispatch. The GPU loops internally. No round-trips. No waiting. Same math, same result.
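Concretely, here's what the two patterns look like against the standard WebGPU API, in TypeScript. A minimal sketch, not code from PyTorch, JAX, or WebLLM: the `Step` shape and the two function names are illustrative, real frameworks batch some work per submission, and the fused WGSL kernel (the one that loops over tokens and layers internally) is assumed to already exist.

```ts
// Pattern A: one dispatch per operation, per layer, per token.
// Every submit is a CPU-to-GPU round-trip; the GPU sits idle while the CPU
// prepares and synchronizes the next one.
interface Step {
  pipeline: GPUComputePipeline;
  bindGroup: GPUBindGroup;
  workgroups: number;
}

async function generatePerOp(device: GPUDevice, tokens: number, layerSteps: Step[][]) {
  for (let t = 0; t < tokens; t++) {
    for (const layer of layerSteps) {
      for (const step of layer) {                  // e.g. attention, MLP, norm...
        const encoder = device.createCommandEncoder();
        const pass = encoder.beginComputePass();
        pass.setPipeline(step.pipeline);
        pass.setBindGroup(0, step.bindGroup);
        pass.dispatchWorkgroups(step.workgroups);
        pass.end();
        device.queue.submit([encoder.finish()]);
        await device.queue.onSubmittedWorkDone();  // wait before issuing the next op
      }
    }
  }
}

// Pattern B: one submit for the entire generation.
// The WGSL kernel loops over tokens, layers, and operations internally,
// so the CPU pays for exactly one round-trip.
function generateFused(device: GPUDevice, fused: Step) {
  const encoder = device.createCommandEncoder();
  const pass = encoder.beginComputePass();
  pass.setPipeline(fused.pipeline);
  pass.setBindGroup(0, fused.bindGroup);
  pass.dispatchWorkgroups(fused.workgroups);
  pass.end();
  device.queue.submit([encoder.finish()]);         // one round-trip total
}
```

The only structural difference is where the loop lives: on the CPU, paying a round-trip per iteration, or inside the shader, paying one round-trip total.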
Two preprints, then 592 people ran it on their own hardware. Apple Silicon averages 2,865×. Android phones average 623×. NVIDIA desktops average 79×. Tested across Chrome, Firefox, Safari on macOS, Windows, Linux, Android, and iOS.
213,000 tokens per second peak on a phone. 15,000 average across all mobile devices. No Python, no CUDA, no cloud. A browser tab outperforms PyTorch on the same hardware.
Not theoretical. Here's what's different tomorrow.
ChatGPT in your browser types 5 words per second. You assume your laptop isn't powerful enough.
Your GPU was idle 92%+ of the time. The waiting is eliminated. Same GPU, 56–2,865× faster.
Running AI locally means installing Python, CUDA, and PyTorch, downloading model weights, and debugging driver conflicts.
Open a browser tab. That's it. The AI runs on the GPU you already have, at near-native speed.
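For the skeptical, the whole "setup" is a few lines of standard WebGPU feature detection. A minimal sketch: `initLocalAI` and the `runModel` placeholder are illustrative names, not part of any released library, but the browser calls (navigator.gpu.requestAdapter, requestDevice) are the real API.

```ts
// Detect WebGPU, get a device, and hand it to whatever inference code you load.
async function initLocalAI(): Promise<GPUDevice | null> {
  if (!("gpu" in navigator)) return null;      // browser has no WebGPU support
  const adapter = await navigator.gpu.requestAdapter();
  if (!adapter) return null;                   // no usable GPU exposed to the page
  return adapter.requestDevice();              // this device drives every compute pass
}

initLocalAI().then((device) => {
  if (device) {
    // runModel(device);  // placeholder: load weights and start local inference
    console.log("Local GPU ready.");
  } else {
    console.log("WebGPU unavailable; fall back to WASM or a remote API.");
  }
});
```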
Every AI feature costs $2–4 per hour in cloud GPU time. 100K users = $50K/month in servers.
The user's GPU does the work. Server cost: $0. The browser IS the infrastructure.
A student in rural India can't afford a GPU cluster or cloud API credits to learn AI.
A $300 phone with Chrome can run transformer inference locally. No internet needed after model download.
Browser-based AI assistants could respond 56–2,865× faster. Not by buying better hardware, but by fixing how the software talks to your GPU.
Run AI models live in the classroom. Every student's laptop becomes an AI workstation. No lab, no cloud account, no IT department.
Ship a live demo of your model as a URL. Reviewers run it in their browser instead of fighting with your Docker container.
Add AI features to your web app without GPU servers. Your users' devices do the compute. Scale to millions at zero marginal cost.
Healthcare, legal, finance — the AI runs on the device. Data never leaves the laptop. Compliance by architecture.
3 billion people have a WebGPU-capable device. Browser-native AI makes intelligence a capability your device already has, not a service you rent.
The preprints measured results on just two machines. The real world has hundreds of different GPUs. Here's why that matters.
An Apple M2 Pro laptop and a Tesla T4 server. Both are fast, desktop- and server-class GPUs with efficient command dispatching. The preprints measured 159–720× speedups on those machines.
Phones, tablets, Chromebooks, gaming rigs, office laptops. Devices with GPUs that were never designed for compute workloads. These GPUs have much higher dispatch overhead than the ones in the preprints.
Kernel fusion eliminates dispatch overhead, so the worse a device is at dispatching, the more it benefits. NVIDIA desktop GPUs (good dispatching) see ~79×. Apple Silicon laptops see ~2,865×. Android phones (Qualcomm Adreno) see ~623×. This is not a bug; it's the point. The devices that need fusion most benefit from it most.
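A back-of-the-envelope model makes the pattern obvious. Assume each operation pays a fixed dispatch overhead d plus compute time m; the numbers below are made up for illustration, not measurements from the preprints or the community results.

```ts
// Per-op launching pays the dispatch overhead N times; a fused kernel pays it once.
function speedup(N: number, d: number, m: number): number {
  const perOp = N * (d + m);   // N round-trips, each with dispatch overhead d
  const fused = d + N * m;     // one round-trip, then the GPU loops internally
  return perOp / fused;
}

// Same compute cost m, different dispatch cost d:
console.log(speedup(1024, 0.05, 0.01).toFixed(0)); // efficient dispatcher: ~6x
console.log(speedup(1024, 1.0, 0.01).toFixed(0));  // sluggish dispatcher: ~92x
```

Same math, 20× slower dispatching, roughly 15× more benefit. That's the shape of the NVIDIA/Adreno/Apple spread, not its exact values.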
Run the benchmarks on your hardware, right now.
Every result from every device is public. No cherry-picking. Verify any claim yourself.
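If you want a sanity check before trusting anyone's dashboard, the dispatch-overhead claim itself is easy to probe. The sketch below isolates submission overhead only (it doesn't fuse anything into one kernel) and assumes you've already built a trivial compute pipeline and bind group; the gap between the two timings is the overhead that fusion removes.

```ts
// Time n tiny submits (one dispatch each) against one submit carrying the same
// n dispatches. Uses only standard WebGPU calls plus performance.now().
async function measureDispatchOverhead(device: GPUDevice, pipeline: GPUComputePipeline,
                                       bindGroup: GPUBindGroup, n = 256) {
  // Many round-trips: submit and wait once per dispatch.
  let t0 = performance.now();
  for (let i = 0; i < n; i++) {
    const enc = device.createCommandEncoder();
    const pass = enc.beginComputePass();
    pass.setPipeline(pipeline);
    pass.setBindGroup(0, bindGroup);
    pass.dispatchWorkgroups(1);
    pass.end();
    device.queue.submit([enc.finish()]);
    await device.queue.onSubmittedWorkDone();
  }
  const manyRoundTrips = performance.now() - t0;

  // One round-trip: the same n dispatches recorded into a single submission.
  t0 = performance.now();
  const enc = device.createCommandEncoder();
  const pass = enc.beginComputePass();
  pass.setPipeline(pipeline);
  pass.setBindGroup(0, bindGroup);
  for (let i = 0; i < n; i++) pass.dispatchWorkgroups(1);
  pass.end();
  device.queue.submit([enc.finish()]);
  await device.queue.onSubmittedWorkDone();
  const oneRoundTrip = performance.now() - t0;

  console.log(`${n} separate submits: ${manyRoundTrips.toFixed(1)} ms`);
  console.log(`1 combined submit:   ${oneRoundTrip.toFixed(1)} ms`);
}
```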