Two research preprints

Your GPU is fast enough.
The software isn't.

Naïve browser GPU pipelines waste 92%+ of their time on dispatch overhead — sending tasks one by one instead of all at once. Kernel fusion eliminates that. I shipped it for WebGPU and measured the result across 92 unique devices and 7 GPU vendors.

71×
Apple Silicon median 92 unique devices, 7 vendors
20×
Qualcomm Adreno median Android phones
0
things to install just open Chrome

How it works (simply)

No jargon. Here's the intuition.

1

The problem: 92%+ overhead on naïve dispatch

Eager browser GPU pipelines (TF.js, hand-written WebGPU loops) send one small task to the GPU, wait for it to finish, send the next one. For a 64-token generation with 4 layers, that's 1,024 separate round-trips. Each round-trip takes longer than the actual math. Compilers like TVM, XLA, and Burn fuse some of this — but rarely the whole graph, and the WebGPU backend is the least-tuned target across the board.

2

The fix: one dispatch

Pack the entire computation — all tokens, all layers, all operations — into a single GPU instruction. The GPU loops internally. No round-trips. No waiting. Same math, same result.

3

The proof: 92 unique devices, 7 vendors

Two preprints, then 92 unique devices ran it across 7 GPU vendors. Median speedups (the typical experience): Apple Silicon 71×, NVIDIA 56×, ARM Mali 55×, Intel 43×, AMD 40×, Qualcomm Adreno 20×. Tested across Chrome, Firefox, Safari on macOS, Windows, Linux, Android, and iOS.

4

The result: AI on a phone

213,000 tokens per second peak on a phone. 15,000 average across all mobile devices. No Python, no CUDA, no cloud. A browser tab outperforms PyTorch on the same hardware.

What actually changes

Not theoretical. Here's what's different tomorrow.

Before

ChatGPT in your browser types 5 words per second. You assume your laptop isn't powerful enough.

After

Your GPU was idle 92%+ of the time. The waiting is eliminated. Same GPU, 20-71× faster on typical devices, peaking 226-402× on best.

Before

Running AI locally means installing Python, CUDA, PyTorch, downloading model weights, debugging driver conflicts.

After

Open a browser tab. That's it. The AI runs on the GPU you already have, at near-native speed.

Before

Every AI feature costs $2-4/hour in cloud GPU. 100K users = $50K/month in servers.

After

The user's GPU does the work. Server cost: $0. The browser IS the infrastructure.

Before

A student in rural India can't afford a GPU cluster or cloud API credits to learn AI.

After

A $300 phone with Chrome can run transformer inference locally. No internet needed after model download.

Who this is for

💬

Anyone who uses AI chatbots

Browser-based AI assistants could respond 20-71× faster on typical devices (peaks 226× on Apple Silicon, 402× on NVIDIA). Not by buying better hardware — by fixing how the software talks to your GPU.

🏫

Teachers and students

Run AI models live in the classroom. Every student's laptop becomes an AI workstation. No lab, no cloud account, no IT department.

🔬

AI researchers

Ship a live demo of your model as a URL. Reviewers run it in their browser instead of fighting with your Docker container.

🚀

Startups building AI products

Add AI features to your web app without GPU servers. Your users' devices do the compute. Scale to millions at zero marginal cost.

🔒

Privacy-sensitive industries

Healthcare, legal, finance — the AI runs on the device. Data never leaves the laptop. Compliance by architecture.

🌍

The developing world

3 billion people have a WebGPU-capable device. Browser-native AI makes intelligence a capability your device already has, not a service you rent.

Why are the real-world numbers bigger than the paper?

The papers measured on 2 machines. The real world has hundreds of different GPUs. Here's why that matters.

The paper tested on 2 devices

An Apple M2 Pro laptop and a Tesla T4 server. Both are fast desktop/server GPUs with efficient command dispatching. The paper measured 159–720× speedup on those machines.

92 unique devices ran it on everything else

Phones, tablets, Chromebooks, gaming rigs, office laptops — across 7 GPU vendors and 4 operating systems. Devices with GPUs that were never designed for compute workloads. These GPUs have much worse dispatch overhead than the ones in the paper.

The mechanism is hardware-agnostic

Kernel fusion eliminates dispatch overhead. The mechanism holds across every vendor we've tested. Median speedup (the typical experience): NVIDIA 56×, Apple Silicon 71×, ARM Mali 55×, Intel 43×, AMD 40×, Qualcomm Adreno 20×. Peak observed: 402× on NVIDIA, 226× on Apple Silicon, 103× on Qualcomm. Smaller in relative terms on phones because absolute throughput is lower — but still an order of magnitude faster than unfused.

Related work

Kernel fusion isn't new. Here's the lineage and where this work fits.

Apache TVM / MLC-LLM / WebLLM (since ~2020; WebLLM arXiv:2412.15803, Dec 2024)

TVM and MLC-LLM compile Python models through graph-level fusion down to WebGPU kernels. WebLLM uses this stack to ship LLMs in the browser. This work fuses the entire autoregressive decoder by hand in WGSL — no Python toolchain — and ablates the limit of single-dispatch fusion.

Burn / CubeCL (Tracel AI, tensor-stream fusion, Mar 2024)

Rust-side tensor-op stream fusion targeting CUDA, Metal, ROCm, Vulkan, and WGPU. Reported up to 78× on the WGPU backend for elementwise operators. This work targets WebGPU directly without the Rust compilation step.

ONNX Runtime Web (WebGPU EP) (Microsoft, v1.17, Feb 2024)

WebGPU execution provider with graph-level node fusion (Conv+Add and similar patterns). Production-grade, used by Transformers.js. Fuses operator patterns; this work fuses the full transformer block.

Maczan (arXiv:2604.02344, Feb 2026) — closest peer

Cross-vendor WebGPU dispatch overhead study across 4 GPU vendors, 3 backends, 3 browsers. Found 53% throughput improvement from fusion on Vulkan and no benefit on CUDA. Per-dispatch API overhead 24–71µs. This work covers 92 devices and 7 vendors, and fuses the entire decoding loop into one dispatch.

nnJIT (Jia et al, arXiv:2309.08978, Sept 2023)

First paper to explicitly target WebGPU for browser DL inference, via JIT kernel generation. Optimizes individual kernels but does not fuse them. Establishes the WebGPU compute path; this work uses fusion on top.

See it for yourself

Run the benchmarks on your hardware, right now.

Every result from every device is public. No cherry-picking. Verify any claim yourself.

Built by Ahmet Baris Gunaydin · Independent Researcher · All computation runs locally on your GPU