Two research preprints

Your GPU is fast enough.
The software isn't.

GPU frameworks waste 92% or more of their time on overhead — sending tasks one by one instead of all at once. I proved it, fixed it, and 592 people confirmed it on their own devices.

2,865×

Apple Silicon average

592 real-world runs

623×

Android phones average

Qualcomm Adreno

0 things to install

Just open Chrome

How it works (simply)

No jargon. Here's the intuition.

The problem: 92%+ overhead

GPU frameworks (PyTorch, JAX, WebLLM) send one small task to the GPU, wait for it to finish, then send the next one. For a 64-token generation with 4 layers, that's 1,024 separate round-trips, or 4 per layer for every token. Each round-trip takes longer than the math it carries.
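To make that count concrete, here is a small Python sketch of the arithmetic. The 64 tokens and 4 layers come from the text above; `ops_per_layer=4` is an assumption, chosen because it reproduces the 1,024 figure, and the microsecond timings are made-up placeholders, not measurements.

```python
def count_dispatches(tokens: int, layers: int, ops_per_layer: int) -> int:
    """Round-trips needed when every operation is sent as its own dispatch."""
    return tokens * layers * ops_per_layer


def overhead_fraction(overhead_us: float, math_us: float) -> float:
    """Share of one round-trip spent on overhead rather than on math."""
    return overhead_us / (overhead_us + math_us)


# ops_per_layer=4 is an assumption: it makes 64 * 4 * 4 match the quoted 1,024.
dispatches = count_dispatches(tokens=64, layers=4, ops_per_layer=4)
print(dispatches)  # 1024

# Hypothetical timings (NOT measurements): 92 us of overhead per 8 us of math.
print(overhead_fraction(overhead_us=92.0, math_us=8.0))  # 0.92
```

Any per-dispatch overhead that outweighs the math by about 11 to 1 or more lands at 92% or higher.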

The fix: one dispatch

Pack the entire computation — all tokens, all layers, all operations — into a single GPU instruction. The GPU loops internally. No round-trips. No waiting. Same math, same result.
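Here is a toy Python model of the difference. It is not the actual WebGPU implementation: `ToyDevice`, `run_unfused`, and `run_fused` are hypothetical names, the per-operation math is a placeholder, and the 4 operations per layer are assumed. The only thing the sketch tracks is how many times the host submits work.

```python
from typing import Callable, List

Op = Callable[[float], float]


class ToyDevice:
    """Stand-in for a GPU queue that counts host submissions (round-trips)."""

    def __init__(self) -> None:
        self.submissions = 0

    def dispatch(self, kernel: Op, x: float) -> float:
        self.submissions += 1  # one host-to-device round-trip
        return kernel(x)


def placeholder_op(x: float) -> float:
    # Placeholder for one real operation (matmul, norm, ...); cheap on purpose.
    return 0.5 * x + 1.0


def run_unfused(device: ToyDevice, tokens: int, ops: List[Op], x: float) -> float:
    # One dispatch per operation, per token: the host waits after every step.
    for _ in range(tokens):
        for op in ops:
            x = device.dispatch(op, x)
    return x


def run_fused(device: ToyDevice, tokens: int, ops: List[Op], x: float) -> float:
    # The whole token/layer/operation loop lives inside a single kernel.
    def fused_kernel(v: float) -> float:
        for _ in range(tokens):
            for op in ops:
                v = op(v)
        return v

    return device.dispatch(fused_kernel, x)


ops = [placeholder_op] * (4 * 4)  # 4 layers x 4 ops per layer (assumed)
slow, fast = ToyDevice(), ToyDevice()
a = run_unfused(slow, 64, ops, 0.0)
b = run_fused(fast, 64, ops, 0.0)
print(slow.submissions, fast.submissions)  # 1024 1
print(a == b)  # True
```

Both paths apply the same operations in the same order, so the result is identical. Only the number of submissions changes.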

The proof: 592 devices

Two preprints, then 592 people ran it on their own hardware. Apple Silicon averages 2,865×. Android phones average 623×. NVIDIA desktops average 79×. Tested across Chrome, Firefox, and Safari on macOS, Windows, Linux, Android, and iOS.

The result: AI on a phone

213,000 tokens per second peak on a phone. 15,000 average across all mobile devices. No Python, no CUDA, no cloud. A browser tab outperforms PyTorch on the same hardware.

What actually changes

Not theoretical. Here's what's different tomorrow.

Before

An AI model running in your browser types 5 words per second. You assume your laptop isn't powerful enough.

→

After

Your GPU was idle 92%+ of the time. The waiting is eliminated. Same GPU, 56–2,865× faster.

Before

Running AI locally means installing Python, CUDA, PyTorch, downloading model weights, debugging driver conflicts.

→

After

Open a browser tab. That's it. The AI runs on the GPU you already have, at near-native speed.

Before

Every AI feature costs $2–4/hour in cloud GPU. 100K users = $50K/month in servers.

→

After

The user's GPU does the work. Server cost: $0. The browser IS the infrastructure.

Before

A student in rural India can't afford a GPU cluster or cloud API credits to learn AI.

→

After

A $300 phone with Chrome can run transformer inference locally. No internet needed after model download.

Who this is for

💬

Anyone who uses AI chatbots

Browser-based AI assistants could respond 56–2,865× faster. Not by buying better hardware, but by fixing how the software talks to your GPU.

🏫

Teachers and students

Run AI models live in the classroom. Every student's laptop becomes an AI workstation. No lab, no cloud account, no IT department.

🔬

AI researchers

Ship a live demo of your model as a URL. Reviewers run it in their browser instead of fighting with your Docker container.

🚀

Startups building AI products

Add AI features to your web app without GPU servers. Your users' devices do the compute. Scale to millions at zero marginal cost.

🔒

Privacy-sensitive industries

Healthcare, legal, finance — the AI runs on the device. Data never leaves the laptop. Compliance by architecture.

🌍

The developing world

3 billion people have a WebGPU-capable device. Browser-native AI makes intelligence a capability your device already has, not a service you rent.

Why are the real-world numbers bigger than in the papers?

The papers measured on two machines. The real world has hundreds of different GPUs. Here's why that matters.

The papers tested on two devices

An Apple M2 Pro laptop and a Tesla T4 server. Both are fast laptop- or server-class GPUs with efficient command dispatching. The papers measured 159–720× speedup on those machines.

592 people ran it on everything else

Phones, tablets, Chromebooks, gaming rigs, office laptops. Devices with GPUs that were never designed for compute workloads. These GPUs have much worse dispatch overhead than the ones in the paper.

Worse overhead = bigger gain from fusion

Kernel fusion eliminates dispatch overhead. So the worse a device is at dispatching, the more it benefits. NVIDIA desktop GPUs (good dispatching) see ~79×. Apple Silicon laptops see ~2,865×. Android phones (Qualcomm Adreno) see ~623×. This is not a bug; it's the point. The devices that need fusion most benefit from it most.
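A back-of-the-envelope model makes the relationship explicit. It is a simplification assumed for illustration, not the papers' methodology: every dispatch pays a fixed overhead, fusion cuts the number of dispatches to one, and the math takes the same time either way. The function names and timings are placeholders.

```python
def fused_speedup(dispatches: int, overhead_us: float, math_us: float) -> float:
    """Speedup from collapsing `dispatches` round-trips into a single one.

    Assumes a fixed overhead per dispatch and unchanged total math time.
    """
    unfused = dispatches * overhead_us + math_us
    fused = overhead_us + math_us
    return unfused / fused


def removed_time_share(speedup: float) -> float:
    """Share of the original running time that a given speedup removed."""
    return 1.0 - 1.0 / speedup


# Hypothetical devices: same math time, different per-dispatch overhead.
print(fused_speedup(1024, overhead_us=10.0, math_us=100.0))   # 94.0
print(fused_speedup(1024, overhead_us=100.0, math_us=100.0))  # 512.5

# Reported averages from the text, converted to the share of time removed.
for label, speedup in [("NVIDIA", 79), ("Adreno", 623), ("Apple", 2865)]:
    print(f"{label}: {removed_time_share(speedup):.2%}")
```

Under this model, ten times more overhead per dispatch turns a 94× gain into a 512.5× gain for the same math. Read the other way, a 12.5× speedup already corresponds to 92% of the time being removed, and the reported averages correspond to roughly 98.7%, 99.8%, and 99.97%.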

See it for yourself

Run the benchmarks on your hardware, right now.

Transformer Benchmark

GPU Compute Benchmarks

All Results (Open Data)

Every result from every device is public. No cherry-picking. Verify any claim yourself.