GPU frameworks waste 92% or more of their time.
I fixed it.

Kernel fusion eliminates per-dispatch overhead by packing an entire computation into a single GPU dispatch. 592 real-world devices tested — up to 2,865× on Apple Silicon, 623× on phones. Zero installation. Any browser.

The insight

How frameworks work

dispatch step 1 → wait
dispatch step 2 → wait

... × 1,500 steps = 22,500 round-trips

92%+ of time = waiting, not computing

Kernel fusion

dispatch once

→ GPU loops internally

1,500 steps in 1 round-trip

~100% of time = computing
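The arithmetic behind those two cards can be sketched with a simple timing model. The per-dispatch and per-step costs below are illustrative assumptions, not measurements; only the 1,500-step count comes from the figures above.

```typescript
// Illustrative timing model (costs are assumptions, not measurements).
// Unfused: every step pays a CPU→GPU round-trip before its tiny compute.
// Fused: one round-trip, then the GPU loops over all steps internally.
const steps = 1_500;
const roundTripMs = 0.2;  // assumed per-dispatch overhead
const computeMs = 0.01;   // assumed GPU work per step

const unfusedMs = steps * (roundTripMs + computeMs); // pay overhead every step
const fusedMs = roundTripMs + steps * computeMs;     // pay overhead once

const overheadShare = (steps * roundTripMs) / unfusedMs;
const speedup = unfusedMs / fusedMs;

console.log(`overhead share: ${(overheadShare * 100).toFixed(1)}%`);
console.log(`speedup from fusion: ${speedup.toFixed(1)}×`);
```

Even with these modest assumed numbers, overhead dominates the unfused loop and fusion recovers an order-of-magnitude speedup; the measured ratios above are far larger because real dispatch overhead dwarfs real per-step compute.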

Live Demos
Research
New

Browser-to-Browser Distributed Evolution

WebRTC P2P Genome Exchange for Island-Model Optimization

+11.5%
fitness improvement (4 islands, p=0.015)
+14.6%
cross-platform (RTX 3090, N=5)
0
install required

Browser tabs form evolutionary islands, exchanging elite genomes directly via WebRTC data channels. A 113-line signaling relay brokers the handshake; all genome data flows peer-to-peer. Private rooms by default. Validated on Rastrigin (N=30), across Apple Metal and NVIDIA Vulkan.
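The migration step at the heart of the island model can be sketched in a few lines. This is an in-memory illustration only: the real system moves genomes over WebRTC data channels, and the type names, ring topology, and replace-worst policy here are assumptions for the sketch.

```typescript
// In-memory sketch of island-model elite migration.
// (The real system exchanges these genomes over WebRTC data channels.)
type Genome = { genes: number[]; fitness: number };

// Lower fitness is better (e.g. Rastrigin minimization).
const elite = (island: Genome[]): Genome =>
  island.reduce((a, b) => (a.fitness <= b.fitness ? a : b));

// Each island sends its elite to the next island in a ring,
// where it replaces that island's worst genome.
function migrate(islands: Genome[][]): void {
  const elites = islands.map(elite); // snapshot before any replacement
  islands.forEach((island, i) => {
    const incoming = elites[(i - 1 + islands.length) % islands.length];
    const worst = island.reduce((a, b) => (a.fitness >= b.fitness ? a : b));
    island[island.indexOf(worst)] = { ...incoming };
  });
}
```

In the browser setting, each `island` lives in its own tab and the `incoming` genome arrives as a data-channel message rather than an array lookup.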

Published

Single-Kernel Fusion for Sequential Fitness Evaluation

via WebGPU Compute Shaders

720×
CUDA over PyTorch (same T4)
159×
WebGPU over PyTorch (same M2)
4
GPU APIs confirmed

Fusing sequential fitness evaluations into single GPU dispatches eliminates per-step kernel launch overhead. Proven across CUDA, WebGPU, JAX/XLA, and Triton on two hardware platforms.
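A CPU sketch makes the fusion pattern concrete: every sequential evaluation step runs inside one function call, the way the fused kernel runs them inside one dispatch. Rastrigin matches the validation function named above; the finite-difference descent loop and its parameters are illustrative assumptions, not the paper's algorithm.

```typescript
// Rastrigin: the benchmark function validated above (global minimum 0 at origin).
const rastrigin = (x: number[]): number =>
  10 * x.length +
  x.reduce((s, xi) => s + xi * xi - 10 * Math.cos(2 * Math.PI * xi), 0);

// "Fused" shape: one entry point loops over every step internally,
// instead of returning to the host after each fitness evaluation.
function fusedEvolve(x: number[], steps: number, lr = 0.001): number {
  for (let t = 0; t < steps; t++) {
    // finite-difference descent step, evaluated entirely in-loop
    const g = x.map((xi, i) => {
      const xp = [...x];
      xp[i] = xi + 1e-4;
      return (rastrigin(xp) - rastrigin(x)) / 1e-4;
    });
    x = x.map((xi, i) => xi - lr * g[i]);
  }
  return rastrigin(x); // only the final fitness crosses back to the host
}
```

In the unfused version, each `rastrigin` call would be its own kernel launch; fused, the host sees a single dispatch no matter how many steps run.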

Published

Single-Kernel Fusion for Autoregressive Transformer Decoding

via WebGPU Compute Shaders

458×
parallel kernel vs unfused (D=256)
161×
over PyTorch MPS (D=32)
16K
tokens/sec in the browser

Browser LLM engines dispatch 1,024 separate GPU kernels per generation. We fuse everything into one dispatch. Single-threaded: 6.6-13.5×. Parallel kernel (64 threads + shared memory): 66-458×. Beats PyTorch MPS by 7.5-161× at all tested sizes up to D=256. 16,410 tok/s at D=32.
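The dispatch-count gap falls out of simple counting. The per-token kernel breakdown below is an illustrative assumption (not the measured engines' exact launch pattern); it is chosen only to show how a four-digit dispatch count arises per generation.

```typescript
// Counting GPU dispatches for autoregressive decoding.
// kernelsPerToken is an assumed breakdown (attention, MLP, norms, logits, ...).
const tokens = 128;
const kernelsPerToken = 8;

const unfusedDispatches = tokens * kernelsPerToken; // one launch per op per token
const fusedDispatches = 1; // every token, layer, and op inside one kernel

console.log(unfusedDispatches, fusedDispatches);
```

Each of those unfused launches pays the round-trip overhead from the diagram above, which is why the fused kernel's advantage grows with sequence length.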

Real World Results
592 Runs · 8 GPU Vendors

Real-world speedups are larger than the paper reports

Since publishing, 592 people have run the benchmarks on their own devices.

2,865×
Apple Silicon avg
peak 79,021×
623×
Qualcomm Adreno avg
peak 13,541×
79×
NVIDIA avg
peak 402×
56×
ARM Mali avg
peak 120×

Mobile transformer runs: 36 confirmed across iOS Safari and Android Chrome. Peak: 213,000 tokens/sec on a phone. Average: 15,000 tokens/sec.

Browser coverage: Chrome (347), Firefox (69), Safari (62). macOS, Windows, Linux, Android, iOS. No installation on any of them.

🚀

Why not run it on your device?

30 seconds. No installation. Your result joins the live dataset above.

Every result is public

We don't cherry-pick results. Every benchmark run from every device is published in a searchable, sortable, downloadable dataset. GPU name, score, browser, OS, timestamp — all of it. No data is hidden. Verify any claim yourself.

Browse all 592 results →
SDK
npm package

@wgpu-fusion/core

One import. One dispatch. All tokens, all layers, all operations fused into a single GPU kernel.

npm install @wgpu-fusion/core

// 3 lines to benchmark your GPU
import { FusedTransformer } from '@wgpu-fusion/core'
const model = await FusedTransformer.create({ dModel: 128, nHeads: 2, nLayers: 4 })
const stats = await model.benchmark({ runs: 10 })

2,865×
Apple Silicon avg
623×
Android phones avg
0
install required

TypeScript. f32 and f16 precision. Int4 quantization. Single-thread and parallel (64-thread shared memory) modes. Works in Chrome, Firefox, Safari — any WebGPU-capable browser.

Applied

The single-kernel fusion pattern generalizes. Two projects apply it to different domains — LLM inference and radiobiology Monte Carlo — and one runs the companion benchmark suite.

Ahmet Baris Gunaydin

Independent Researcher