Kernel fusion eliminates per-dispatch overhead by packing an entire computation into a single GPU dispatch. 592 real-world devices tested — up to 2,865× on Apple Silicon, 623× on phones. Zero installation. Any browser.
How frameworks work
dispatch step 1 → wait → dispatch step 2 → wait... × 1,500 steps = 22,500 round-trips
92%+ of time = waiting, not computing
Kernel fusion
dispatch once → GPU loops internally: 1,500 steps in 1 round-trip
100% of time = computing
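A back-of-the-envelope model shows why the round-trips dominate. The per-step kernel count (15, implied by 22,500 ÷ 1,500) is taken from the numbers above; the latency and compute figures are illustrative assumptions, not measurements:

```typescript
// Why per-dispatch overhead dominates an unfused training loop.
// Assumed (illustrative) numbers: ~0.2 ms host↔GPU round-trip per launch,
// ~0.02 ms of actual compute per kernel.
const steps = 1_500;
const kernelsPerStep = 15;        // 22,500 total round-trips / 1,500 steps
const roundTripMs = 0.2;          // assumption
const computePerKernelMs = 0.02;  // assumption

const roundTrips = steps * kernelsPerStep;          // 22,500
const waitingMs = roundTrips * roundTripMs;         // 4,500 ms spent waiting
const computeMs = roundTrips * computePerKernelMs;  // 450 ms spent computing
const waitingFraction = waitingMs / (waitingMs + computeMs);

console.log(roundTrips);                  // 22500
console.log(waitingFraction.toFixed(2));  // "0.91" — ~90%+ of wall time is waiting

// Fused: one dispatch, one round-trip; only the compute remains.
const fusedTotalMs = roundTripMs + computeMs;
console.log(fusedTotalMs);                // 450.2 ms
```

Under these assumptions the fused path is bounded by compute alone, which is where the order-of-magnitude speedups come from.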
Watch 50 neural networks learn to play Flappy Bird in real-time. GPU evaluates 4,096 birds per dispatch via kernel fusion. Open in multiple tabs to connect via WebRTC and evolve together.
4,096-population evolutionary optimization on a 2,000-dimensional multimodal landscape. Measures raw GPU throughput of the fused evolutionary kernel.
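For reference, the standard Rastrigin function — the textbook definition of the kind of multimodal landscape used here, not code from the benchmark itself — is:

```typescript
// Rastrigin: f(x) = 10n + Σ(x_i² − 10·cos(2πx_i)).
// Global minimum f(0) = 0; the cosine term creates a dense grid of local
// minima, which is what makes it a standard stress test for evolutionary search.
function rastrigin(x: Float32Array): number {
  let sum = 10 * x.length;
  for (const xi of x) {
    sum += xi * xi - 10 * Math.cos(2 * Math.PI * xi);
  }
  return sum;
}

console.log(rastrigin(new Float32Array(2000)));  // 0 — the global optimum
```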
Fused attention + FFN + LayerNorm in a single GPU dispatch. Benchmarks unfused, fused, parallel, and f16 variants across model dimensions.
WebRTC P2P Genome Exchange for Island-Model Optimization
Browser tabs form evolutionary islands, exchanging elite genomes directly via WebRTC data channels. A 113-line signaling relay brokers the handshake; all genome data flows peer-to-peer. Private rooms by default. Validated on Rastrigin (N=30), across Apple Metal and NVIDIA Vulkan.
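A minimal sketch of the migration step (hypothetical names; the real exchange runs over an RTCDataChannel, elided here so the logic is self-contained):

```typescript
interface Genome { genes: Float32Array; fitness: number }

// Pick the k best genomes to broadcast to peer islands (lower fitness = better).
function selectElites(population: Genome[], k: number): Genome[] {
  return [...population].sort((a, b) => a.fitness - b.fitness).slice(0, k);
}

// Merge arriving immigrants: replace the worst locals, keep population size fixed.
function acceptImmigrants(population: Genome[], immigrants: Genome[]): Genome[] {
  const merged = [...population].sort((a, b) => a.fitness - b.fitness);
  merged.splice(merged.length - immigrants.length, immigrants.length, ...immigrants);
  return merged;
}
```

In the real system each elite would be serialized to an ArrayBuffer and sent with `dataChannel.send(buffer)`; the signaling relay only brokers the initial handshake and never sees genome data.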
via WebGPU Compute Shaders
Fusing sequential fitness evaluations into single GPU dispatches eliminates per-step kernel launch overhead. Proven across CUDA, WebGPU, JAX/XLA, and Triton on two hardware platforms.
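The pattern itself is simple: move the generation loop from the host into the shader. A schematic WGSL kernel illustrating the idea (a sketch, not the project's actual shader):

```typescript
// Unfused host loop: one dispatch + one await per generation.
//   for (let g = 0; g < 1500; g++) { pass.dispatchWorkgroups(...); /* wait */ }
//
// Fused: the loop lives inside the shader, so the host dispatches once.
const fusedKernelWGSL = /* wgsl */ `
@group(0) @binding(0) var<storage, read_write> population: array<f32>;

@compute @workgroup_size(64)
fn main(@builtin(global_invocation_id) gid: vec3<u32>) {
  for (var gen = 0u; gen < 1500u; gen++) {
    // evaluate fitness, select, mutate — all on-GPU, no host round-trip
  }
}`;

console.log(fusedKernelWGSL.includes("for (var gen"));  // true
```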
via WebGPU Compute Shaders
Browser LLM engines dispatch 1,024 separate GPU kernels per generation. We fuse everything into one dispatch. Single-threaded: 6.6-13.5×. Parallel kernel (64 threads + shared memory): 66-458×. Beats PyTorch MPS by 7.5-161× at all tested sizes up to D=256. 16,410 tok/s at D=32.
Since publishing, 592 people have run the benchmarks on their own devices.
Mobile transformer runs: 36 confirmed across iOS Safari and Android Chrome. Peak: 213,000 tokens/sec on a phone. Average: 15,000 tokens/sec.
Browser coverage: Chrome (347), Firefox (69), Safari (62). macOS, Windows, Linux, Android, iOS. No installation on any of them.
30 seconds. No installation. Your result joins the live dataset above.
We don't cherry-pick results. Every benchmark run from every device is published in a searchable, sortable, downloadable dataset. GPU name, score, browser, OS, timestamp — all of it. No data is hidden. Verify any claim yourself.
Browse all 592 results →
One import. One dispatch. All tokens, all layers, all operations fused into a single GPU kernel.
npm install @wgpu-fusion/core
// 3 lines to benchmark your GPU
import { FusedTransformer } from '@wgpu-fusion/core'
const model = await FusedTransformer.create({ dModel: 128, nHeads: 2, nLayers: 4 })
const stats = await model.benchmark({ runs: 10 })
TypeScript. f32 and f16 precision. Int4 quantization. Single-thread and parallel (64-thread shared memory) modes. Works in Chrome, Firefox, Safari — any WebGPU-capable browser.
The single-kernel fusion pattern generalizes. Two projects apply it to different domains — LLM inference and radiobiology Monte Carlo — and one runs the companion benchmark suite.
“How fast is your GPU in the browser?” Open WebGPU compute benchmarks on 592 real devices — Rastrigin, N-body, Monte Carlo Pi, RL environments.
Benchmark your GPU →
LLM inference
Phi-3-mini (3.8B) in the browser via 10 hand-written kernel roles across 27 WGSL files, replacing the 85 TVM-autotuned shaders WebLLM needs. ~40 tok/s on M2 Pro, 22% behind WebLLM.
Run it live →
Radiobiology
Geant4-DNA electron track-structure Monte Carlo in the browser. One fused dispatch, one thread per primary, full history inline. Live radiolysis chemistry + DNA damage scoring.
See the simulation →