GPU frameworks waste 92% or more of their time.
I fixed it.

Kernel fusion eliminates per-dispatch overhead by packing an entire computation into a single GPU dispatch. 592 real-world devices tested — up to 2,865× on Apple Silicon, 623× on phones. Zero installation. Any browser.

The insight

How frameworks work

dispatch step 1 → wait
dispatch step 2 → wait

... × 1,500 steps = 22,500 round-trips

92%+ of time = waiting, not computing

Kernel fusion

dispatch once

→ GPU loops internally

1,500 steps in 1 round-trip

~100% of time = computing
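The arithmetic behind those two cards can be sketched with a simple timing model. The per-dispatch and per-step costs below are illustrative assumptions, not measurements; only the 1,500-step count comes from the figures above.

```typescript
// Illustrative timing model (costs are assumptions, not measurements).
// Unfused: every step pays a CPU→GPU round-trip before its tiny compute.
// Fused: one round-trip, then the GPU loops over all steps internally.
const steps = 1_500;
const roundTripMs = 0.2;  // assumed per-dispatch overhead
const computeMs = 0.01;   // assumed GPU work per step

const unfusedMs = steps * (roundTripMs + computeMs); // pay overhead every step
const fusedMs = roundTripMs + steps * computeMs;     // pay overhead once

const overheadShare = (steps * roundTripMs) / unfusedMs;
const speedup = unfusedMs / fusedMs;

console.log(`overhead share: ${(overheadShare * 100).toFixed(1)}%`);
console.log(`speedup from fusion: ${speedup.toFixed(1)}×`);
```

Even with these modest assumed numbers, overhead dominates the unfused loop and fusion recovers an order-of-magnitude speedup; the measured ratios above are far larger because real dispatch overhead dwarfs real per-step compute.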

Live Demos
Research
New

Browser-to-Browser Distributed Evolution

WebRTC P2P Genome Exchange for Island-Model Optimization

+11.5%
fitness improvement (4 islands, p=0.015)
+14.6%
cross-platform (RTX 3090, N=5)
0
install required

Browser tabs form evolutionary islands, exchanging elite genomes directly via WebRTC data channels. A 113-line signaling relay brokers the handshake; all genome data flows peer-to-peer. Private rooms by default. Validated on Rastrigin (N=30), across Apple Metal and NVIDIA Vulkan.
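The migration step at the heart of the island model can be sketched in a few lines. This is an in-memory illustration only: the real system moves genomes over WebRTC data channels, and the type names, ring topology, and replace-worst policy here are assumptions for the sketch.

```typescript
// In-memory sketch of island-model elite migration.
// (The real system exchanges these genomes over WebRTC data channels.)
type Genome = { genes: number[]; fitness: number };

// Lower fitness is better (e.g. Rastrigin minimization).
const elite = (island: Genome[]): Genome =>
  island.reduce((a, b) => (a.fitness <= b.fitness ? a : b));

// Each island sends its elite to the next island in a ring,
// where it replaces that island's worst genome.
function migrate(islands: Genome[][]): void {
  const elites = islands.map(elite); // snapshot before any replacement
  islands.forEach((island, i) => {
    const incoming = elites[(i - 1 + islands.length) % islands.length];
    const worst = island.reduce((a, b) => (a.fitness >= b.fitness ? a : b));
    island[island.indexOf(worst)] = { ...incoming };
  });
}
```

In the browser setting, each `island` lives in its own tab and the `incoming` genome arrives as a data-channel message rather than an array lookup.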

Published

Single-Kernel Fusion for Sequential Fitness Evaluation

via WebGPU Compute Shaders

720×
CUDA over PyTorch (same T4)
159×
WebGPU over PyTorch (same M2)
4
GPU APIs confirmed

Fusing sequential fitness evaluations into single GPU dispatches eliminates per-step kernel launch overhead. Proven across CUDA, WebGPU, JAX/XLA, and Triton on two hardware platforms.
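A CPU sketch makes the fusion pattern concrete: every sequential evaluation step runs inside one function call, the way the fused kernel runs them inside one dispatch. Rastrigin matches the validation function named above; the finite-difference descent loop and its parameters are illustrative assumptions, not the paper's algorithm.

```typescript
// Rastrigin: the benchmark function validated above (global minimum 0 at origin).
const rastrigin = (x: number[]): number =>
  10 * x.length +
  x.reduce((s, xi) => s + xi * xi - 10 * Math.cos(2 * Math.PI * xi), 0);

// "Fused" shape: one entry point loops over every step internally,
// instead of returning to the host after each fitness evaluation.
function fusedEvolve(x: number[], steps: number, lr = 0.001): number {
  for (let t = 0; t < steps; t++) {
    // finite-difference descent step, evaluated entirely in-loop
    const g = x.map((xi, i) => {
      const xp = [...x];
      xp[i] = xi + 1e-4;
      return (rastrigin(xp) - rastrigin(x)) / 1e-4;
    });
    x = x.map((xi, i) => xi - lr * g[i]);
  }
  return rastrigin(x); // only the final fitness crosses back to the host
}
```

In the unfused version, each `rastrigin` call would be its own kernel launch; fused, the host sees a single dispatch no matter how many steps run.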

Published

Single-Kernel Fusion for Autoregressive Transformer Decoding

via WebGPU Compute Shaders

458×
parallel kernel vs unfused (D=256)
161×
over PyTorch MPS (D=32)
16K
tokens/sec in the browser

Browser LLM engines dispatch 1,024 separate GPU kernels per generation. We fuse everything into one dispatch. Single-threaded: 6.6-13.5×. Parallel kernel (64 threads + shared memory): 66-458×. Beats PyTorch MPS by 7.5-161× at all tested sizes up to D=256. 16,410 tok/s at D=32.
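The dispatch-count gap falls out of simple counting. The per-token kernel breakdown below is an illustrative assumption (not the measured engines' exact launch pattern); it is chosen only to show how a four-digit dispatch count arises per generation.

```typescript
// Counting GPU dispatches for autoregressive decoding.
// kernelsPerToken is an assumed breakdown (attention, MLP, norms, logits, ...).
const tokens = 128;
const kernelsPerToken = 8;

const unfusedDispatches = tokens * kernelsPerToken; // one launch per op per token
const fusedDispatches = 1; // every token, layer, and op inside one kernel

console.log(unfusedDispatches, fusedDispatches);
```

Each of those unfused launches pays the round-trip overhead from the diagram above, which is why the fused kernel's advantage grows with sequence length.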

Real World Results
592 Runs · 8 GPU Vendors

Real-world speedups are larger than the paper reports

Since publishing, 592 people have run the benchmarks on their own devices.

2,865×
Apple Silicon avg
peak 79,021×
623×
Qualcomm Adreno avg
peak 13,541×
79×
NVIDIA avg
peak 402×
56×
ARM Mali avg
peak 120×

Mobile transformer runs: 36 confirmed across iOS Safari and Android Chrome. Peak: 213,000 tokens/sec on a phone. Average: 15,000 tokens/sec.

Browser coverage: Chrome (347), Firefox (69), Safari (62). macOS, Windows, Linux, Android, iOS. No installation on any of them.

🚀

Why not run it on your device?

30 seconds. No installation. Your result joins the live dataset above.

Every result is public

We don't cherry-pick results. Every benchmark run from every device is published in a searchable, sortable, downloadable dataset. GPU name, score, browser, OS, timestamp — all of it. No data is hidden. Verify any claim yourself.

Browse all 592 results →
SDK
npm package

@wgpu-fusion/core

One import. One dispatch. All tokens, all layers, all operations fused into a single GPU kernel.

npm install @wgpu-fusion/core

// 3 lines to benchmark your GPU
import { FusedTransformer } from '@wgpu-fusion/core'
const model = await FusedTransformer.create({ dModel: 128, nHeads: 2, nLayers: 4 })
const stats = await model.benchmark({ runs: 10 })

2,865×
Apple Silicon avg
623×
Android phones avg
0
install required

TypeScript. f32 and f16 precision. Int4 quantization. Single-thread and parallel (64-thread shared memory) modes. Works in Chrome, Firefox, Safari — any WebGPU-capable browser.

Applied

The single-kernel fusion pattern generalizes. Two projects apply it to different domains — LLM inference and radiobiology Monte Carlo — and one runs the companion benchmark suite.

Ahmet Baris Gunaydin

Independent Researcher