Real-ESRGAN in the browser at 50ms/frame
NVC ships with two SR runners:
- CLI — nvc decode --enhancer realesrgan shells out to realesrgan-ncnn-vulkan with the bundled realesr-animevideov3 model (~1.2 MB).
- Browser — used to run MOD0 TinySR (~5 KB, 4500 parameters) on WebGPU.
That's a 240× model-size gap. You felt it: a frame decoded in the CLI looked noticeably better than the same frame in the browser. The neural-augmented codec was running its full pipeline only on one of two surfaces.
I closed that gap with ONNX Runtime Web + WebGPU.
The export
ml/export_realesrgan_onnx.py:
- Downloads upstream realesr-animevideov3.pth (2.5 MB, x4-trained weights from the xinntao/Real-ESRGAN releases).
- Reconstructs the SRVGGNetCompact architecture in PyTorch (3 → 64 → 16× conv-PReLU → 48 → pixel_shuffle ×4).
- Loads the weights, exports to ONNX with dynamic spatial axes, opset 17.
- Re-packs sidecar weights inline so the file is self-contained (the new dynamo exporter externalizes by default — undesirable for browser shipping).
- Runs a torch-vs-onnxruntime sanity check on a fixed input — max abs diff was 0.000014, well within float32 noise.
Output: ml/exports/nvc-realesrgan-anime-x4.onnx, 2.5 MB single file.
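A quick way to confirm the export behaves in the browser is a shape smoke test that mirrors the Python-side sanity check. This is a minimal sketch, assuming the input is named lr_rgb (the name the runner below feeds) and reading the output name from the session rather than hard-coding it; the URL is just the export path and will differ depending on how the file is served.

```ts
// Sketch: load the self-contained export and check the x4 output shape on a
// tiny input, exercising the dynamic spatial axes.
import * as ort from "onnxruntime-web";

async function smokeTestExport(url: string): Promise<void> {
  const session = await ort.InferenceSession.create(url);

  // Tiny 8x8 RGB input; dynamic axes mean any H/W should work.
  const h = 8, w = 8;
  const lr = new ort.Tensor("float32", new Float32Array(3 * h * w), [1, 3, h, w]);

  const outputs = await session.run({ lr_rgb: lr });
  const sr = outputs[session.outputNames[0]];

  // x4 model: expect [1, 3, 4h, 4w].
  console.assert(sr.dims[2] === 4 * h && sr.dims[3] === 4 * w, "expected 4x upscale");
}

// Path from the export step; the actual serving URL will differ.
void smokeTestExport("ml/exports/nvc-realesrgan-anime-x4.onnx");
```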
The runtime
web/src/nvc.ts lazy-loads onnxruntime-web only when the user enters Neural mode (saves the ~10 MB JS+wasm cost for users who never trigger SR). The runner:
```ts
const session = await ort.InferenceSession.create(REALESRGAN_ONNX_URL, {
  executionProviders: ["webgpu", "wasm"], // try WebGPU, fall back to WASM
});
```
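The lazy load itself is just a dynamic import gated behind Neural mode. A rough sketch (the function name is illustrative, not the actual nvc.ts API):

```ts
// onnxruntime-web is only fetched the first time the user enters Neural mode;
// the session promise is created once and reused afterwards.
let sessionPromise: Promise<import("onnxruntime-web").InferenceSession> | null = null;

function getRealesrganSession() {
  if (!sessionPromise) {
    sessionPromise = import("onnxruntime-web").then((ort) =>
      // Same constant as in the snippet above.
      ort.InferenceSession.create(REALESRGAN_ONNX_URL, {
        executionProviders: ["webgpu", "wasm"],
      })
    );
  }
  return sessionPromise;
}
```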
Frame pipeline:
- Decode VP9 base → RGB at base resolution (e.g. 540p for W1).
- Convert RGB → planar float32 NCHW.
- session.run({lr_rgb: tensor}) → 4× upscaled frame.
- Bicubic-downsample the 4× output to source resolution via OffscreenCanvas (matches the CLI's -vf scale=...:flags=lanczos).
- Display.
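A minimal sketch of steps 2–3, assuming a [0, 1] normalization and a helper name that are mine, not the actual nvc.ts code:

```ts
// Interleaved RGBA ImageData -> planar float32 NCHW, then one run() call.
import * as ort from "onnxruntime-web";

function toNchwTensor(frame: ImageData): ort.Tensor {
  const { width, height, data } = frame; // data is interleaved RGBA, 0-255
  const plane = width * height;
  const planar = new Float32Array(3 * plane);
  for (let i = 0; i < plane; i++) {
    planar[i]             = data[i * 4]     / 255; // R plane
    planar[i + plane]     = data[i * 4 + 1] / 255; // G plane
    planar[i + 2 * plane] = data[i * 4 + 2] / 255; // B plane
  }
  return new ort.Tensor("float32", planar, [1, 3, height, width]);
}

async function upscaleFrame(session: ort.InferenceSession, frame: ImageData) {
  const outputs = await session.run({ lr_rgb: toNchwTensor(frame) });
  return outputs[session.outputNames[0]]; // shape [1, 3, 4*height, 4*width]
}
```

Doing this conversion on the CPU per frame is acceptable because Neural mode is reserved for paused-frame inspection, not real-time playback (see the performance section below).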
Two CLI-parity bugs I fixed along the way
When I first wired this up, I had two subtle bugs that were hurting quality vs the CLI:
1. Pre-shrinking the input. I had a halveBaseFrame step that downsampled 540p to 270p before feeding the x4 model. Reasoning at the time: "make the x4 weights effectively x2." Side effect: it threw away half the input information, so Real-ESRGAN never got to use it. Fix: run the model at full input resolution (540p → 2160p) and bicubic-downsample the output.
2. Codec-base luma blending. The legacy applyReconstructionTuning blended 72% of the base luma into the SR output. Made sense for the 5K-param TinySR (which can't recover luma well on its own); actively hurt Real-ESRGAN (which can). Fix: skip applyReconstructionTuning on the ORT path. CLI's --enhancer realesrgan doesn't blend FET1/COL1/GRN1 either.
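The second fix boils down to a branch like this. A sketch only: the enhancer naming and the applyReconstructionTuning signature are assumptions.

```ts
// Assumed signature of the legacy blend step, for illustration only.
declare function applyReconstructionTuning(sr: ImageData, base: ImageData): ImageData;

function finishFrame(sr: ImageData, base: ImageData, enhancer: "tinysr" | "realesrgan"): ImageData {
  if (enhancer === "tinysr") {
    // TinySR can't recover luma well on its own, so keep the legacy blend.
    return applyReconstructionTuning(sr, base);
  }
  // Real-ESRGAN output is used as-is, matching the CLI's --enhancer realesrgan.
  return sr;
}
```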
After both fixes, browser Neural-mode output on a still frame is visually indistinguishable from the CLI's nvc decode --enhancer realesrgan output.
Performance on Apple M5
- WebGPU EP: 50–100 ms/frame at 540p input → 1080p output. The WebGPU compute shaders are generated at session creation.
- WASM EP fallback: ~500 ms/frame. Still usable for static-frame inspection.
For playback, the player pre-decodes all VP9 frames into an ImageData cache (so Codec mode plays at native fps), and Neural mode is reserved for paused-frame inspection. A small LRU caches recent Neural outputs for instant re-seek.
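The LRU itself is tiny. A sketch of the idea, with a capacity and key scheme that are assumptions rather than the actual nvc.ts values:

```ts
// Map preserves insertion order, so re-inserting on hit keeps the first key
// as the least recently used entry.
class FrameLru {
  private cache = new Map<number, ImageData>(); // frame index -> Neural output
  constructor(private capacity = 8) {}

  get(frameIndex: number): ImageData | undefined {
    const hit = this.cache.get(frameIndex);
    if (hit !== undefined) {
      // Refresh recency by moving the entry to the end of the Map.
      this.cache.delete(frameIndex);
      this.cache.set(frameIndex, hit);
    }
    return hit;
  }

  set(frameIndex: number, frame: ImageData): void {
    this.cache.delete(frameIndex);
    this.cache.set(frameIndex, frame);
    if (this.cache.size > this.capacity) {
      // Evict the least recently used entry (first key in insertion order).
      this.cache.delete(this.cache.keys().next().value as number);
    }
  }
}
```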
Why this matters
The browser experience now matches the CLI experience. The .nvc file you encode and post on the web doesn't degrade when someone watches it in NVC Studio vs running the CLI offline — both routes use the same model class with the same input resolution. That parity is what makes the codec actually shippable for web video.