Built this week in chitra-video-editor: Browser-native AI, SAM2 Rotos, and Chat-Driven Editing

Video editing is historically a heavy-client game. You either suffer through the latency of a cloud-based timeline or you download a multi-gigabyte binary that eats your RAM. With chitra-video-editor, I'm betting on a third path: a browser-native editor that leverages local power where it counts, backed by a thin, high-performance Rust sidecar for the heavy lifting.

This week was about closing the loop on the most ambitious part of the stack: Segment Anything Model 2 (SAM2) integration for rotoscoping, hardening the Chat-to-Edit pipeline, and ensuring that our Edit Action Language (EAL) remains atomic and reproducible.

The Architecture: Why Rust + Axum + TypeScript?

I recently made a pivot in the architecture. I dropped the Tauri wrapper in favor of a clean Web/Axum split. Why? Because the browser's sandbox is actually an asset for media playback, and the overhead of managing a custom webview window wasn't paying for itself.

The backend is a high-concurrency Rust server using Tokio and Axum. It handles the "heavy" tasks:

WhisperX Transcriptions: Using Silero VAD (Voice Activity Detection) to prevent the hallucinations common in quiet segments.
EfficientTAM Sidecar: A Python-based sidecar for SAM2/EfficientTAM inference that talks to our Rust backend via an /api/segment endpoint.
FFmpeg Orchestration: Handling the final export parity when we need to move from the browser's canvas preview to a hard-rendered file.
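To make the browser-to-sidecar hop concrete, here's a rough sketch of how the frontend might package a click prompt for /api/segment. Only the endpoint path comes from the actual project. The payload fields (clipId, frameIndex, points) and the buildSegmentRequest helper are assumptions for illustration, not the real wire format.

```typescript
// Hypothetical request shape for POST /api/segment.
// Every field name here is an assumption made for this sketch.
interface SegmentPrompt {
  clipId: string;
  frameIndex: number; // frame the user clicked on
  points: { x: number; y: number; positive: boolean }[]; // clicks in pixel coordinates
}

// Minimal structural type so the sketch does not depend on DOM typings.
interface JsonPostInit {
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Validate the prompt and serialize it as a JSON POST body.
function buildSegmentRequest(prompt: SegmentPrompt): JsonPostInit {
  if (prompt.points.length === 0) {
    throw new Error("at least one click point is required");
  }
  return {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(prompt),
  };
}
```

In a browser, the result could be passed straight to fetch("/api/segment", ...).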

SAM2 Phase 1: The Rotoscoping Pipeline

Rotoscoping is the bane of every editor's existence. Traditionally, it involves frame-by-frame masking. With SAM2, we can click once and track through time. Implementing this in a browser-native environment required a multi-slice approach.

Slice 1 & 2: The Data Model and IndexedDB

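We can't keep every mask in memory. Uncompressed at one byte per pixel, a single 1080p mask is about 2 MB, so 30fps footage produces roughly 62 MB of mask data per second, which is too heavy for the JavaScript heap once the clip gets long. I implemented MASK_STORE using IndexedDB for persistence.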
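To put numbers on the memory problem, here's a small sketch that estimates raw mask size and shows one way a binary mask could be shrunk before it is written to storage. The run-length encoding is my assumption for illustration; the helper names (rawMaskBytes, rleEncode, rleDecode) are made up, and nothing here claims to be the format MASK_STORE actually uses.

```typescript
// Bytes needed for uncompressed masks at one byte per pixel.
function rawMaskBytes(width: number, height: number, fps: number, seconds: number): number {
  return width * height * Math.round(fps * seconds);
}

// Run-length encode a binary mask as alternating run lengths.
// The first run always counts zeros (and may be 0 if the mask starts with a 1).
function rleEncode(mask: Uint8Array): number[] {
  const runs: number[] = [];
  let current = 0;
  let length = 0;
  for (let i = 0; i < mask.length; i++) {
    const bit = mask[i] ? 1 : 0;
    if (bit === current) {
      length++;
    } else {
      runs.push(length);
      current = bit;
      length = 1;
    }
  }
  runs.push(length);
  return runs;
}

// Inverse of rleEncode: rebuild a mask of `size` pixels from the run lengths.
function rleDecode(runs: number[], size: number): Uint8Array {
  const out = new Uint8Array(size); // starts as all zeros
  let pos = 0;
  let value = 0;
  for (const run of runs) {
    if (value === 1) out.fill(1, pos, pos + run);
    pos += run;
    value = 1 - value;
  }
  return out;
}
```

For scale, rawMaskBytes(1920, 1080, 30, 60) comes to 3,732,480,000 bytes, so one minute of raw 1080p masks is already around 3.7 GB.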

In the core logic, I introduced TimelineClip.mask and a new EAL opcode. The Edit Action Language (EAL) is the heart of Chitra. It’s a domain-specific language that describes every edit as an atomic operation.

// Example of an EAL opcode for masking
const applyMaskOp = {
  op: "APPLY_MASK",
  clipId: "v1_002",
  maskDataId: "uuid-indexeddb-key", // key of the mask data in IndexedDB
  effects: ["blur-bg", "spotlight"], // mask-driven effects to apply
};

By promoting the mask to an atomic opcode, we ensure that the AI chat agent can manipulate masks just as easily as it can cut a clip.
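Here's a minimal sketch of what "atomic opcode" can mean in practice: each op is a plain data record, and applying it returns a new timeline instead of mutating the old one. Only APPLY_MASK and its fields come from the example above. The CUT_CLIP opcode, the start/end fields on TimelineClip, and the applyOp function are assumptions I'm adding for illustration.

```typescript
type MaskEffect = "spotlight" | "cutout" | "blur-bg";

// Hypothetical slice of the EAL; CUT_CLIP is an assumed opcode.
type EalOp =
  | { op: "APPLY_MASK"; clipId: string; maskDataId: string; effects: MaskEffect[] }
  | { op: "CUT_CLIP"; clipId: string; atSeconds: number };

interface TimelineClip {
  id: string;
  start: number; // seconds (assumed field)
  end: number; // seconds (assumed field)
  mask?: { maskDataId: string; effects: MaskEffect[] };
}

// Apply one op and return a new clip list; the input list is left untouched.
function applyOp(clips: TimelineClip[], op: EalOp): TimelineClip[] {
  const target = clips.find((c) => c.id === op.clipId);
  if (!target) throw new Error(`unknown clip: ${op.clipId}`);

  if (op.op === "APPLY_MASK") {
    const mask = { maskDataId: op.maskDataId, effects: [...op.effects] };
    return clips.map((c) => (c === target ? { ...c, mask } : c));
  }

  // Remaining case: CUT_CLIP splits the target into two halves.
  if (op.atSeconds <= target.start || op.atSeconds >= target.end) {
    throw new Error("cut point is outside the clip");
  }
  const result: TimelineClip[] = [];
  for (const c of clips) {
    if (c === target) {
      result.push({ ...c, id: `${c.id}_a`, end: op.atSeconds });
      result.push({ ...c, id: `${c.id}_b`, start: op.atSeconds });
    } else {
      result.push(c);
    }
  }
  return result;
}
```

Because a mask is just another op in the union, a chat agent that can emit a cut can emit a mask the same way.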

Slice 3: The Render Loop

Rendering masks in real time on a <canvas> while holding 60fps is a challenge. I implemented three primary mask-driven effects this week:

Spotlight: Darkening everything outside the mask.
Cutout: Removing the background entirely.
Blur-BG: Applying a Gaussian blur to the non-masked area.

To achieve parity between the browser preview and the FFmpeg export, I had to write custom filter strings for FFmpeg that mirror the CSS/Canvas composite operations.
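As a reference point for what the preview and the export both have to agree on, here's a per-pixel sketch of Spotlight and Cutout on a raw RGBA buffer. This is not the actual render loop, which goes through canvas compositing. The applyMaskEffect name, the dim parameter, and the one-byte-per-pixel mask layout are assumptions, and Blur-BG is left out because it needs a real blur kernel.

```typescript
type PixelEffect = "spotlight" | "cutout";

// rgba holds width*height*4 bytes; mask holds width*height bytes (non-zero = subject).
// Returns a new buffer and leaves the input untouched.
function applyMaskEffect(
  rgba: Uint8ClampedArray,
  mask: Uint8Array,
  effect: PixelEffect,
  dim: number = 0.3, // brightness multiplier outside the mask (assumed default)
): Uint8ClampedArray {
  if (rgba.length !== mask.length * 4) {
    throw new Error("rgba and mask sizes do not match");
  }
  const out = new Uint8ClampedArray(rgba); // copies the pixel data
  for (let p = 0; p < mask.length; p++) {
    if (mask[p]) continue; // inside the subject: keep the pixel as-is
    const i = p * 4;
    if (effect === "spotlight") {
      // Darken the colour channels, keep alpha.
      out[i] = out[i] * dim;
      out[i + 1] = out[i + 1] * dim;
      out[i + 2] = out[i + 2] * dim;
    } else {
      // Cutout: make the background fully transparent.
      out[i + 3] = 0;
    }
  }
  return out;
}
```

In a browser the rgba buffer would typically be the data array of an ImageData object read from the canvas.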

Making "Cut the Dead Air" Actually Work

One of the most requested features in AI video tools is removing filler words and silence. It sounds easy: just look at the waveform, right? Wrong. If you cut purely on decibel levels, you get clipped words and jarring transitions, and you often cut off the start of the next sentence.

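I hardened the subtitle pipeline using a "half-open boundary" approach: every segment is treated as [start, end), so the end of one segment and the start of the next never claim the same instant. We use Silero VAD to get precise voice activity segments, then we cross-reference those with WhisperX word-level timestamps.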

When the user types "remove the dead air" in the chat, the AI doesn't just guess. It triggers a tool call that looks at the VAD gaps.
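Here's a small sketch of the gap-finding step behind that tool call, under some assumptions: speech segments and words are half-open [start, end) spans in seconds, gaps are shrunk by a safety pad on each side, and any gap that still overlaps a word is dropped. The function names and the minGap and pad defaults are mine, not the project's.

```typescript
interface Span {
  start: number; // seconds, inclusive
  end: number; // seconds, exclusive
}

// Gaps between consecutive VAD speech segments. Each gap is shrunk by `pad`
// on both sides and kept only if what remains is at least `minGap` long.
function findDeadAir(speech: Span[], minGap: number = 0.5, pad: number = 0.1): Span[] {
  const sorted = [...speech].sort((a, b) => a.start - b.start);
  const gaps: Span[] = [];
  if (sorted.length === 0) return gaps;
  let cursor = sorted[0].end; // end of the speech seen so far
  for (let i = 1; i < sorted.length; i++) {
    const seg = sorted[i];
    const gap: Span = { start: cursor + pad, end: seg.start - pad };
    if (gap.end - gap.start >= minGap) gaps.push(gap);
    cursor = Math.max(cursor, seg.end);
  }
  return gaps;
}

// Cross-check against word timestamps: drop any gap that overlaps a word.
// With half-open spans, a word starting exactly at gap.end does not overlap.
function keepWordFreeGaps(gaps: Span[], words: Span[]): Span[] {
  return gaps.filter((g) => !words.some((w) => w.start < g.end && w.end > g.start));
}
```

The half-open convention is what lets a cut end exactly where a word begins without eating its first frame.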

Optimized Chat Prompt Caching

AI editing can feel slow if the LLM has to re-read the entire project state every time. I implemented a stable-prefix-first context layout. By keeping the project schema and the most recent transcriptions at the top of the prompt, ahead of anything that changes from turn to turn, we maximize prompt-prefix (KV-cache) hits at the LLM provider, reducing the "time to first edit" significantly.
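To show why ordering matters, here's a sketch of a stable-prefix-first layout: the parts that rarely change go first, the parts that change every turn go last, and two consecutive prompts then share a long identical prefix. The ChatContext fields and both function names are assumptions for this example, and how a provider actually caches prefixes varies.

```typescript
// Hypothetical context pieces, ordered from most stable to most volatile.
interface ChatContext {
  systemRules: string; // rarely changes
  projectSchema: string; // changes when the project structure changes
  transcript: string; // changes on re-transcription
  recentEdits: string[]; // grows every turn
  userMessage: string; // new every turn
}

// Lay out prompt segments with the stable material first.
function buildPrompt(ctx: ChatContext): string[] {
  return [
    ctx.systemRules,
    ctx.projectSchema,
    ctx.transcript,
    ...ctx.recentEdits,
    ctx.userMessage,
  ];
}

// Number of leading segments two prompts have in common,
// a rough proxy for how much a prefix cache could reuse.
function sharedPrefixLength(a: string[], b: string[]): number {
  const limit = Math.min(a.length, b.length);
  let n = 0;
  while (n < limit && a[n] === b[n]) n++;
  return n;
}
```

If the user message were placed first instead, the shared prefix between any two turns would drop to zero segments.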

The Subtitle Pipeline: Beyond Text on Screen

Subtitles in Chitra aren't just an afterthought; they are a first-class citizen in the timeline. This week’s work focused on:

Dense Word Cues: Mapping every single word to a timeline position for "karaoke-style" highlighting.
Bulk Shift: If you move a clip, the subtitles must move with it. This is handled by the EAL indexing layer.
Typography: Added rotation, overlap prevention, and a slate-palette UI for the inspector.
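Here's a rough sketch of the first two items: shifting every cue that belongs to a moved clip, and looking up which word to highlight at the playhead. The WordCue shape and both function names are assumptions; the real version goes through the EAL indexing layer, which this does not model.

```typescript
interface WordCue {
  clipId: string;
  text: string;
  start: number; // seconds, inclusive
  end: number; // seconds, exclusive
}

// Bulk shift: move every cue of one clip by `delta` seconds, leave the rest alone.
function shiftCues(cues: WordCue[], clipId: string, delta: number): WordCue[] {
  return cues.map((c) =>
    c.clipId === clipId ? { ...c, start: c.start + delta, end: c.end + delta } : c,
  );
}

// Karaoke lookup: binary search for the cue active at time t.
// Assumes cues are sorted by start and do not overlap. Returns -1 if none.
function activeCueIndex(cues: WordCue[], t: number): number {
  let lo = 0;
  let hi = cues.length - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    const cue = cues[mid];
    if (t < cue.start) {
      hi = mid - 1;
    } else if (t >= cue.end) {
      lo = mid + 1;
    } else {
      return mid;
    }
  }
  return -1;
}
```

The binary search keeps the per-frame lookup cheap even when a long clip has thousands of word cues.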

Engineering Insights: The Complexity of "Fidelity"

One of the biggest hurdles was Playback Fidelity. The browser's <video> element is notoriously difficult to seek frame-accurately. To solve this, I’m using a viewport virtualization strategy. We only render what is visible on the timeline, and we pre-fetch frames into a buffer when the playhead is approaching a clip boundary.
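Here's a simplified sketch of the two decisions described above: which clips intersect the visible timeline window, and which clips start soon enough that their frames should be buffered. The ClipSpan type, the function names, and the lookahead parameter are assumptions; the real buffer management is more involved.

```typescript
interface ClipSpan {
  id: string;
  start: number; // seconds on the timeline
  end: number;
}

// Virtualization: only clips that overlap the visible window get rendered.
function visibleClips(clips: ClipSpan[], viewStart: number, viewEnd: number): ClipSpan[] {
  return clips.filter((c) => c.start < viewEnd && c.end > viewStart);
}

// Prefetch: clips whose start boundary falls within `lookahead` seconds
// ahead of the playhead.
function clipsToPrefetch(clips: ClipSpan[], playhead: number, lookahead: number): ClipSpan[] {
  return clips.filter((c) => c.start > playhead && c.start <= playhead + lookahead);
}
```

Both checks are linear scans here; with many clips, a sorted index would be the obvious next step.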

We also had to deal with "cached edits dropping tool calls." When the AI suggests a series of edits (e.g., "Zoom in on the speaker and add a caption"), those are often multiple tool calls. If the cache was stale, it would execute the zoom but lose the caption. I refactored the apply_eal pipeline to be strictly sequential and transactional. If one part of the AI's suggestion fails, the whole block rolls back.
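Here's a minimal sketch of the sequential, all-or-nothing behaviour, assuming each op is applied by a pure function that returns a new state rather than mutating the old one. The applyBatch name and the BatchResult type are mine; this is not the actual apply_eal code.

```typescript
type BatchResult<S> =
  | { ok: true; state: S }
  | { ok: false; state: S; failedAt: number; error: string };

// Apply ops strictly in order. If any op throws, discard all progress and
// return the original state, so the caller never sees a half-applied batch.
// Relies on applyOne not mutating the state it is given.
function applyBatch<S, Op>(
  state: S,
  ops: Op[],
  applyOne: (s: S, op: Op) => S,
): BatchResult<S> {
  let working = state;
  for (let i = 0; i < ops.length; i++) {
    try {
      working = applyOne(working, ops[i]);
    } catch (e) {
      const error = e instanceof Error ? e.message : String(e);
      return { ok: false, state, failedAt: i, error }; // rollback
    }
  }
  return { ok: true, state: working };
}
```

With this shape, "zoom then caption" either lands as both edits or as neither, and failedAt tells the chat agent which tool call to retry.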

The Professional UI Pass

I’m a firm believer that internal tools shouldn't look like internal tools. I moved the UI to a high-density "Slate" palette.

Minimalist Sliders: One-per-row inspector sliders for position, scale, and rotation.
Vertical Centering: Dropped the old position slider in favor of a vertically centered thumb on the track, which feels much more like a professional NLE (Non-Linear Editor).
Gradient Clips: Visual cues on the timeline to show where AI effects are applied.

What's Next?

With SAM2 Phase 1 complete, the focus shifts to Phase 2: Interactive Refinement. Currently, the tracking is a "fire and forget" process. I want users to be able to correct the mask at specific keyframes and have SAM2 propagate those corrections forward and backward.

We also have a 7-minute AI-edit harness in the works. This will allow the editor to ingest long-form footage and automatically generate a high-quality "rough cut" based on a text prompt, complete with b-roll suggestions and automated zoom-ins on active speakers.

Building in public is about showing the messy parts: the 41 commits of fighting with FFmpeg filter strings and IndexedDB race conditions. But seeing a SAM2 mask follow a subject perfectly in a Chrome tab makes the struggle worth it.

Check out the progress on GitHub. No stars yet, just pure engineering.
