May 18, 2026 · 6 min read · assembly, systemsprogramming, lowlevel, backend

Marg: Why I'm Writing a Backend Server in x86_64 Assembly

Building Marg, a high-performance backend server using hand-tuned x86_64 assembly for the core engine, featuring HTTP/3 support and an embedded JavaScriptCore runtime for developer ergonomics.

Built this week in Marg: Hand-tuned x86_64 Assembly and the Quest for Zero Overhead

Most modern backend infrastructure is built on layers of abstraction that we’ve collectively agreed are "fast enough." We use Go for its concurrency primitives, Rust for its safety, or C++ when we really want to squeeze the hardware. But even in C++, you are at the mercy of the compiler's register allocation strategy and the overhead of the standard library.

This week, I started Marg. The goal is simple but technically masochistic: build a production-grade backend server where the hot paths are written in hand-tuned x86_64 assembly. We’re talking HTTP/1.1, HTTP/2, and HTTP/3 (over QUIC), TLS 1.3, and an embedded JavaScriptCore engine to make it actually usable for people who don't want to write syscalls for breakfast.

The Philosophy of Marg

Why do this? It's not just an exercise in vanity. Modern CPUs are incredibly complex beasts with branch predictors, out-of-order execution, and deep pipelines. When you write high-level code, you give up precise control over which instructions run, how the code is laid out in the instruction cache, and how data moves through the cache hierarchy. In Marg, the assembly isn't just a "plugin"; it is the engine.

By writing the core event loop and protocol parsers in assembly, I can keep the cycle count per request as low as I know how to make it: no unnecessary stack frames, no hidden heap allocations, and explicit control over which values live in registers such as RAX, RCX, and RDX during the most critical paths of the network stack.

The Architecture: Assembly Core, JavaScript Surface

Writing a full business logic layer in assembly is a recipe for security vulnerabilities and developer burnout. That’s why Marg is designed with a hybrid architecture:

  1. The Assembly Engine: Handles socket I/O, epoll/io_uring management, buffer pooling, and the heavy lifting of protocol parsing (HTTP/TLS).
  2. The JavaScriptCore Bridge: I’m embedding WebKit’s JavaScriptCore (JSC). This allows developers to write their actual API logic in JavaScript.
  3. The Easy Wrapper: A high-level JS API that masks the underlying assembly complexity, giving you the performance of raw metal with the ergonomics of Express.js.
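To make the "buffer pooling" item above concrete, here is a minimal reference model of a fixed-size pool with a free list, written in JavaScript for readability. It is only a sketch: the `BufferPool` class, its method names, and the slot sizes are hypothetical, and the real engine would presumably do the equivalent in assembly over memory it maps itself.

```javascript
// Hypothetical fixed-size buffer pool; names and sizes are illustrative only.
class BufferPool {
  constructor(slotSize, slotCount) {
    this.slotSize = slotSize;
    // One contiguous allocation up front; acquire() only creates a
    // lightweight view over it.
    this.backing = new ArrayBuffer(slotSize * slotCount);
    // Stack of free slot indices, so acquire and release are O(1).
    this.free = [];
    for (let i = slotCount - 1; i >= 0; i--) this.free.push(i);
  }

  // Returns { slot, view }, or null when the pool is exhausted.
  acquire() {
    if (this.free.length === 0) return null;
    const slot = this.free.pop();
    const view = new Uint8Array(this.backing, slot * this.slotSize, this.slotSize);
    return { slot, view };
  }

  // Hands a slot back to the pool for reuse.
  release(slot) {
    this.free.push(slot);
  }
}

// Usage: two 4 KiB slots, one per in-flight connection.
const demoPool = new BufferPool(4096, 2);
const first = demoPool.acquire();
const second = demoPool.acquire();
const third = demoPool.acquire();   // null: the pool is exhausted
demoPool.release(first.slot);       // the slot can now be handed out again
```

Returning null on exhaustion, rather than growing the pool, keeps memory use bounded and forces the caller to decide how to apply backpressure.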

This Week's Progress: Realities of QUIC and TLS

I’ve spent the last few days laying out the engineering plan and the roadmap. The initial goal was to hand-roll everything in assembly from day one, including the crypto. However, the complexity of the QUIC state machine and TLS 1.3 handshakes is non-trivial.

In the latest commit, I made a strategic pivot: Use C TLS/QUIC libraries for version 1.0; defer hand-tuned asm crypto/QUIC to the final part.

This is a pragmatic move. To get a working server that people can actually test, I need to lean on an existing, audited implementation such as BoringSSL for the cryptographic primitives. Once the architectural "plumbing" (the assembly event loop and JSC integration) is stable, I will go back and replace the C shims with hand-optimized assembly routines for specific AEAD ciphers (like AES-GCM or ChaCha20-Poly1305) using VAES and AVX-512 instructions.

The Assembly Event Loop (Conceptual)

In a typical server, the overhead of moving data from the kernel to the application and then into a high-level language's runtime is significant. In Marg, I'm aiming for a zero-copy approach. Here is a simplified sketch of what the core event loop might look like:

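section .bss
events: resb 12 * 64         ; 64 x struct epoll_event (12 bytes each, packed on x86_64)

section .text
global _start
extern handle_request_asm, handle_error   ; defined elsewhere

_start:
    ; ... socket/bind/listen/epoll_create1 setup elided; assumes an rbp
    ;     frame with the epoll fd stored at [rbp - 8] ...

; The core loop for handling incoming connections
server_loop:
    ; syscall: epoll_wait(epfd, events, maxevents, timeout)
    mov eax, 232             ; sys_epoll_wait
    mov rdi, [rbp - 8]       ; epoll file descriptor
    lea rsi, [rel events]    ; pointer to the events array
    mov edx, 64              ; max events
    mov r10, -1              ; timeout: block indefinitely
    syscall                  ; clobbers rcx and r11

    test rax, rax            ; negative = -errno, zero = no events
    js handle_error
    jz server_loop

    ; Iterate through events using callee-saved registers
    mov r12, rax             ; number of events returned
    lea rbx, [rel events]    ; current event
.process_events:
    ; ... logic to determine if it's a new connection or data ...
    ; If data: call the JSC bridge or the HTTP parser
    mov rdi, rbx             ; event pointer as first argument (assumed convention)
    call handle_request_asm
    add rbx, 12              ; advance to the next epoll_event
    dec r12
    jnz .process_events      ; dec/jnz instead of the slow `loop` instruction
    jmp server_loop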
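The zero-copy goal mentioned above is easier to see in a small reference model than in assembly. The sketch below parses an HTTP/1.1 request line in place and returns byte offsets into the receive buffer instead of copying substrings out of it. The names `parseRequestLine` and `rangeText` are hypothetical, and this is an illustration of the idea, not the parser Marg will ship.

```javascript
// Hypothetical reference model of an offset-based HTTP/1.1 request-line parser.
// It scans bytes in place and returns [start, end) ranges instead of copies.
const SP = 0x20, CR = 0x0d, LF = 0x0a;

function parseRequestLine(buf) {
  // The request line must be terminated by CRLF.
  const lineEnd = buf.indexOf(CR);
  if (lineEnd === -1 || buf[lineEnd + 1] !== LF) return null;
  // First space: end of the method token.
  const methodEnd = buf.indexOf(SP);
  if (methodEnd <= 0 || methodEnd >= lineEnd) return null;
  // Second space: end of the request target; the version must be non-empty.
  const targetEnd = buf.indexOf(SP, methodEnd + 1);
  if (targetEnd <= methodEnd + 1 || targetEnd + 1 >= lineEnd) return null;
  return {
    method: [0, methodEnd],
    target: [methodEnd + 1, targetEnd],
    version: [targetEnd + 1, lineEnd],
    next: lineEnd + 2, // offset where the header fields begin
  };
}

// Decodes a range only when a caller actually needs the text.
// subarray() creates a view over the same memory, not a copy.
function rangeText(buf, [start, end]) {
  return new TextDecoder().decode(buf.subarray(start, end));
}

// Usage
const wire = new TextEncoder().encode("GET /health HTTP/1.1\r\nHost: example\r\n\r\n");
const requestLine = parseRequestLine(wire);
console.log(rangeText(wire, requestLine.method), rangeText(wire, requestLine.target));
```

Handing offsets across the bridge means the JavaScript side only pays for string creation on the fields a handler actually reads.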

The JavaScriptCore Integration

Embedding JSC is a deliberate choice over V8. JSC is often more lightweight and has a fantastic JIT (the FTL - Faster Than Light JIT). By bridging the assembly-parsed HTTP headers directly into the JSC heap, we can minimize the "bridge tax" usually associated with native-to-JS calls.

Imagine a JS handler in Marg looking something like this:

// Simple wrapper around the assembly core
Marg.on('request', (req, res) => {
    res.status(200).send({ message: "Hello from the metal!" });
});

Behind the scenes, the res.send call triggers a transition back into the assembly layer, which constructs the HTTP response frame and invokes the write syscall with minimal overhead.
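As a rough sketch of that flow, the wrapper below records the status code, serializes the body into a complete HTTP/1.1 response, and hands the bytes to a callback that stands in for the transition into the assembly layer. Everything here is an assumption made for illustration: `createMarg`, `dispatch`, `buildResponse`, and the `nativeWrite` callback are hypothetical names, and the JSON-only body handling is a simplification.

```javascript
// Hypothetical sketch: none of these names come from Marg's actual code.
const REASONS = { 200: "OK", 404: "Not Found", 500: "Internal Server Error" };

// Serializes a JSON body into a complete HTTP/1.1 response.
function buildResponse(status, body) {
  const encoder = new TextEncoder();
  const payload = encoder.encode(JSON.stringify(body));
  const head = encoder.encode(
    `HTTP/1.1 ${status} ${REASONS[status] || "Unknown"}\r\n` +
    "Content-Type: application/json\r\n" +
    `Content-Length: ${payload.length}\r\n` + // length in bytes, not characters
    "\r\n"
  );
  const out = new Uint8Array(head.length + payload.length);
  out.set(head, 0);
  out.set(payload, head.length);
  return out;
}

// nativeWrite stands in for the call back into the assembly layer.
function createMarg(nativeWrite) {
  const handlers = {};
  return {
    on(event, fn) { handlers[event] = fn; },
    // The engine would call this once a request has been parsed.
    dispatch(req) {
      let status = 200;
      const res = {
        status(code) { status = code; return res; },
        send(body) { nativeWrite(buildResponse(status, body)); },
      };
      if (handlers.request) handlers.request(req, res);
    },
  };
}

// Usage: capture the bytes that would be passed to the write syscall.
const written = [];
const Marg = createMarg((bytes) => written.push(bytes));
Marg.on('request', (req, res) => {
  res.status(200).send({ message: "Hello from the metal!" });
});
Marg.dispatch({ method: "GET", path: "/" });
console.log(new TextDecoder().decode(written[0]));
```

Building the whole response into one contiguous buffer means the native side can issue a single write for headers and body together.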

Detailed Engineering Plan

As outlined in the project's new roadmap, the development is split into four distinct phases:

Phase 1: The Foundation (Current)

Phase 2: HTTP/1.1 and 2

Phase 3: The JavaScript Surface

Phase 4: The "Pure Metal" Optimization

Why x86_64 specifically?

While ARM is dominant in mobile and gaining ground in the cloud (Graviton), x86_64 remains the king of the high-performance server market. The instruction set is deep, complex, and provides incredible opportunities for optimization if you know where to look. Extensions like AVX-512 and BMI2 allow for cryptographic and data-processing throughput that is hard to reach through compiler auto-vectorization alone, and RDRAND provides hardware-generated random numbers.

Challenges Ahead

The biggest challenge isn't just writing the assembly—it's maintaining it. Debugging a segfault in an assembly-based event loop that is also hosting a JavaScript VM is a nightmare. This is why the engineering plan emphasizes a modular approach. The assembly will be heavily documented, and I'll be using unit tests in C to verify the correctness of each assembly routine before it’s integrated into the main engine.

Another hurdle is the QUIC protocol itself. Unlike TCP, which is implemented in the kernel, QUIC runs over UDP and is typically implemented in userspace. This means the server itself is responsible for congestion control, packet loss detection, and retransmission. Doing this in assembly is a massive undertaking, which is why I'm starting with a C library for the 1.0 release.
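To give a sense of what "responsible for congestion control" means in practice, here is a heavily simplified NewReno-style controller, loosely following the pseudocode in RFC 9002. It is a sketch under stated assumptions: the class and method names are hypothetical, the 1200-byte datagram size is assumed, and it leaves out persistent congestion, pacing, and application-limited checks. Whatever C library Marg uses for 1.0 will have its own, more complete implementation.

```javascript
// Simplified NewReno-style controller, loosely after the pseudocode in RFC 9002.
// Omits persistent congestion, pacing, and app-limited checks.
const MAX_DATAGRAM = 1200; // assumed maximum datagram size in bytes

class CongestionController {
  constructor() {
    // Initial window: min(10 * MDS, max(2 * MDS, 14720)).
    this.cwnd = Math.min(10 * MAX_DATAGRAM, Math.max(2 * MAX_DATAGRAM, 14720));
    this.ssthresh = Infinity;
    this.recoveryStart = -Infinity; // time the current recovery period began
  }

  // Packets sent before recovery began must not change the window again.
  inRecovery(sentTime) {
    return sentTime <= this.recoveryStart;
  }

  onAck(sentTime, ackedBytes) {
    if (this.inRecovery(sentTime)) return;
    if (this.cwnd < this.ssthresh) {
      this.cwnd += ackedBytes; // slow start
    } else {
      // Congestion avoidance: roughly one datagram per round trip.
      this.cwnd += Math.floor((MAX_DATAGRAM * ackedBytes) / this.cwnd);
    }
  }

  onLoss(sentTime, now) {
    if (this.inRecovery(sentTime)) return; // at most one reduction per recovery period
    this.recoveryStart = now;
    this.ssthresh = Math.floor(this.cwnd / 2);
    this.cwnd = Math.max(this.ssthresh, 2 * MAX_DATAGRAM);
  }
}

// Usage: times are arbitrary monotonic ticks.
const cc = new CongestionController();
cc.onAck(1, 1200);   // slow start grows the window by the acked bytes
cc.onLoss(2, 10);    // a loss halves the window and starts recovery
console.log(cc.cwnd, cc.ssthresh);
```

Even this toy version carries state that must stay consistent across every packet event, which is part of why hand-writing it in assembly is deferred.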

Conclusion

Marg is currently in its infancy. With 0 stars and 4 commits, it's a project driven by a desire to see how close we can get to the theoretical limits of hardware. It’s about rejecting the bloat of modern web stacks and seeing if we can build something that is not just fast, but fundamentally efficient.

If you're interested in low-level systems programming, assembly optimization, or the internals of web protocols, keep an eye on the repository. The roadmap is set, the initial commits are in, and the real work of hand-tuning the backend has begun.

Stay tuned for next week's update, where I'll dive deeper into the assembly HTTP/1.1 parser and how I'm handling memory mapping for the JavaScriptCore engine.

